Introduction

This is an R Markdown Notebook with exercises written for the workshop Validation – Experimental Design and Analysis Using STR-validator held at the 29th Congress of the International Society for Forensic Genetics (ISFG) in Washington, DC, Tuesday August 30th 2022 at 14:00 - 18:00.

These exercises are written using STR-validator version 2.4 running in 64-bit R version 4.2.1 on a Windows 10 operating system. Screenshots may vary for other systems, and the procedure may be different for other versions of STR-validator. The exercises have an easy difficulty level, and the data contain no quality issues or errors. Each exercise can be performed using the STR-validator graphical user interface or, as described in the section that follows it, directly in R from the command line using the strvalidator package. To aid the learning process, identical names are used for data sets in the STR-validator and command line exercises.

The following software is required:

The statistical software R: https://cloud.r-project.org/
The integrated development environment RStudio Desktop: https://www.rstudio.com/products/rstudio/download/#download
The R package strvalidator version 2.4.0:
1. Important! Version 2.4.0 is not yet on CRAN as of 30th of August. Ignore step a. and use the alternative installation method in step b. to install from GitHub instead of the usual method described here: Install the published version from The Comprehensive R Archive Network (CRAN) by typing this in the Console window of RStudio:
```
install.packages("strvalidator", dependencies = TRUE)    
```
1. Alternatively, install the latest developer version from GitHub. First you need to install the devtools package from CRAN. Type the commands in the Console window of RStudio and hit Enter after each command.
```
install.packages("devtools", dependencies = TRUE)
```
Still in the Console, load the devtools package and install strvalidator from GitHub:
```
library(devtools)
install_github("OskarHansson/strvalidator")
```
1. Confirm successful installation of version 2.4.0 by typing:
```
library(strvalidator)
```
```
## STR-validator 2.4.0 loaded!
```
1. To open the graphical user interface type (The window may open “behind” RStudio):
```
strvalidator()    
```

Installation instructions can also be found at the STR-validator webpage.
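
In addition to reading the startup message, the installed version can be checked programmatically. The following is a minimal sketch that uses only base R functions; the variable names `required_version` and `version_ok` are illustrative and not part of strvalidator.

```r
# Version these exercises were written for (taken from the text above).
required_version <- "2.4.0"

# requireNamespace() returns FALSE instead of raising an error
# when the package is not installed.
if (requireNamespace("strvalidator", quietly = TRUE)) {
  installed_version <- utils::packageVersion("strvalidator")
  version_ok <- installed_version >= required_version
  message("Installed strvalidator version: ", as.character(installed_version))
} else {
  version_ok <- FALSE
  message("strvalidator is not installed.")
}

if (!version_ok) {
  message("These exercises were written for version ", required_version, ".")
}
```

If `version_ok` is `FALSE`, repeat the installation steps above before continuing.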

Using the strvalidator package or its graphical user interface STR-validator:

There are multiple ways of setting up and running STR-validator. Either of the alternatives below can be used. The benefit of alternative a. is that you can also perform other tasks in R. RStudio is recommended for new users; however, plots and viewed tables will show up within the RStudio Plots and Viewer tabs, respectively. In alternative b. plots and tables will always show up in separate windows.

Through RStudio, RGui, or any other graphical user interface to R.
1. Open the R software of your choice by double-clicking its program icon.
2. Load the strvalidator R package by typing the following code in the console window. Press [Enter] to execute the command.
```
library(strvalidator)
```
1. To open the STR-validator graphical user interface type (the window may open behind RStudio):
```
strvalidator()    
```
The STR-validator graphical user interface can be set up to launch from an icon, as a stand alone software. The instructions below are for RGui. The procedure in RStudio is slightly different. However, it is always the RGui icon that is modified in the end.
1. Open up a new session of RGui.
2. Clear the workspace using the menu Misc/Remove all objects, or the R command:
```
rm(list=ls(all=TRUE))
```
1. Open a new script window using the menu File/New script.
2. In the new script window type or copy paste this code:
```
.First <- function(){
  library(strvalidator)
  strvalidator()
}
```
1. Load the function using the menu Edit/Run all, or [Ctrl]+[A] to mark the code and [Ctrl]+[R] to run it.
2. Save the workspace as .RData (i.e. just a “file extension”) in your Documents folder using the menu File/Save workspace….
3. Copy the RGui shortcut/icon to the desktop.
4. Right click the RGui shortcut located on the desktop and select Properties.
5. Select the Shortcut tab.
6. Completely clear the text field Start in:.
7. Locate the Target field containing the full path to the RGui executable file. It looks similar to this:
"C:\Program Files\R\R-4.2.1\bin\x64\Rgui.exe" --cd-to-userdocs
1. Change the Rgui.exe part of the path to Rterm.exe.
2. Click [OK] to exit and save.
3. Name the shortcut STR-validator.
4. To open STR-validator all you have to do is double click the new STR-validator icon. NB! Closing the R terminal window will also close STR-validator. Minimize it and use it for progress information and diagnostics. STR-validator will print useful messages to the terminal window.

Example Data

The authentic direct amplification data of FTA punches used for the exercises was kindly provided by Siv Gilfillan, Forensic Department, Oslo University Hospital. The DNA profiles have been anonymized by scrambling the alleles.

The input files can be downloaded from https://sites.google.com/site/forensicapps/strvalidator/2022-isfg.

Conventions

[Bold] indicates buttons in the STR-validator graphical user interface.

Italic indicates text or selectable options in the STR-validator graphical user interface.

Italic_and_bold indicates data sets within the STR-validator graphical user interface, the R global environment, or file or folder names within the operating system.

“Quotation marks” indicate text written in option fields or text that should be typed into option fields.

STR-validator refers to the graphical user interface of the strvalidator R package. STR-validator is opened by the command strvalidator().

1 Estimation of Analytical Thresholds

The analytical threshold should be based on signal-to-noise analyses of internally derived empirical data. An analytical threshold defines the minimum height requirement at and above which detected peaks can be reliably distinguished from background noise. Because the analytical threshold is based upon a distribution of noise values, it is expected that occasional, non-reproducible noise peaks may be detected above the analytical threshold. Usage of an exceedingly high analytical threshold increases the risk of allelic data loss. SWGDAM Interpretation Guidelines for Autosomal STR Typing by Forensic DNA Testing Laboratories (APPROVED 01/12/2017).

1.1 Experimental Design

To estimate analytical thresholds, non-template PCR controls, positive PCR controls, or samples with known profiles can be used. A higher noise level is expected with DNA-containing samples. It is important to mask the input data from noise directly related to the internal lane standard (ILS) in negative controls, or directly related to known alleles in positive controls and samples. This is done by excluding signals close to the ILS or in the stutter range from the calculation. Likewise, it is important to verify the assumption of normally distributed data under the method of choice. When the assumptions are met, a risk level can be assigned to different ATs so that an informed decision can be made about the AT to implement. To cover all capillaries of the instrument, a full plate can be run. In our experience, the level of noise is relatively stable across time.

The data set in this exercise is from 8 negative (non-template) PCR controls amplified using the SureID27 kit according to recommendations.

The results from the capillary electrophoresis were analysed in GeneMapper using a peak amplitude threshold (PAT) of 1 RFU. No stutter filter or global cut-off was applied.

During manual inspection of the EPGs it was noted that weak pull-up or bleed-through from the internal lane standard (ILS) was common in the yellow and purple dye channels of the negative controls (<10 data points from the corresponding ILS peak).
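
The masking idea described in this section can be sketched in base R. This is a simplified illustration and not the strvalidator implementation: the data, the column names, the ILS positions, and the inclusive range comparison are all assumptions made for the example. The upper height limit of 200 RFU is used only as an example value.

```r
# Hypothetical noise signals from one dye channel of a negative control.
signals <- data.frame(
  Data.Point = c(2980, 2995, 3004, 3200, 3495, 3600, 3800),
  Height     = c(   6,   35,   28,    9,   41,  350,    7)
)

# Hypothetical data points of ILS peaks in the same sample.
ils_points <- c(3000, 3500)

# Example mask settings.
range_ils  <- 10   # mask +/- 10 data points around each ILS peak
max_height <- 200  # mask peaks above 200 RFU

# TRUE if a signal lies within the range of any ILS peak.
near_ils <- vapply(
  signals$Data.Point,
  function(p) any(abs(p - ils_points) <= range_ils),
  logical(1)
)

# TRUE if a signal is too high to be considered noise.
too_high <- signals$Height > max_height

signals$Masked <- near_ils | too_high

# Only unmasked signals would contribute to the threshold estimate.
noise <- signals$Height[!signals$Masked]
print(signals)
print(noise)
```

In this toy example the three signals close to an ILS peak and the single high peak are masked, leaving three signals as noise.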

1.2 Learning Outcome

This exercise is estimated to take approximately 10 minutes and will teach:

Single file import with automatic trimming of ladders and conversion to the STR-validator format.
Saving the STR-validator project.
Estimation of analytical thresholds from negative controls, excluding evaluation of assumptions for the distribution of noise (Mönich et al. 2015).
Plotting of examples and saving images.

Additional Learning Resources

STR-validator exercise: For a more comprehensive exercise that additionally teaches the data manipulation tools of STR-validator, analysis of noise from positive controls, and verification of assumptions for the distribution of noise, see exercise analytical_thresholds in ghep-isfg2018_exercises_easy.zip available at the STR-validator GHEP-ISFG 2018 webpage.

STR-validator video tutorial: Estimation of analytical thresholds (2017).

Publications: Mönich et al. (2015), Bregu et al. (2013), and Rakay, Bregu, and Grgicak (2012)

1.3 Data analysis

The data is available as an exported GeneMapper SamplePlotSizingTable (refer to the guideline Estimate Analytical Thresholds Using STR-validator for instructions) named sureid_neg_SamplePlotSizingTable.txt

1.3.1 Analysis using STR-validator

Open the STR-validator graphical user interface.
Import the data set into STR-validator.
1. Select the Workspace tab and click the [Import] button.
2. Use the [Select file] button to locate the file in the input_files folder. Select the file sureid_neg_SamplePlotSizingTable.txt and click [OK].
3. Check the Auto trim samples option to remove the allelic ladders. Change the search string to “neg” and uncheck the Invert (remove matching samples) option.
4. The Auto slim repeated columns option can be unchecked.
5. The filename “sureid_neg” is suggested as Name in the Save options group.
6. Click the [Import] button.
  
  The import from files dialogue.
Save the project.
1. Select the Workspace tab and click the [Save As] button.
2. Browse to the desired folder where you want to save your project and click the [Select folder] button.
3. The Save as dialog opens. Give the project a suitable name and click [ok]. A confirmation dialog is shown to indicate successful saving of the project (the save process can take some time). Click [OK] to dismiss.
4. It is advised to regularly click the [Save] button. The save path is remembered within each project. If the project is moved to another location, use the [Save As] button to update the path.
Estimate analytical thresholds.
1. Select the AT tab and click the [Calculate] button, to Calculate analytical threshold (AT1, AT2, AT4, AT7).
2. Select the sureid_neg data set.
3. A reference data set is not needed for negative control samples.
4. The kit should be automatically detected as SureID27.
5. The Ignore case option can be left checked, and the Add word boundaries option can be left unchecked. These are name matching options for the reference data set.
6. Check the option to Mask high peaks and set the corresponding value Mask all peaks above (RFU) to “200” (a value well above any noise signal).
7. Uncheck the option to Mask sample alleles. This is used when a reference data set is provided.
8. Check the Mask ILS peaks option and set the corresponding value Range around known peak to “10” (the range should come from the manual inspection in GeneMapper to include pull-up peaks from the internal lane standard).
9. Leave the threshold values at their defaults. The analysis can be repeated using different values.
10. Manually inspect the masking by first clicking the [Prepare and mask] button. Then select a sample to plot in the drop-down list. In RStudio the plot appears in the Plots tab in the lower right corner. If Rgui is used the plot appears in its own window. Pay attention to masked areas and excluded peaks. Select a few other random samples to inspect. You should be confident that the mask settings exclude any pull-up and sporadic high peaks from the analysis.
11. Select a representative plot and click the [Save plot] button.
  1. Set the file extension to png image.
  2. Uncheck the Overwrite existing file and Load size from plot device options.
  3. Manually set the Width to “40” and Height to “20” cm. Leave the default “300” for Resolution and “1” for Scaling factor.
  4. Click [file] to locate a folder to save the image (e.g. output_files) and click [Select Folder]. Click the [Save] button to save the image.
  5. The settings for the Save as image dialogue will be remembered.
12. Accept the default names for the result files and click the [Calculate] button.
  
  The calculate analytical threshold dialogue.
Find the estimated ATs.

Note: Evaluation of the assumptions for the distribution of noise should be performed before selecting a method to estimate AT. In this example, method AT7 is preferred since the transformed data was approximately normally distributed.
1. Click the [View] button.
2. Select the sureid_neg_at data set.
3. Maximize the window and scroll to the right. The most interesting columns are Global.AT7, which is the AT7 method estimate using the noise signals from all samples, and the estimates per dye: Global.B.AT7, Global.G.AT7, Global.Y.AT7, and Global.R.AT7.
4. The result can be copied or exported to a text file or spreadsheet software.
  
  The estimated analytical thresholds.

1.3.2 Analysis using the strvalidator package

Open RStudio and a new R Script document.
Load the strvalidator package:
```
library(strvalidator)
```

Import the data set to a variable using the import function. The import.file parameter should be modified to point to the file to be imported.

```
sureid_neg <- import(import.file = params$path.to.sample.plot.sizing.table,
                  file.name = TRUE, time.stamp = TRUE,
                  auto.trim = TRUE, trim.samples = "neg", trim.invert = FALSE,
                  auto.slim = FALSE)
```
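
The effect of the trimming options can be illustrated with a small base R sketch. It is not the code used by strvalidator: the sample names are made up, and case-insensitive matching is an assumption made for the example.

```r
# Hypothetical sample names as they might appear in an exported table.
sample_names <- c("Ladder_A01", "NEG_01", "neg_02", "PC_01", "ladder_H12")

trim_samples <- "neg"   # search string
ignore_case  <- TRUE    # assumption: matching is case-insensitive
trim_invert  <- FALSE   # FALSE keeps matching samples, TRUE removes them

# Find sample names containing the search string.
matches <- grepl(trim_samples, sample_names, ignore.case = ignore_case)

# Keep matching samples, or the complement if inverted.
keep <- if (trim_invert) !matches else matches
kept_samples <- sample_names[keep]
print(kept_samples)
```

With these settings only the two negative controls are kept; setting `trim_invert` to `TRUE` would instead keep the ladders and the positive control.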

Estimate analytical thresholds using the calculateAT function:

```
sureid_neg_res <- calculateAT(data = sureid_neg, ref = NULL, mask.height = TRUE, height = 200,
                         mask.sample = FALSE, per.dye = TRUE, range.sample = 20,
                         mask.ils = TRUE, range.ils = 10, k = 3, rank.t = 0.99, alpha = 0.01,
                         ignore.case = TRUE, word = FALSE)
```
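
To give a feeling for what the parameters k, rank.t, and alpha control, the sketch below computes a few common threshold estimates from simulated noise in base R. These are simplified textbook forms, not the strvalidator implementation, and how they correspond to the AT1, AT2, AT4, and AT7 columns is an assumption; see Bregu et al. (2013) and Mönich et al. (2015) for the definitions. The simulated data and all variable names are hypothetical.

```r
set.seed(2022)  # reproducible simulated data

# Simulated, right-skewed noise heights (RFU); not real data.
noise <- rlnorm(2000, meanlog = log(8), sdlog = 0.4)

k      <- 3     # number of standard deviations
rank_t <- 0.99  # percentile rank
alpha  <- 0.01  # one-sided significance level

n <- length(noise)
m <- mean(noise)
s <- sd(noise)

# Mean plus k standard deviations.
at_k_sd <- m + k * s

# Percentile of the ranked noise signals.
at_rank <- quantile(noise, probs = rank_t, names = FALSE)

# Mean plus a one-sided critical t-value times the standard deviation.
at_t <- m + qt(1 - alpha, df = n - 1) * sqrt(1 + 1 / n) * s

# Same idea as the first estimate, but on log-transformed heights,
# then transformed back to RFU.
at_log <- exp(mean(log(noise)) + k * sd(log(noise)))

print(round(c(k_sd = at_k_sd, rank = at_rank, t = at_t, log = at_log), 1))
```

Because the simulated noise is skewed, the estimates differ noticeably, which illustrates why the distributional assumptions should be checked before a method is chosen.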

The result is a list of three data.frames namely: 1) the estimated analytical thresholds, 2) the ranked list of noise signals, and 3) the input data with masking information. To print the result we need to get the first data.frame from the result list. Then extract estimates for the method of choice, in this case AT7, and print the result.

```
# Extract the first data.frame from the list of results.
sureid_neg_at <- sureid_neg_res[[1]]

# Extract the first row in the data.frame.
sureid_neg_at <- sureid_neg_at[1, ]

# Extract all columns containing "AT7".
sureid_neg_at <- sureid_neg_at[, grepl(names(sureid_neg_at), pattern = "AT7")]

# Extract all columns containing "Global".
sureid_neg_at <- sureid_neg_at[, grepl(names(sureid_neg_at), pattern = "Global")]

# Show values rounded to one decimal.
print(round(sureid_neg_at, 1))
```
```
##   Global.AT7 Global.B.AT7 Global.G.AT7 Global.Y.AT7 Global.R.AT7
## 1         39         25.4         29.3         34.7         53.1
```

2 Estimation of Allele Sizing Precision

For verification of electrophoretic equipment it is recommended that “the precision of the instrument should be such that all measured alleles fall within a +/- 0.5 bp window around the measured size for the corresponding allele in the allelic ladder”. ENFSI Recommended Minimum Criteria for the Validation of Various Aspects of the DNA Profiling Process (ISSUE DATE: November 2010).

2.1 Experimental Design

Allele Sizing Precision can be estimated by running allelic ladders as samples. For example, in order to cover all capillaries of the instrument, run at least one full injection of allelic ladders. Repeat the experiment at a different time. Analyse all ladders together to estimate the precision of the instrument.

The data set in this exercise is from 4 allelic ladders from the SureID27 kit.

The results from the capillary electrophoresis were analysed in GeneMapper using the previously determined analytical threshold (AT).

2.2 Learning Outcome

This exercise is estimated to take approximately 15 minutes and will teach:

Single file import with automatic trimming of controls and automatic conversion to the STR-validator format.
Filtering of data by kit bins.
Analysis of allele sizing precision of the allelic ladder.
Calculation of summary statistics.
Calculating and adding the size difference to the result.
Plotting of allele sizing data.

Other Learning Resources

STR-validator video tutorial: Estimation of allele sizing precision (2017).

Publications: Ensenberger et al. (2016)

2.3 Data Analysis

The data is available as an exported GeneMapper Genotypes Table named sureid_ladder.txt

2.3.1 Analysis using STR-validator

Open the STR-validator graphical user interface.
Import the data set into STR-validator.
1. Select the Workspace tab and click the [Import] button.
2. Use the [Select file] button to locate the file in the input_files folder. Select the file sureid_ladder.txt and click [OK].
3. Check the Auto trim samples option to extract the allelic ladders. Change the search string to “ladder”. Uncheck the Invert option to extract sample names containing the string “ladder”.
4. Check the Auto slim repeated columns option.
5. The filename “sureid_ladder” is suggested as Name in the Save options group.
6. Click the [Import] button. Tip: the R terminal window shows the path to the imported file, which is especially useful when importing multiple files.
  
  The Import from file dialogue.
Filter the data
1. Select the Precision tab and click the [Filter] button.
2. Select the sample data set sureid_ladder. The kit SureID27 should be automatically detected.
3. Select the filter option Filter by kit bins (allelic ladder) to remove any peak other than what is defined in the kit bins. The Reference sample name matching option Ignore case can be left checked, while Exact matching and Add word boundaries can be unchecked.
4. The option to Exclude virtual bins¹ should be checked to retain peaks that correspond to physical fragments present in the ladder.
5. Uncheck any pre-processing and post-processing options.
6. Accept the default name for the result and click the [Filter] button.
  
  The Filter profile dialogue.
Calculate summary statistics
1. In the Precision tab, click the [Statistics] button to Calculate summary statistics for Size.
2. The sureid_ladder_filter (i.e. most recent) data set should be automatically selected.
3. The target column should be set to Size (repeat the process if you want to calculate for Height or Data.Point; there are also dedicated buttons that open Calculate summary statistics with pre-loaded options).
4. The Group by column(s) should be set to “Marker,Allele”. Note: if entered manually the column names must be separated by a comma, without spaces.
5. Leave the Count unique values in column drop down menu unselected. The Calculate quantile and Round to decimals can be left with the default values.
6. Accept the default name for the result and click the [Calculate] button.
  
  The Calculate summary statistics dialogue.
Calculate and add the absolute difference
1. Select the Tools tab and click the [Columns] button.
2. Select the sureid_ladder_filter_stats data set.
3. Select Size.Max as column 1 and Size.Min as column 2.
4. Type “Size.Diff” as Column for new values.
5. Select the Action “-” (subtract).
6. Remove the appended “_new” in the Name for result to overwrite the data set.
7. Click the [Execute] button. Click [Yes] on the question to overwrite.
  
  The Column actions dialogue.
Locate the worst sizing precision
1. Select the Workspace tab.
2. Select the sureid_ladder_filter_stats data set.
3. Click the [View] button.
4. In RStudio the data set is shown in the Viewer tab in the lower right pane. Click the button Show in new window (an arrow on a window icon) for a larger table (alternatively, use the Zoom icon).
5. Click the column header Size.Diff to sort in ascending order. Click again to sort in descending order. The minimum and maximum absolute difference respectively, can be read at the first row, and the corresponding marker in the Marker column. The sorted tables can be exported to different formats.
  
  The interactive table viewer.
Plot sizing precision
1. Select the Precision tab and click the [Plot] button.
2. Select the sureid_ladder_filter data set. Tip: if you forgot to filter the data set, and there are “OL” alleles in the data set, a warning will be shown.
3. Check the Plot by marker option.
4. The X-axis should be Allele.
5. Change the Plot theme to theme_bw(). Further customization can be done by expanding the Data points, Axes, and Override default x/y/facet labels option groups.
6. Click the [Size] button in the Plot precision data as dotplot group. If we have more data per allele, we could use a boxplot instead. In RStudio the plot is shown in the Plots tab located in the lower right pane. Click the Zoom button to view the plot in a larger window. It is possible to export the plot from RStudio. Alternatively, the plot can be exported from STR-validator as described below.
7. In the Plot precision window, click the [Save as image] button.
  1. Set the file extension to png image.
  2. Uncheck the Overwrite existing file and Load size from plot device options.
  3. Manually set the Width to “40” and Height to “20” cm. Leave the default “300” for Resolution and “1” for Scaling factor.
  4. Click [file] to locate a folder to save the image (e.g. output_files) and click [Select Folder]. Click the [Save] button to save the image.
  5. The settings for the Save as image dialogue will be remembered.
8. (Optional) In the Plot precision window, click the [Save as object] button to save the plot object in the workspace. The object can be viewed or manipulated at a later time.
9. Close the Plot precision window.
  
  The Plot precision dialogue.
  
  Example of a precision plot.

2.3.2 Analysis using the strvalidator package

Open RStudio and a new R Script document.
Load the strvalidator package:
```
library(strvalidator)
```

Import the data set to a variable using the import function. The import.file parameter should be modified to point to the file to be imported.

```
sureid_ladder <- import(import.file = params$path.to.precision.table,
                file.name = TRUE, time.stamp = TRUE,
                auto.trim = TRUE, trim.samples = "ladder", trim.invert = FALSE,
                auto.slim = TRUE)
```

Filter any additional peaks using the filterProfile function:

```
# Get markers, bins and flag for virtual bins.
ref <- getKit(kit = "SureID27", what = "VIRTUAL")

# Extract physical bins.
ref <- ref[ref$Virtual == 0, ]

# Filter using the bins as known profile.
sureid_ladder_filter <- filterProfile(data = sureid_ladder, ref = ref)
```
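
Conceptually, filtering by physical kit bins keeps only those peaks whose marker and allele designation match a non-virtual bin. The base R sketch below illustrates this idea on made-up data; it is not how filterProfile is implemented, and the marker names, sizes, and bin definitions are hypothetical.

```r
# Hypothetical observed ladder peaks (slim format).
peaks <- data.frame(
  Sample.Name = "Ladder_1",
  Marker      = c("D3S1358", "D3S1358", "D3S1358", "TH01", "TH01"),
  Allele      = c("14", "15", "OL", "9", "9.3"),
  Size        = c(118.2, 122.3, 124.9, 171.4, 174.5),
  stringsAsFactors = FALSE
)

# Hypothetical kit bins with a flag for virtual bins (1 = virtual).
bins <- data.frame(
  Marker  = c("D3S1358", "D3S1358", "TH01", "TH01"),
  Allele  = c("14", "15", "9", "9.3"),
  Virtual = c(0, 0, 0, 1),
  stringsAsFactors = FALSE
)

# Keep physical bins only.
physical <- bins[bins$Virtual == 0, ]

# Build marker/allele keys and keep peaks that match a physical bin.
peak_key <- paste(peaks$Marker, peaks$Allele, sep = ":")
bin_key  <- paste(physical$Marker, physical$Allele, sep = ":")
peaks_filtered <- peaks[peak_key %in% bin_key, ]
print(peaks_filtered)
```

The off-ladder ("OL") peak and the peak in the virtual bin are removed, leaving three peaks.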

Calculate the allele sizing precision using the calculateStatistics function:

```
sureid_ladder_filter_stats <- calculateStatistics(data = sureid_ladder_filter, target = "Size", quant = 0.5,
                                 group = c("Marker","Allele"), count = NULL, decimals = 4)
```

Calculate and add the absolute difference:

```
# Calculate and add the absolute difference to the result.
sureid_ladder_filter_stats$Size.Diff <- sureid_ladder_filter_stats$Size.Max - sureid_ladder_filter_stats$Size.Min
```
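
For readers who want to see what the grouped summary and the size difference amount to, here is a minimal base R sketch on made-up data. It does not use strvalidator, and comparing the size range against 0.5 bp is only one possible, simplified way to relate the result to the ENFSI recommendation; the data and the column name `Within.Window` are hypothetical.

```r
# Hypothetical sizes (bp) of two ladder alleles in four injections.
ladder <- data.frame(
  Marker = rep("TH01", 8),
  Allele = rep(c("6", "7"), each = 4),
  Size   = c(159.10, 159.14, 159.08, 159.12,
             163.05, 163.11, 163.02, 163.09),
  stringsAsFactors = FALSE
)

# Minimum and maximum size per marker and allele.
size_min <- aggregate(Size ~ Marker + Allele, data = ladder, FUN = min)
size_max <- aggregate(Size ~ Marker + Allele, data = ladder, FUN = max)

# Combine into one table and add the range (max - min).
stats <- merge(size_min, size_max, by = c("Marker", "Allele"),
               suffixes = c(".Min", ".Max"))
stats$Size.Diff <- stats$Size.Max - stats$Size.Min

# Flag alleles whose range stays within the window (assumed criterion).
window_bp <- 0.5
stats$Within.Window <- stats$Size.Diff <= window_bp
print(stats)
```

Both alleles in this toy data set vary by less than 0.1 bp across injections.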

View the result as interactive table (possible to sort and export) in RStudio Viewer tab:
```
# Load DT package (https://rstudio.github.io/DT/).
library(DT)

# Convert to a DT object.
DT.pr.stats <- datatable(sureid_ladder_filter_stats,
  rownames = FALSE, filter = "top", extensions = "Buttons",
  options = list(dom = "Bfrtip", buttons = c("copy", "csv", "excel", "pdf", "print"))) %>% formatRound(9, 2) # Round column 9 to 2 decimals.

# Show interactive table in RStudio Viewer tab.
DT.pr.stats
```
Alternatively, print manually sorted result in the RStudio Console tab:
```
# Load data.table package (https://rdatatable.gitlab.io/data.table/).
library(data.table)

# Convert to a data.table object.
DT.pr.stats <- data.table(sureid_ladder_filter_stats)

# Markers/alleles with the highest precision.
DT.pr.stats[order(Size.Diff, Marker, Allele)]
```
```
# Markers/alleles with the lowest precision.
DT.pr.stats[order(-Size.Diff, Marker, Allele)]
```
View the result as interactive plot in RStudio Viewer tab:
```
# Load the plotting package (https://plotly.com/r/getting-started/).
library(plotly)

# Convert to data.table
DT.pr <- data.table(sureid_ladder_filter)

# Calculate mean and deviation from mean, by marker and allele, and add the result to the data set.
DT.pr[, c("Mean", "Size.n") := list(mean(Size), .N), by =  .(Marker, Allele)]
DT.pr[, c("Deviation") := list(Mean - Size), by =  .(Marker, Allele)]

# Create and show interactive plot.
fig.pr <- plot_ly(data = DT.pr, x = ~Allele, y = ~Deviation, color = ~Marker, type = "scatter")
fig.pr
```

3 Estimation of Peak Balance

Lower template DNA may cause extreme heterozygote imbalance; as such, empirical heterozygote peak-height ratio data could be used to formulate mixture interpretation guidelines and determine the appropriate ratio by which two peaks are determined to be heterozygotes. SWGDAM Validation Guidelines for DNA Analysis Methods (Approved 12/05/2016).

The peak balance ratios of heterozygote alleles within a locus and of alleles between all loci should be >60% for good quality samples. ENFSI Recommended Minimum Criteria for the Validation of Various Aspects of the DNA Profiling Process (ISSUE DATE: November 2010).

Peak height imbalances may be seen in the typing results from, for example, a primer binding site variant that results in attenuated amplification of one allele of a heterozygous pair. Likewise, degraded, inhibited, and/or low level single-source DNA samples may exhibit poor peak height balance with heterozygous alleles. SWGDAM Interpretation Guidelines for Autosomal STR Typing by Forensic DNA Testing Laboratories (APPROVED 01/12/2017).
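
As a small worked example of heterozygote balance, the base R sketch below computes the ratio of the smaller to the larger peak for a few heterozygous markers and flags values below 60%. The peak heights, the column names, and the use of 60% as a cut-off are made up for illustration; this is not the strvalidator implementation.

```r
# Hypothetical peak heights (RFU) of heterozygous markers in one sample.
hb_data <- data.frame(
  Marker  = c("D3S1358", "TH01", "FGA"),
  Height1 = c(1200,  850, 400),   # lower molecular weight allele
  Height2 = c(1000, 1700, 380)    # higher molecular weight allele
)

# Hb defined as smaller peak / larger peak, so that 0 < Hb <= 1.
hb_data$Hb <- pmin(hb_data$Height1, hb_data$Height2) /
  pmax(hb_data$Height1, hb_data$Height2)

# Flag markers below the 60% guideline.
hb_data$Below.60 <- hb_data$Hb < 0.60
print(hb_data)
```

Here only the second marker, with peaks of 850 and 1700 RFU (Hb = 0.5), falls below the guideline.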

3.1 Experimental Design

The data set in this exercise comes from authentic FTA reference samples. 85 samples were amplified using the SureID 27 kit. A reference data set with known profiles is available.

The results from the capillary electrophoresis were analysed in GeneMapper using the previously determined analytical threshold (AT), without applied stutter filter and with no global cut-off.

3.2 Learning Outcome

This exercise is estimated to take approximately 15 minutes and will teach:

Single file import.
Analysis of heterozygote balance (intra-locus balance) and profile balance (inter-locus balance) from a data set when there is a reference data set available.
Calculation of summary statistics.
Plotting of heterozygote balance and profile balance.
Saving plot as image file.

Additional Learning Resources

STR-validator video tutorial: Analysis of balance (2017).

Publications: Bright, Turkington, and Buckleton (2010), Bright et al. (2014), Hansson, Egeland, and Gill (2017)

3.3 Data Analysis

The data is available as an exported GeneMapper Genotypes Table file named sureid_data.txt

The reference data set is in a file named sureid_ref.txt

3.3.1 Analysis using STR-validator

Open the STR-validator graphical user interface.
Import samples and references into STR-validator.
1. Select the Workspace tab and click the [Import] button.
2. Use the [Select file] button to locate the file in the input_files folder. Select the file sureid_data.txt and click [Open].
3. The exercise text file sureid_data.txt contains only the samples and is already in the slim STR-validator format, so the Auto trim samples and Auto slim options can be unchecked. Normally, these options are checked to remove control samples and ladders and to convert the file from the semi-wide GeneMapper Table format to the slim STR-validator format.
4. The filename “sureid_data” is suggested as Name for dataset in the Save as group.
5. Click the [Import] button. Tip: the R terminal window shows the path to the imported file, which is especially useful when importing multiple files.
6. Use the [Select file] button to locate the file in the input_files folder. Select the file sureid_ref.txt and click [Open].
7. The filename “sureid_ref” is suggested as Name for dataset in the Save as group.
8. Use the same settings as previously and click the [Import] button.
  
  The Import from files dialogue.
Calculate heterozygote peak balance (Hb).
1. Select the Balance tab. In the Heterozygote balance (intra-locus) function group, click the [Calculate] button.
2. Select the sureid_data data set.
3. Select the sureid_ref reference data set.
4. (Optional) Click the [Check subsetting] button to confirm that samples are matched with the correct reference. Tip: This is especially useful when one reference should match multiple samples, using sample name matching options.
5. Uncheck the pre-processing options Remove sex markers and Remove quality sensors. The SureID 27 kit does not contain Y-markers or quality sensors.
6. Select Smaller peak / larger peak in the drop-down menu Define Hb as.
7. Leave the Sample name matching options unchanged (the settings do not matter in this example).
8. (Optional) To speed up the analysis, the Post-processing option to Calculate average peak height can be unchecked if not needed. Tip: The calculated Proportion can be a useful quality control that all profiles are complete.
9. Accept the default name for the result and click the [Calculate] button.
  
  The Calculate heterozygote balance dialogue.
Plot the results.
1. Click the [Plot] button located in the Heterozygote balance (intra-locus) function group.
2. Select the sureid_data_hb data set.
3. Leave the Exclude sex markers option unchecked to keep Amelogenin, and the Log(balance) option unchecked to plot normal ratios.
4. Select the Do not facet or wrap option.
5. Click the [Hb vs. Marker] button. In RStudio the plot is shown in the Plots tab located in the lower right pane. Click the Zoom button to view the plot in a larger window. It is possible to export the plot from RStudio. Alternatively, the plot can be exported from STR-validator as described below. Tip: Inspect the plot to check that the result is reasonable and if there are unexpected outliers indicating errors or quality issues.
6. Click the [Save as image] button.
  1. Set the file extension to png image.
  2. Uncheck the Overwrite existing file and Load size from plot device options.
  3. Manually set the Width to “40” and Height to “20” cm. Leave the default “300” for Resolution and “1” for Scaling factor.
  4. Click [file] to locate a folder to save the image (e.g. output_files) and click [Select Folder]. Click the [Save] button to save the image.
  5. The settings for the Save as image dialogue will be remembered.
7. (Optional) In the Plot balance window, click the [Save as object] button to save the plot object in the workspace. The object can be viewed or manipulated at a later time.
8. Close the Plot balance window.
  
  The Plot balance dialogue.
  
  Example of a heterozygous balance plot.
Calculate summary statistics.
1. Click the [Statistics] Calculate summary statistics by marker button located in the Heterozygote balance (intra-locus) function group.
2. The options, including the sureid_data_hb data set (i.e. the most recent data set), should be pre-filled according to the screenshot.
3. Accept the default name for the result and click the [Calculate] button.
4. (Optional) Click the [View] button to view, copy, or export the result.
5. (Optional) Click the [Statistics] Calculate global summary statistics button.
  1. The sureid_data_hb data set must be manually selected (since it is no longer the most recent data set).
  2. Add “_global” to the suggested Name for result and click the [Calculate] button.
  3. Click the [View] button to view the minimum Hb for the data set and the overall percentile.
  The Calculate summary statistics dialogue.
Calculate the inter-locus (profile) balance.
1. Click the [Calculate] Calculate profile balance button located in the Profile balance (inter-locus) function group.
2. Select the sureid_data sample data set.
3. Select the sureid_ref reference data set. Tip: A reference data set is optional. Without one, the sum of all peak heights in each marker, including artefacts, is used to calculate the locus balance, which speeds up the analysis.
4. Check the pre-processing option Remove off-ladder alleles. Uncheck Remove sex markers (to keep Amelogenin) and uncheck Remove quality sensors.
5. Select to Calculate locus balance as Normalised and check the option to Calculate Lb by dye channel.
6. Leave the options for Reference sample name matching unchanged.
7. Check the post-processing option to Calculate average peak height (this option requires a reference data set).
8. Add “_norm_dye” to the suggested Name for result. Tip: A descriptive name is useful if you want to run the calculation with several different options.
9. Click the [Calculate] button.
  
  The Calculate locus balance dialogue.
Plot the results.
1. Click the [Plot] button located in the Profile balance (inter-locus) function group.
2. Select the sureid_data_lb_norm_dye data set.
3. Leave the Exclude sex markers option unchecked to keep Amelogenin, and the Log(balance) option unchecked to plot normal ratios.
4. Select the Do not facet or wrap option.
5. Click the [Lb vs. Marker] button. In RStudio the plot is shown in the Plots tab located in the lower right pane. Click the Zoom button to view the plot in a larger window. It is possible to export the plot from RStudio. Alternatively, the plot can be exported from STR-validator as described below. Tip: Inspect the plot to check that the result is reasonable and if there are unexpected outliers indicating errors or quality issues.
6. Click the [Save as image] button.
  1. Set the file extension to png image.
  2. Uncheck the Overwrite existing file and Load size from plot device options.
  3. Manually set the Width to “40” and Height to “20” cm. Leave the default “300” for Resolution and “1” for Scaling factor.
  4. Click [file] to locate a folder to save the image (e.g. output_files) and click [Select Folder]. Click the [Save] button to save the image.
  5. The settings for the Save as image dialogue will be remembered.
7. (Optional) In the Plot balance window, click the [Save as object] button to save the plot object in the workspace. The object can be viewed or manipulated at a later time.
8. Close the Plot balance window.
  
  The Plot balance dialogue.
  
  Example of a locus balance plot.
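
For readers who prefer to export from the command line, the image settings used above (40 × 20 cm at 300 dpi) can be approximated with the base R png device. This is a minimal sketch and not the code STR-validator runs internally; the toy data, the cm_to_px helper, and the file name are hypothetical.

```r
# Hypothetical helper: number of pixels for a length in cm at a given resolution (dpi).
cm_to_px <- function(cm, dpi = 300) {
  round(cm / 2.54 * dpi)
}

# Toy balance values standing in for a real result (hypothetical data).
toy <- data.frame(
  Marker = factor(c("M1", "M1", "M2", "M2", "M3", "M3")),
  Lb     = c(0.95, 0.88, 0.72, 0.80, 1.00, 0.91)
)

# Write to a temporary folder so the sketch does not touch the exercise files.
out_file <- file.path(tempdir(), "toy_balance.png")

# Open a PNG device with the same physical size and resolution as in the dialogue.
# The guard skips the export on systems where R was built without PNG support.
if (capabilities("png")) {
  png(filename = out_file, width = 40, height = 20, units = "cm", res = 300)
  boxplot(Lb ~ Marker, data = toy, xlab = "Marker", ylab = "Balance")
  dev.off()
}

# Pixel dimensions implied by the settings.
image_px <- c(width = cm_to_px(40), height = cm_to_px(20))
image_px
```

Specifying the size in centimetres together with a resolution keeps the physical size of the figure independent of the screen it was created on, which is the purpose of unchecking Load size from plot device.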

3.3.2 Analysis using the strvalidator package

Open RStudio and a new R Script document.
Load the strvalidator package:
```
library(strvalidator)
```

Import the data set and reference profiles to variables using the import function. The import.file parameter points to the file to be imported.

```
sureid_data <- import(import.file = params$path.to.data,
                  file.name = TRUE, time.stamp = TRUE,
                  auto.trim = FALSE, trim.samples = NULL, trim.invert = FALSE,
                  auto.slim = FALSE)

sureid_ref <- import(import.file = params$path.to.ref,
                  file.name = TRUE, time.stamp = TRUE,
                  auto.trim = FALSE, trim.samples = NULL, trim.invert = FALSE,
                  auto.slim = FALSE)
```

Calculate the heterozygote balance using the calculateHb function:

```
sureid_data_hb <- calculateHb(data = sureid_data, ref = sureid_ref, hb = 3,
                              kit = "SureID27", sex.rm = FALSE, qs.rm = FALSE,
                              ignore.case = TRUE, word = FALSE, exact = FALSE)
```
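
To make the hb = 3 setting concrete, the sketch below computes a heterozygote balance in base R, assuming (as described in the calculateHb help page) that hb = 3 denotes the smaller peak divided by the larger peak. The sample names, marker names, peak heights, and the hb_ratio helper are hypothetical, and the code is an illustration rather than the package implementation.

```r
# Toy heterozygous genotypes: two peak heights (RFU) per sample and marker.
# All sample names, marker names and heights are hypothetical.
peaks <- data.frame(
  Sample.Name = c("S1", "S1", "S1", "S1", "S2", "S2"),
  Marker      = c("M1", "M1", "M2", "M2", "M1", "M1"),
  Height      = c(1200, 1000, 800, 1000, 1500, 900)
)

# Smaller peak divided by larger peak, so the ratio is always between 0 and 1.
hb_ratio <- function(h) min(h) / max(h)

# One Hb value per sample and marker.
toy_hb <- aggregate(Height ~ Sample.Name + Marker, data = peaks, FUN = hb_ratio)
names(toy_hb)[names(toy_hb) == "Height"] <- "Hb"
toy_hb
```

With this definition a perfectly balanced genotype gives Hb = 1, and lower values indicate increasing imbalance regardless of which allele is the taller one.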

View the result as an interactive plot in the RStudio Viewer tab:
```
# Load the plotting package (https://plotly.com/r/getting-started/).
library(plotly)

# Sort by marker order defined in kit. Add factors according to kit information.
sureid_data_hb <- sortMarker(data = sureid_data_hb, kit = "SureID27", add.missing.levels = TRUE)

# Convert to data.table
DT.hb <- data.table::data.table(sureid_data_hb)

# Create interactive plot.
sureid_data_hb_plotly <- plotly::plot_ly(data = DT.hb, x = ~Marker, y = ~Hb, 
                                         color = ~Dye, type = "box", 
                                         colors = c("blue", "lightgreen", "goldenrod", "red"),
                                         text = ~paste("Sample: ", Sample.Name))

# Show interactive plot.
sureid_data_hb_plotly
```
Calculate summary statistics using the calculateStatistics function:
```
sureid_data_hb_stats <- calculateStatistics(data = sureid_data_hb, target = "Hb", quant = 0.05,
                                            group = c("Marker"), count = NULL, decimals = 4)

# Load DT package (https://rstudio.github.io/DT/).
library(DT)

# Convert to a datatable object.
DT.hb <- DT::datatable(sureid_data_hb_stats)

# Show interactive table.
DT.hb
```
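
The kind of per-marker summary requested above can be illustrated in base R: for each marker, the number of observations, the minimum, mean, and standard deviation, and the 5th percentile (quant = 0.05) of Hb. This is a sketch with hypothetical toy values; the columns returned by calculateStatistics may be named and computed differently.

```r
# Toy Hb values for two markers (hypothetical data).
toy_hb <- data.frame(
  Marker = rep(c("M1", "M2"), each = 5),
  Hb     = c(0.90, 0.85, 0.70, 0.95, 0.80,
             0.60, 0.75, 0.88, 0.92, 0.65)
)

# Summarise the values of one marker; quantile() uses R's default (type 7) method.
summarise_hb <- function(x, quant = 0.05) {
  c(n = length(x), min = min(x), mean = mean(x), sd = sd(x),
    perc = unname(quantile(x, probs = quant)))
}

# Split by marker, summarise each group, and arrange as one row per marker.
toy_stats <- t(sapply(split(toy_hb$Hb, toy_hb$Marker), summarise_hb))
round(toy_stats, 4)
```

A low percentile is a natural choice for Hb because only imbalance towards zero is of concern, whereas an upper percentile is more relevant for quantities such as stutter ratios.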

Calculate the profile balance using the calculateLb function:

```
sureid_data_lb <- calculateLb(data = sureid_data, ref = sureid_ref, 
                              option = "norm", by.dye = TRUE, 
                              ol.rm = TRUE, sex.rm = FALSE, qs.rm = FALSE, na = NULL, 
                              kit = "SureID27", 
                              ignore.case = TRUE, word = FALSE, exact = FALSE)
```
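
The idea of calculating the locus balance within each dye channel can be sketched in base R. The example assumes that the total peak height (TPH) per marker is the quantity being compared and shows two common scalings: a proportion of the dye channel total, and a normalisation by the largest marker total in the dye channel. Which scaling the "norm" option corresponds to should be checked in the calculateLb help page; the toy data and column names are hypothetical.

```r
# Toy total peak heights (TPH, RFU) for one sample: four markers in two dye channels.
tph <- data.frame(
  Marker = c("M1", "M2", "M3", "M4"),
  Dye    = c("B", "B", "G", "G"),
  TPH    = c(2000, 1000, 1500, 500)
)

# ave() applies a function within groups and returns a vector aligned to the rows.
dye_sum <- ave(tph$TPH, tph$Dye, FUN = sum)
dye_max <- ave(tph$TPH, tph$Dye, FUN = max)

tph$Lb.prop <- tph$TPH / dye_sum  # proportions sum to 1 within each dye
tph$Lb.norm <- tph$TPH / dye_max  # the largest marker in each dye equals 1
tph
```

Grouping by dye removes differences in overall signal between dye channels, so the result reflects the balance between markers that share a channel.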

View the result as an interactive plot in the RStudio Viewer tab:
```
# Sort by marker order defined in kit. Add factors according to kit information.
sureid_data_lb <- sortMarker(data = sureid_data_lb, kit = "SureID27", add.missing.levels = TRUE)

# Convert to data.table
dt.lb <- data.table::data.table(sureid_data_lb)

# Create interactive plot.
sureid_data_lb_plotly <- plotly::plot_ly(data = dt.lb, x = ~Marker, y = ~Lb, 
                                         color = ~Dye, type = "box", 
                                         colors = c("blue", "lightgreen", "goldenrod", "red"),
                                         text = ~paste("Sample: ", Sample.Name))

# Show interactive plot.
sureid_data_lb_plotly
```

4 Estimation of Stutter Ratios

The stutter ratio is the height of the stutter peak divided by the height of the corresponding (parent) allele peak. In general, a stutter peak must be lower than the percentage of the allele peak height indicated by the manufacturer of the kit to be ignored as a biological artefact of the sample (ENFSI Recommended Minimum Criteria for the Validation of Various Aspects of the DNA Profiling Process, issue date: November 2010). Based on internal verification of the kit, it may be necessary to adjust the stutter ratios provided by the manufacturer.
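
The definition can be written out as a small calculation. The sketch below uses one hypothetical parent allele with a back stutter and a forward stutter; the peak heights and the 10 % filter value are made up for illustration and do not come from any kit manual.

```r
# Toy example: parent allele 12 with one back stutter (11) and one forward stutter (13).
# Allele designations and peak heights (RFU) are hypothetical.
parent_height <- 2000
stutter <- data.frame(
  Type   = c(-1, 1),     # repeat units relative to the parent allele
  Allele = c(11, 13),
  Height = c(160, 20)
)

# Stutter ratio = stutter peak height / parent allele peak height.
stutter$Ratio <- stutter$Height / parent_height

# Compare against a hypothetical stutter filter of 10 % for this marker.
filter_threshold <- 0.10
stutter$Below.Filter <- stutter$Ratio < filter_threshold
stutter
```

In a validation, the observed ratios per marker and stutter type are compared with the manufacturer's values in this way to decide whether the filter needs adjusting.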

4.1 Experimental Design

The data set in this exercise comes from 85 authentic FTA reference samples amplified using the SureID 27 kit. A reference data set with known profiles is available.

The results from the capillary electrophoresis were analysed in GeneMapper using the previously determined analytical threshold, without applied stutter filter, and with no global cut-off.

4.2 Learning Outcome

This exercise is estimated to take approximately 10 minutes and will teach:

Single file import.
Analysis of stutter ratios from a data set when there is a reference data set available.
Plotting of stutter data.
Saving plots as images.
Calculating summary statistics.

Additional Learning Resources

STR-validator video tutorial: Analysis of stutter ratios (2017).

Publications: Brookes et al. (2012), Gibb et al. (2009), Klintschar and Wiegand (2003)

4.3 Data Analysis

The result is available as an exported GeneMapper Genotypes Table named sureid_data.txt

The reference data set is in a file named sureid_ref.txt

4.3.1 Analysis using STR-validator

Open STR-validator graphical user interface.
Import samples and references into STR-validator.
1. Select the Workspace tab and click the [Import] button.
2. Use the [Select file] button to locate the file in the input_files folder. Select the file sureid_data.txt and click [Open].
3. The exercise text file sureid_data.txt contains only the samples and is already in the slim STR-validator format, so the Auto trim samples and Auto slim options can be unchecked. Normally, these options are checked to remove control samples and ladders and to convert the file from the semi-wide GeneMapper Table format to the slim STR-validator format.
4. The filename “sureid_data” is suggested as Name for dataset in the Save as group.
5. Click the [Import] button. Tip: The R terminal window shows the path to the imported file, which is especially useful when importing multiple files.
6. Use the [Select file] button to locate the file in the input_files folder. Select the file sureid_ref.txt and click [Open].
7. The filename “sureid_ref” is suggested as Name for dataset in the Save as group.
8. Use the same settings as previously and click the [Import] button.
  
  The Import from files dialogue.
Calculate stutter ratios.
1. Select the Stutter tab and click the [Calculate] button.
2. Select the sureid_data data set.
3. Select the sureid_ref reference data set.
4. (Optional) Click the [Check subsetting] button to confirm that samples are matched with the correct reference. Tip: This is especially useful when one reference should match multiple samples, using sample name matching options.
5. Set the option to calculate stutter ratios within the range “2” backward stutters to “1” forward stutter.
6. Select the radio button No overlap between stutter and alleles.
7. (Optional) The Replace false stutters table can be customized to fix artefacts from the numeric allele minus stutter calculation, which does not take into account the number of base pairs in a repeat unit.
8. Accept the default name for the result and click the [Calculate] button.
  
  The Calculate stutter ratio dialogue.
Plot the results.
1. Click the [Plot] button.
2. Select the sureid_data_stutter data set.
3. Leave the Exclude sex markers option checked to exclude Amelogenin, and all other options unchecked.
4. Click the [Ratio vs. Allele] button. In RStudio the plot is shown in the Plots tab located in the lower right pane. Click the Zoom button to view the plot in a larger window. It is possible to export the plot from RStudio. Alternatively, the plot can be exported from STR-validator as described below. Tip: Inspect the plot to check that the result is reasonable and if there are unexpected outliers indicating errors or quality issues.
5. Click the [Save as image] button (the [Save as object] button is disabled due to a limitation in the plot function).
  1. Set the file extension to png image.
  2. Uncheck the Overwrite existing file and Load size from plot device options.
  3. Manually set the Width to “40” and Height to “20” cm. Leave the default “300” for Resolution and “1” for Scaling factor.
  4. Click [Open] to locate a folder to save the image (e.g. output_files) and click [Select Folder]. Click the [Save] button to save the image.
6. Close the Plot stutter ratios window.
  
  The Plot stutter ratios dialogue.
  
  Example of a stutter plot.
Calculate summary statistics.
1. Click the [Statistics] Calculate summary statistics by marker and stutter type button.
2. The options, including the sureid_data_stutter data set (i.e. the most recent data set), should be pre-filled according to the screenshot.
3. Click the [Calculate] button.
4. (Optional) Click the [View] button to view, copy, or export the result.
  
  The Calculate summary statistics dialogue.
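
The difference between the semi-wide GeneMapper table and the slim format mentioned in the import step can be sketched in base R. The column layout below is a hypothetical, simplified stand-in for an exported genotypes table, and the code is an illustration rather than the conversion used by STR-validator (which, for example, may treat empty positions differently).

```r
# Semi-wide: one row per sample and marker, alleles and heights in numbered columns.
wide <- data.frame(
  Sample.Name = c("S1", "S1"),
  Marker      = c("M1", "M2"),
  Allele.1    = c("12", "9"),
  Allele.2    = c("14", NA),
  Height.1    = c(1500, 2100),
  Height.2    = c(1400, NA),
  stringsAsFactors = FALSE
)

# Slim: one row per observed peak. Stack the numbered column pairs row-wise.
n_pairs <- 2
slim <- do.call(rbind, lapply(seq_len(n_pairs), function(i) {
  data.frame(
    Sample.Name = wide$Sample.Name,
    Marker      = wide$Marker,
    Allele      = wide[[paste0("Allele.", i)]],
    Height      = wide[[paste0("Height.", i)]],
    stringsAsFactors = FALSE
  )
}))

# Drop empty positions and order by sample and marker.
slim <- slim[!is.na(slim$Allele), ]
slim <- slim[order(slim$Sample.Name, slim$Marker), ]
rownames(slim) <- NULL
slim
```

The slim layout has a fixed set of columns regardless of how many peaks a marker contains, which makes grouping and plotting by marker straightforward.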

4.3.2 Analysis using the strvalidator package

Open RStudio and a new R Script document.
Load the strvalidator package:
```
library(strvalidator)
```

Calculate stutter ratio using the calculateStutter function:

```
# Replace 'false' stutters.
val_replace <- c(-1.9, -1.8, -1.7, -0.9, -0.8, -0.7, 0.9, 0.8, 0.7)
val_by <- c(-1.3, -1.2, -1.1, -0.3, -0.2, -0.1, 0.3, 0.2, 0.1)

# Calculate stutters.
sureid_data_stutter <- calculateStutter(data = sureid_data, ref = sureid_ref,
                                        back = 2, forward = 1, interference = 0,
                                        replace.val = val_replace, by.val = val_by)
```
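
The two vectors above act as a lookup table. The sketch below shows how such a lookup could be applied to numeric allele differences, assuming tetranucleotide repeats where a difference of -0.7 (for example allele 9.3 versus 10) means 1 bp rather than 0.7 repeats. The fix_type helper and the toy alleles are hypothetical, and this is not the package's internal code.

```r
val_replace <- c(-1.9, -1.8, -1.7, -0.9, -0.8, -0.7, 0.9, 0.8, 0.7)
val_by      <- c(-1.3, -1.2, -1.1, -0.3, -0.2, -0.1, 0.3, 0.2, 0.1)

# Hypothetical helper: swap the listed differences and leave all others unchanged.
fix_type <- function(type, replace.val, by.val) {
  # Round before matching to avoid floating point mismatches such as 9.3 - 10.
  idx <- match(round(type, 1), round(replace.val, 1))
  ifelse(is.na(idx), type, by.val[idx])
}

# Toy parent and stutter alleles (hypothetical).
parent_allele  <- c(10, 10, 10, 9.3)
stutter_allele <- c(9, 9.3, 8.1, 8.3)

# Plain numeric difference, then the corrected stutter type.
raw_type   <- stutter_allele - parent_allele
fixed_type <- fix_type(raw_type, val_replace, val_by)

data.frame(parent_allele, stutter_allele,
           raw = round(raw_type, 1), fixed = round(fixed_type, 1))
```

In the corrected notation the decimal part counts base pairs, so -1.3 reads as one repeat and 3 bp shorter than the parent allele. Markers with a different repeat length would need a different lookup table.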

Calculate summary statistics using the calculateStatistics function:

```
# Calculate summary statistics.
sureid_data_stutter_stat <- calculateStatistics(data = sureid_data_stutter, target = c("Ratio"),
                                       group = c("Marker", "Type"), count = c("Allele"),
                                       quant = 0.95, decimals = 4)
```

View the result as an interactive table (possible to sort and export) in the RStudio Viewer tab:
```
# Sort by marker order defined in kit. Add factors according to kit information.
sureid_data_stutter_stat <- sortMarker(data = sureid_data_stutter_stat, kit = "SureID27", add.missing.levels = TRUE)

# Load the DT package.
library(DT)

# Convert to DT for interactive tables.
DT::datatable(sureid_data_stutter_stat, rownames = FALSE, filter = 'top', extensions = 'Buttons',
              options = list(dom = 'Bfrtip', buttons = c('copy', 'csv', 'excel', 'pdf', 'print')),
              caption = 'Table 1: This is a simple caption for the table.') %>%
  formatRound(5:9, 3) # Round columns 5-9 to 3 decimals.
```

5 Plot Kit Configuration and Marker Ranges

The kit configuration with marker ranges is an important feature of a kit. It is easy to create illustrative figures including one or multiple kits.

5.1 Prerequisites

To enable plotting of a kit configuration in STR-validator, the kit of interest must be included in the kit configuration file. The file is installed with the strvalidator package, usually at C:\..\Dokument\R\win-library\4.1\strvalidator\extdata\kit.txt. It is possible to edit the file manually using a text editor or spreadsheet software. However, the easiest way of adding new kits is to use the Manage kits function, which is found under the DryLab tab by clicking the [Kits] button. The GeneMapper Bins and Panels file provided by the manufacturer of the kit is required. NB! If you have customized the kit configuration file, you should keep a backup since it is overwritten whenever a new version of strvalidator is installed.

5.2 Learning Outcome

This exercise is estimated to take approximately 5 minutes and will teach:

Creating a kit configuration plot.
Saving plots as images.

5.2.1 Analysis using STR-validator

Currently there is no strvalidator function to create kit configuration plots using the command line.

Open STR-validator graphical user interface.
Create kit configuration plot.
1. Select the DryLab tab and click the [Kits] button.
2. In the Select kits option group, check the box for “SureID27”.
3. Change the Kit name size and Marker name size to “6”, the Marker height to “0.4”, and the Marker transparency to “0.5”. Type “sureid_ggplot” in the Name for result field.
4. Click the [Plot] button. In RStudio the plot is shown in the Plots tab located in the lower right pane. Click the Zoom button to view the plot in a larger window. It is possible to export the plot from RStudio. Alternatively, the plot can be exported from STR-validator as described below.
5. Click the [Save as image] button.
  1. Set the file extension to png image.
  2. Uncheck the Overwrite existing file and Load size from plot device options.
  3. Manually set the Width to “40” and Height to “20” cm. Leave the default “300” for Resolution and “1” for Scaling factor.
  4. Click [Open] to locate a folder to save the image (e.g. output_files) and click [Select Folder]. Click the [Save] button to save the image.
6. Close the Plot kit window.
  
  The plot kit dialogue.
  
  Example of a kit plot.

References

Bregu, Joli, Danielle Conklin, Elisse Coronado, Margaret Terrill, Robin W. Cotton, and Catherine M. Grgicak. 2013. “Analytical Thresholds and Sensitivity, Establishing RFU Thresholds for Forensic DNA Analysis.” Journal of Forensic Sciences 58 (1): 120–29. https://doi.org/10.1111/1556-4029.12008.

Bright, Jo-Anne, Sharon Neville, James M. Curran, and John S. Buckleton. 2014. “Variability of Mixed DNA Profiles Separated on a 3130 and 3500 Capillary Electrophoresis Instrument.” Australian Journal of Forensic Sciences 46 (3): 304–12. https://doi.org/10.1080/00450618.2013.851279.

Bright, Jo-Anne, Jnana Turkington, and John Buckleton. 2010. “Examination of the Variability in Mixed DNA Profile Parameters for the Identifiler Multiplex.” Forensic Science International: Genetics 4 (2): 111–14. https://doi.org/10.1016/j.fsigen.2009.07.002.

Brookes, Clare, Jo-Anne Bright, SallyAnn Harbison, and John Buckleton. 2012. “Characterising Stutter in Forensic STR Multiplexes.” Forensic Science International: Genetics 6 (1): 58–63. https://doi.org/10.1016/j.fsigen.2011.02.001.

Ensenberger, Martin G., Kristy A. Lenz, Learden K. Matthies, Gregory M. Hadinoto, John E. Schienman, Angela J. Przech, Michael W. Morganti, et al. 2016. “Developmental Validation of the PowerPlex® Fusion 6C System.” Forensic Science International: Genetics 21 (March): 134–44. https://doi.org/10.1016/j.fsigen.2015.12.011.

Gibb, Andrew J., Andrea-Louise Huell, Mark C. Simmons, and Rosalind M. Brown. 2009. “Characterisation of Forward Stutter in the AmpFlSTR® SGM Plus® PCR.” Science & Justice 49 (1): 24–31. https://doi.org/10.1016/j.scijus.2008.05.002.

Hansson, Oskar, Thore Egeland, and Peter Gill. 2017. “Characterization of Degradation and Heterozygote Balance by Simulation of the Forensic DNA Analysis Process.” International Journal of Legal Medicine 131 (2): 303–17. https://doi.org/10.1007/s00414-016-1453-x.

Klintschar, Michael, and Peter Wiegand. 2003. “Polymerase Slippage in Relation to the Uniformity of Tetrameric Repeat Stretches.” Forensic Science International 135 (2): 163–66. https://doi.org/10.1016/S0379-0738(03)00201-9.

Mönich, Ullrich J., Ken Duffy, Muriel Médard, Viveck Cadambe, Lauren E. Alfonse, and Catherine Grgicak. 2015. “Probabilistic Characterisation of Baseline Noise in STR Profiles.” Forensic Science International: Genetics 19 (November): 107–22. https://doi.org/10.1016/j.fsigen.2015.07.001.

Rakay, Christine A., Joli Bregu, and Catherine M. Grgicak. 2012. “Maximizing Allele Detection: Effects of Analytical Threshold and DNA Levels on Rates of Allele and Locus Drop-Out.” Forensic Science International: Genetics 6 (6): 723–28. https://doi.org/10.1016/j.fsigen.2012.06.012.

Virtual bins correspond to allele positions not physically present in the ladder. These have been verified by the manufacturer, or added by the laboratory.

Validation – Experimental Design and Analysis Using STR-validator

Hands-on exercises using the graphical user interface and command line

Oskar Hansson

Updated: 2022-09-17
