AIScreenR in herpes zoster HTA

Key summary

Background and Methods

This report evaluates the R package AIScreenR for screening titles and abstracts, using an evidence synthesis on herpes zoster vaccination (HZV) as a benchmark. Developed by researchers at VIVE in Denmark, AIScreenR serves as a communication bridge between researchers and Large Language Models (LLMs), allowing the LLM to assess references based on specific inclusion and exclusion criteria.

The unit of analysis for this evaluation consisted of 11,445 references retrieved by the HZV team from the databases. We prepared the reference library and then developed and tested a prompt that communicates the selection criteria to the LLM. To evaluate AIScreenR’s performance, we wrote custom code that builds a confusion matrix and reports performance metrics such as accuracy and workload savings.

Key findings

Workload reduction: AIScreenR reduced the screening workload by nearly 90%. It safely identified and removed 10,171 references (89% of the total) that had also been excluded by the HZV team.

Residual manual workload: LLMs tend to be over-inclusive, so the references they include still need a manual check. AIScreenR included 1,138 references that the HZV team had excluded.

Recall and Accuracy: AIScreenR identified 61% of the references included by the HZV team but overlooked 48 relevant references. Recall should be at least 75% before full implementation in evidence synthesis workflows.

Discussion

Recall Threshold: The current recall rate (61%) should reach at least 75% before full implementation in evidence synthesis workflows is recommended.

Error Analysis: A detailed analysis of the False Negatives (references wrongly excluded by the AI) is necessary to better understand and improve AIScreenR’s performance.

Current Utility: AIScreenR should currently be utilized to prioritize screening rather than to replace human review entirely.

Background

AIScreenR

AIScreenR (Vembye et al. 2025) is an R package designed to streamline the title and abstract screening phase of evidence synthesis by deploying Large Language Models (LLMs) through the OpenAI API. Its purpose is to act as an automated screener that evaluates the relevance of references against specific inclusion and exclusion criteria defined by researchers. AIScreenR allows researchers to instruct the AI via a prompt to assess the eligibility of a study. It is most commonly used to reduce the immense manual workload of evidence synthesis, particularly by serving as an efficient “second reviewer” that can accurately identify and remove irrelevant studies.

AIScreenR’s performance was validated in 2025, using as a benchmark the screening decisions made by the teams of 22 high-quality systematic reviews with different levels of complexity (Vembye et al. 2025). The validation showed that the LLM performed on par with, and in some cases even better than, human reviewers in title and abstract screening (Vembye et al. 2025). The tool demonstrated high accuracy and reliability, especially when a “human-in-the-loop” strategy was employed. Using confusion matrices and error reports, the authors showed that AIScreenR can successfully filter clear inclusions and exclusions, which supports its use as a validated addition to the evidence synthesis toolkit.

Objective

To evaluate the performance of AIScreenR in screening references at the title and abstract level, using a health technology assessment (HTA) on herpes zoster vaccination as the gold-standard comparative benchmark.

Methods

Insert information. See protocol.

Data requirements

For the AIScreenR screening functions to operate successfully, the dataset must contain at least three specific variables: a unique identifier for each record (e.g., studyid), the article title, and the abstract.

It is also highly recommended to include a human_code variable containing previous screening decisions (0 for excluded, 1 for included) in order to validate the AI’s performance.
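A quick structural check along these lines can catch a missing column before any API call is made. This is a minimal sketch in base R: the helper `check_screening_data()` and the toy data frame are hypothetical illustrations and are not part of the AIscreenR package.

```r
# Hypothetical helper (not part of AIscreenR): verify that a reference table
# has the columns needed for screening and, optionally, for validation.
check_screening_data <- function(refs, validate = TRUE) {
  required <- c("studyid", "title", "abstract")
  if (validate) required <- c(required, "human_code")

  missing_cols <- setdiff(required, names(refs))
  if (length(missing_cols) > 0) {
    stop("Missing column(s): ", paste(missing_cols, collapse = ", "))
  }
  if (anyDuplicated(refs$studyid) > 0) {
    stop("studyid must be unique for each record.")
  }
  if (validate && !all(refs$human_code %in% c(0, 1))) {
    stop("human_code must be 0 (excluded) or 1 (included).")
  }
  invisible(TRUE)
}

# Toy data for illustration only.
toy_refs <- data.frame(
  studyid    = 1:3,
  title      = c("Study A", "Study B", "Study C"),
  abstract   = c("Abstract A", "Abstract B", NA),
  human_code = c(1, 0, 0)
)

check_screening_data(toy_refs)
```

The check stops with an informative error instead of returning FALSE, so a malformed library cannot silently reach the screening step.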

Data quality

Before the screening phase, you must ensure data quality by filtering out records with missing abstracts and resolving any character encoding issues that could interfere with the API’s token processing in AIScreenR.

In many academic workflows, the search results will most likely be saved in an EndNote library before being brought into R. To facilitate the import, you should export your EndNote library as an XML or RIS file, as these formats are the most stable for preserving the relationship between titles and abstracts.

Once exported, the synthesisr::read_refs() function can parse these files directly, maintaining the metadata structure required for the screening process. This function belongs to the synthesisr package, which is a specialized tool for importing, exporting, and manipulating bibliographic data in R.

HZV use case

On 30 January 2026, the research librarian (Marita Heintz) provided the literature search results as a RIS (.ris) file. The results of the title and abstract screening for the original herpes zoster vaccine HTA were not available. Instead, Eli Heeen and Lene Juvet provided Jose Meneses with the final list of included studies and those excluded at the full-text stage.

To reconstruct the dataset for validation, Jose Meneses manually coded these references in EndNote to establish the following classifications:

Title and Abstract Screening (TA)
- Included references (Included_TA)
- Excluded references (Excluded_TA)
Full-Text Screening (FT)
- Included references (Included_FT)
- Excluded references (Excluded_FT)

After manually coding the Included_TA references in EndNote, we reconstructed the Excluded_TA set by subtracting the Included_TA references from the complete library of references. To replicate this in EndNote, follow these steps:

Navigate to Groups > Create From Groups.
Name the new group (e.g., Excluded_TA).
Under “Include References in,” select the All References group.
Set the operator to NOT and select the Included_TA group in the dropdown.
Click Create.
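The same set subtraction can also be expressed in R once the reference identifiers are available. The sketch below uses base R only; the identifiers are toy values, and the object names (`all_ids`, `included_ids`) are hypothetical.

```r
# Toy identifiers for illustration only; in practice these would come from
# the exported reference library.
all_ids      <- c("ref01", "ref02", "ref03", "ref04", "ref05")
included_ids <- c("ref02", "ref05")

# Excluded_TA = All References NOT Included_TA
excluded_ids <- setdiff(all_ids, included_ids)

# Matching human decision per reference: 1 for Included_TA, 0 for Excluded_TA.
human_code <- as.integer(all_ids %in% included_ids)

data.frame(id = all_ids, human_code = human_code)
```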

Human_code variable

The human_code variable allows us to validate AIScreenR’s performance, as it contains the screening decisions made by the HZV team (i.e., the human reviewers).

We manually created a new field in EndNote to assign human assessments to each reference. This human_code variable must be 0 for Excluded_TA and 1 for Included_TA. To add this variable to each reference in our RIS file, we must first create the custom field in EndNote using the following steps:

Step 1: Create the Field in Your References

To make a tag “global,” you must first ensure the field exists within your database records.

Go to Edit > Preferences (Windows) > Select Reference Types from the menu > Modify Reference Types.
Find one of the empty columns (such as Custom 1) and type the name you want (e.g., human_code).
Click OK. All your references will now have this new field available.

Step 2: Bulk Fill the Field

If you want this field (tag) to appear across all existing references without typing them one by one:

Open the Included_TA group and select all its references (Ctrl+A).
Go to the Library menu > Change/Move/Copy Fields.
In the Change Fields tab, select the field you just named (e.g., Custom 1).
Choose the option Replace whole field with: and type the value you want to appear (e.g., 1 for Included_TA).
Press OK.
Repeat these steps for the Excluded_TA group, entering 0.

Step 3: Configure the RIS Export Format

EndNote does not export custom fields by default. You must instruct it to associate your new field with a RIS tag (such as U1).

Go to Edit > Output Styles > Edit “RefMan (RIS) Export”.
In the menu on the left, click on Templates.
Locate the reference type you use most (e.g., Journal Article).
You will see a list of tags like AU - Author. Go to the end of the list and manually add:
- Type U1 - and then click the Insert Field button to select your custom field (e.g., Custom 1).
Save your changes: Go to File > Save As and give it a name like Custom RIS Export.

Step 4: Exporting the edited RIS file

When you click File > Export in EndNote, make sure you select the new “Custom RIS Export” style rather than the default “RefMan (RIS) Export” style, so that the U1 field is included in the final file.

Adjustment to references (optional)

We restricted the database to the variables required by AIScreenR: ID, title, abstract, and human_code. The human_code variable is essential only when human screening results are available, as it enables a direct comparison between AIScreenR’s performance and the decisions made by human reviewers.

Package and Data upload

It is practical to centralize all packages and functions used for the evaluation and reporting into a dedicated code chunk.

# Load packages and libraries
library(writexl)
library(dplyr)
library(synthesisr)
library(AIscreenR)
library(tibble)
library(revtools)
library(flextable)
library(tinytex)
library(tidyr)
library(ggplot2)
library(gt)
library(webshot2)
library(packrat)
library(rsconnect)

#Load data of the project
load(".RData")

Importing the dataset

To import references for AIScreenR, you typically utilize the synthesisr or revtools packages to convert standard bibliographic files—such as .ris, .bib, or .txt—into an R-compatible data frame. The process begins by reading the raw file into R, which generates a table where each row represents a unique study (Vembye et al. 2025).

We import the references that will be screened by AIScreenR.

HZV_references<-confusion_helvetesild3_values %>% 
  select(ID, human_code, title, abstract, studyid, promptid, prompt, n)
# These are the references with the human_code from EndNote but with the LLM judgements removed.

Missing Titles check

Missing titles: Use the following code to ensure no references are missing a title.

sum(is.na(HZV_references$title)) # check missing titles

[1] 0

view(HZV_references)

All the references have a title (No titles missing).

Missing Abstract check

The abstract is the primary source of information used by the AI to determine whether a study meets the inclusion criteria. Consequently, checking for missing abstracts is a critical step in ensuring the scientific integrity of the evidence synthesis.

AIScreenR, independent of the specific GPT model employed, tends to exclude references by default when information is insufficient. If a study without an abstract is actually relevant, it results in a False Negative, meaning key evidence is omitted from the synthesis.

Furthermore, the validation metrics for AIScreenR assume the AI has access to the same data as a human reviewer. Empty abstract fields artificially lower the performance scores, leading to biased conclusions regarding the tool’s validation for a specific project.

The abstracts in the database show “No information” instead of “NA”. Use this code to standardize the missing values to “NA”:

# Verify the "No information" items.
sum(HZV_references$abstract == "No information", na.rm = TRUE)

[1] 1616

# Convert "No information" to a real "NA".
HZV_references$abstract[HZV_references$abstract == "No information"] <- NA

We can check the number of missing abstracts now:

sum(is.na(HZV_references$abstract)) # check missing abstracts

[1] 1616

Abstracts are missing for 1,616 references, representing 14% of the 11,445 total entries in the EndNote library.
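One possible way to handle such records is to set them aside before screening, so that the LLM only receives references with an abstract. The sketch below shows the idea on toy data in base R. Routing the records without an abstract to manual screening is an assumption made for illustration, not a step this report prescribes, and the object names are hypothetical.

```r
# Toy data for illustration only.
toy_refs <- data.frame(
  studyid  = 1:5,
  title    = paste("Study", 1:5),
  abstract = c("Text", "No information", "Text", NA, "No information"),
  stringsAsFactors = FALSE
)

# Treat the placeholder string as a true missing value.
toy_refs$abstract[which(toy_refs$abstract == "No information")] <- NA

# Number and share of records without an abstract.
n_missing   <- sum(is.na(toy_refs$abstract))
pct_missing <- round(100 * n_missing / nrow(toy_refs), 1)

# Split the library: records with an abstract can go to the LLM;
# records without one are set aside (here, for manual screening).
refs_for_llm    <- toy_refs[!is.na(toy_refs$abstract), ]
refs_for_manual <- toy_refs[is.na(toy_refs$abstract), ]
```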

Set API Key

AIScreenR uses an API key to connect to a Large Language Model (LLM), in this case OpenAI’s GPT models through the OpenAI API. After obtaining your API key (which is private and confidential), you can save it using the following code:

# Open your .Renviron file and add the line CHATGPT_KEY=your_key.
# Save and close the file, then restart R (Ctrl + Shift + F10 in RStudio).
# From then on, the AIscreenR functions use get_api_key() to retrieve
# the key from your R environment automatically.
usethis::edit_r_environ()

# Check that the API key was saved successfully.
get_api_key()

Prompt

The prompt is the set of instructions that the LLM follows to determine the relevance of each study. In other words, the prompt mirrors the selection criteria, defining the key inclusion and exclusion criteria as clearly as possible. To ensure high accuracy, the prompt should be detailed and thorough.

You can draft the prompt in either R or Word before integrating it into the pipeline.

HZV_prompt <- "Role: You are an expert systematic review screener specializing in vaccinology and epidemiology. 
Task: Evaluate whether a study meets the following specific inclusion criteria. 
Population: Adults aged fifty years or older. Include participants regardless of previous vaccination history, history of herpes zoster, or presence of comorbidities (e.g., cardiovascular disease, Alzheimer’s). 
Intervention: Recombinant Zoster Vaccine (RZV) (e.g., Shingrix, HZ/su) or Zoster Vaccine Live (ZVL) (e.g., Zostavax). 
Comparison/Control: Placebo, no vaccine, or an alternative vaccine (including head-to-head comparisons between RZV and ZVL). 
Study Design: Included: RCTs, non-randomized studies, observational studies (prospective/retrospective cohort, registry-based, quasi-experimental, case-control, nested case-control), and post-marketing safety evaluations. 
Secondary Research: Include systematic reviews, meta-analyses, and literature reviews. 
Excluded: Case studies and case series. 
Evaluation Rule: Does the study meet all the inclusion criteria? Only include the study if every criterion is satisfied."

Screening command
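The screening call itself is not shown in this report. A minimal sketch of what such a call might look like is given below. It assumes the `tabscreen_gpt()` function from the AIscreenR package, the gpt-4o-mini model named in Table 3, and a single run per reference. The wrapper `run_screening()` is hypothetical, and the exact arguments used for the HZV evaluation are not known from this report.

```r
# Hypothetical wrapper: calls the API only when the AIscreenR package and
# an API key (CHATGPT_KEY) are available; otherwise it returns NULL.
run_screening <- function(refs, prompt, model = "gpt-4o-mini") {
  if (!requireNamespace("AIscreenR", quietly = TRUE) ||
      !nzchar(Sys.getenv("CHATGPT_KEY"))) {
    message("AIscreenR or CHATGPT_KEY not available; skipping the API call.")
    return(NULL)
  }
  # Assumed arguments: studyid, title and abstract are passed as
  # unquoted column names of `refs`.
  AIscreenR::tabscreen_gpt(
    data     = refs,
    prompt   = prompt,
    studyid  = studyid,
    title    = title,
    abstract = abstract,
    model    = model,
    reps     = 1
  )
}

# Intended use with the objects defined earlier in this report:
# HZV_screening <- run_screening(HZV_references, HZV_prompt)
```

Guarding the call keeps the script runnable on machines without an API key, which is convenient when re-rendering the report from saved results.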

Results

Exploring the screening results

Print the results of the screening done by AIScreenR. Check the following variables:

decision_gpt: the raw gpt decision - either “1”, “0”, “1.1” for inclusion, exclusion, or uncertainty.
detailed_description: the detailed description of the given decision made by OpenAI’s GPT API models.
decision_binary: the binary gpt decision, that is 1 for inclusion and 0 for exclusion.

# Preview of the screening results data.
view(HZV_screening$answer_data) # inspect the data

# Distribution of LLM-generated screening decisions
HZV_screening$answer_data %>%  
  count(decision_binary)

#Saving the results file
saveRDS(HZV_screening, "HZV_screening.rds")

AIScreenR excluded 10,227 references and included 1,214. Additionally, four references were marked as ‘NA’ by the software. We checked these manually and reclassified them as excluded (0) because we judged them not relevant to the study.

#Assign 0 to the references with NA (n=4)
HZV_screening$answer_data$decision_binary[is.na(HZV_screening$answer_data$decision_binary)]<-0

#Check
HZV_screening$answer_data %>%  
  count(decision_binary)

After cleaning the data, AIScreenR excluded 10,231 references and included 1,214 references (Table 1).

# Data wrangling: raw data (first table)
HZV_screening_rawtable <- HZV_screening$answer_data %>%
  count(decision_binary) %>%
  mutate(
    Decision = if_else(decision_binary == 1, "Included", "Excluded"),
    Percentage = n / sum(n)
  ) %>%
  select(Decision, n, Percentage)
view(HZV_screening_rawtable)

Table 1: AIScreenR screening results

Decision	Frequency (n)	Percentage (%)
Excluded	10,231	89.4%
Included	1,214	10.6%
Note: Four references returned an error in AIScreenR; these were manually reviewed and categorized as excluded.

Now that the data are clean, with each reference classified as either included or excluded by AIScreenR, we can proceed to the analysis.

Analyzing the results

AIScreenR (like other AI screening tools) can be viewed as a diagnostic test that detects references relevant to the review question and excludes irrelevant ones. There are two inputs: the binary inclusion/exclusion decisions made by the HZV team (human reviewers) and those made by AIScreenR. Cross-tabulating them produces a 2x2 table known as a ‘confusion matrix’ (Ghosh 2022).

Confusion matrix

A confusion matrix is a table that allows us to evaluate the performance of an AI tool (such as AIScreenR) in selecting relevant studies (Ghosh 2022). In evidence synthesis, the confusion matrix helps us with:

Error Analysis: Identifying misclassification patterns.
Bias Detection: Understanding the balance between False Negatives (FN) and False Positives (FP).
LLM Selection: Comparing different models based on their screening performance.
Fine-tuning: Evaluating how the LLMs respond to different prompting approaches.

The confusion matrix visualizes the following metrics:

True Positives (TP): Correctly included references.
True Negatives (TN): Correctly excluded references. This is vital for workload savings.
False Positives (FP): Incorrectly included references (Type I error); references the AI included but humans excluded.
False Negatives (FN): Incorrectly excluded references (Type II error/Miss Rate). This is the most critical metric, as it represents the relevant references the review team would have lost without human verification.

To prepare the key metrics for the confusion matrix, we retrieve the raw data (the distribution of decisions made by both AIScreenR and humans), organize these into a data frame, and then generate the confusion matrix table.

Retrieve and prepare the raw data.

Organize the data in a data frame.

Plotting the confusion matrix table.
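The three steps above could be sketched as follows. This is a minimal base R sketch, not the code used for the report: it rebuilds one row per reference from the cell counts reported in Table 2, so the `decisions` data frame is a stand-in for the real screening output, and `table()` with `addmargins()` stands in for the formatted table.

```r
# Cell counts as reported in Table 2 (human decision vs. AIScreenR decision).
tn <- 10171  # excluded by both
fp <- 1138   # excluded by humans, included by AIScreenR
fn <- 48     # included by humans, excluded by AIScreenR
tp <- 75     # included by both

# Rebuild one row per reference from the counts (stand-in for the real data).
decisions <- data.frame(
  human_code      = rep(c(0, 0, 1, 1), times = c(tn, fp, fn, tp)),
  decision_binary = rep(c(0, 1, 0, 1), times = c(tn, fp, fn, tp))
)

# Cross-tabulate human decisions (rows) against AI decisions (columns).
confusion <- table(
  Human     = factor(decisions$human_code, levels = c(0, 1),
                     labels = c("Excluded", "Included")),
  AIScreenR = factor(decisions$decision_binary, levels = c(0, 1),
                     labels = c("Excluded", "Included"))
)

# Add row and column totals for display.
addmargins(confusion)
```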

The evaluation of the HZV screening process (n = 11,445 references from the searches) is shown in Table 2.

Table 2: Confusion matrix AIScreenR vs HZV team

Human reviewers	AIScreenR: Excluded	AIScreenR: Included
Excluded	10,171	1,138
Included	48	75
Total	10,219	1,213

Interpreting the results:

True Negatives (TN) = 10,171 references (88.9%) were correctly excluded by both the HZV team and AIScreenR. In other words, AIScreenR safely removed 89% of all records as irrelevant.
False Negatives (FN) = 48 references (0.4%) included by the HZV team but missed by AIScreenR.
True Positives (TP) = 75 references. These are the relevant studies correctly identified by both the HZV team and AIScreenR.
False Positives (FP) = 1,138 references were included by AIScreenR but excluded by the HZV team. This represents the “residual manual workload” (i.e., the 1,138 papers the review team still needs to check manually to ensure they are truly irrelevant).

Key performance metrics

Based on the counts in the confusion matrix, we can now estimate the key performance metrics, including recall and accuracy.

Recall (Sensitivity)

Recall represents the proportion of truly relevant references that were correctly identified and included by AIScreenR. This is a critical metric, as it measures AIScreenR’s ability to find “all the needles in the haystack.” High recall means that few relevant studies are missed during the screening process.

Accuracy

Accuracy is the proportion of all decisions made by AIScreenR (both inclusions and exclusions) that were correct. While informative, accuracy can sometimes be misleading in systematic reviews if the number of irrelevant papers vastly outweighs the relevant ones (class imbalance).

The screen_analyzer function in AIScreenR calculates performance metrics such as recall and specificity (Table 3).

Table 3: Performance metrics of AIScreenR results

Model	Recall (Sensitivity)	Specificity
gpt-4o-mini	61.0%	89.9%

Specificity

Specificity measures the proportion of truly irrelevant references that were correctly excluded by AIScreenR. Unlike recall, which focuses on finding what is relevant, specificity focuses on accurately “clearing out” the noise. High specificity reduces the number of irrelevant references that researchers must manually double-check.

Workload Savings

Workload savings quantifies the efficiency gain provided by AIScreenR, expressed as the percentage of the total references that the review team no longer needs to screen manually because AIScreenR correctly excluded them. This metric directly reflects the time and resource reduction achieved by implementing AIScreenR.
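As a worked example, the definitions above can be applied directly to the four cells of the confusion matrix. This is a minimal base R sketch using the counts reported in Table 2; the variable names are hypothetical. The denominators are the 11,432 references cross-classified in Table 2, so a value can differ by a few tenths of a percentage point from one computed against the full library of 11,445 references, or from a workload figure that counts all AI exclusions rather than only the correct ones.

```r
# Cell counts as reported in Table 2.
tn <- 10171  # true negatives
fp <- 1138   # false positives
fn <- 48     # false negatives
tp <- 75     # true positives

# Recall (sensitivity): share of human-included references the AI also included.
recall <- tp / (tp + fn)

# Specificity: share of human-excluded references the AI also excluded.
specificity <- tn / (tn + fp)

# Accuracy: share of all cross-classified decisions that agree with the humans.
accuracy <- (tp + tn) / (tp + tn + fp + fn)

# Workload savings, following the definition above (correct exclusions only),
# expressed relative to the cross-classified references.
workload_savings <- tn / (tp + tn + fp + fn)

round(100 * c(recall = recall, specificity = specificity), 1)
```

Keeping the four counts as named scalars makes it easy to recompute the metrics after a prompt revision without rebuilding the full table.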

Table 4 presents the key performance metrics of AIScreenR in screening the studies compared to the HZV team.

Table 4: Key performance metrics in screening of AIScreenR vs HZV team

Performance metric	Value	Interpretation
Recall (Sensitivity)	61.0%	The most vital metric for quality. It shows that AIScreenR captured 61% of the relevant references. The remaining 39% (the 48 False Negatives) were missed.
Accuracy	89.5%	The overall proportion of correct decisions (both inclusions and exclusions) made by AIScreenR. While high, accuracy can be misleading in imbalanced datasets.
Specificity	89.9%	The ability of AIScreenR to correctly identify and discard irrelevant studies. A 90% specificity is excellent for significantly narrowing down a large library.
Workload savings	89.4%	The “Efficiency Gain.” It measures the percentage of the total library that the HZV team was able to skip screening entirely thanks to AIScreenR.

Conclusions

AIScreenR should be used to prioritize screening rather than replacing human review entirely.
AIScreenR identified 61% of the references included by the HZV team, meaning it overlooked 48 relevant studies. Recall should be at least 75% before full implementation in evidence synthesis workflows.
AIScreenR resulted in a workload reduction of nearly 90% during the title and abstract screening of the 11,445 references retrieved from the databases.
The HZV team could have safely excluded 90% of the total references by using AIScreenR.

References

Ghosh, Chandril. 2022. Data Analysis with Machine Learning for Psychologists. Springer International Publishing. https://doi.org/10.1007/978-3-031-14634-3.

Vembye, Mikkel Helding, Julian Christensen, Anja Bondebjerg Mølgaard, and Frederikke Lykke Witthöft Schytt. 2025. “Generative Pretrained Transformer Models Can Function as Highly Reliable Second Screeners of Titles and Abstracts in Systematic Reviews: A Proof of Concept and Common Guidelines.” Psychological Methods, July. https://doi.org/10.1037/met0000769.