During my Master’s in Infectious Diseases and Tropical Medicine at the Federal University of Minas Gerais, I came across a PDF document listing the QUALIS grades of a large number of scientific journals. QUALIS is a classification system developed by CAPES (Coordenação de Aperfeiçoamento de Pessoal de Nível Superior) to evaluate the quality of academic journals. I needed the QUALIS grades of more than 30 journals in which a professor had published over the last four years, so I decided to extract the relevant data from the PDF and merge it with a .csv file obtained from PubMed. This report documents the steps taken to clean the PDF data and organize it into a tidy format. Merging it with the PubMed data to generate the final report will be detailed in a separate article.
The PDF from which this data has been extracted can be found here.
The first step is always to load the required libraries. These
include pdftools, which allows us to work with PDF files,
and tidyverse, which provides a suite of tools for data
manipulation and visualization. These will be the backbone of our
script.
# Load libraries
library(pdftools)
library(tidyverse)
Next, we need to extract the text from the PDF file. We’ll use the
pdf_text function from the pdftools package to
convert the file into a character vector where each page of the PDF is
represented as an element in the vector. After this, we prepare for data
extraction by defining the QUALIS levels and initializing an empty
tibble to hold the extracted data.
# Read the PDF text into a character vector
data_text <- pdf_text("QUALIS CAPES 2024.pdf")
# Define the levels for QUALIS in order
qualis_levels <- c("A1", "A2", "A3", "A4",
                   "B1", "B2", "B3", "B4",
                   "C", "NP")
# Initialize an empty tibble to store results
data_tibble <- tibble(
  ISSN = character(),
  Journal = character(),
  QUALIS = factor(levels = qualis_levels, ordered = TRUE) # Ordered factor
)
# We can see how the data currently looks by looking at the first page
head(data_text, 1)
## [1] " QUALIS CAPES\n\n ISSN TITULO ESTRATO\n\n0149‐1423 AAPG BULLETIN (PRINT) A1\n1069‐6563 ACADEMIC EMERGENCY MEDICINE A1\n1040‐2446 ACADEMIC MEDICINE A1\n0001‐4575 ACCIDENT ANALYSIS AND PREVENTION A1\n0951‐3574 ACCOUNTING, AUDITING & ACCOUNTABILITY JOURNAL A1\n0001‐4842 ACCOUNTS OF CHEMICAL RESEARCH A1\n0360‐0300 ACM COMPUTING SURVEYS A1\n0734‐2071 ACM TRANSACTIONS ON COMPUTER SYSTEMS A1\n1946‐6226 ACM TRANSACTIONS ON COMPUTING EDUCATION A1\n0730‐0301 ACM TRANSACTIONS ON GRAPHICS A1\n1046‐8188 ACM TRANSACTIONS ON INFORMATION SYSTEMS A1\n1556‐4681 ACM TRANSACTIONS ON KNOWLEDGE DISCOVERY FROM DATA A1\n1944‐8252 ACS APPLIED MATERIALS & INTERFACES (ONLINE) A1\n2155‐5435 ACS CATALYSIS A1\n2374‐7951 ACS CENTRAL SCIENCE (ONLINE) A1\n1554‐8929 ACS CHEMICAL BIOLOGY A1\n1554‐8937 ACS CHEMICAL BIOLOGY A1\n1948‐7193 ACS CHEMICAL NEUROSCIENCE A1\n2373‐8227 ACS INFECTIOUS DISEASES A1\n2161‐1653 ACS MACRO LETTERS A1\n1936‐0851 ACS NANO A1\n2379‐3694 ACS SENSORS A1\n2168‐0485 ACS SUSTAINABLE CHEMISTRY & ENGINEERING A1\n2161‐5063 ACS SYNTHETIC BIOLOGY A1\n0094‐5765 ACTA ASTRONAUTICA A1\n0001‐5237 ACTA ASTRONOMICA A1\n1742‐7061 ACTA BIOMATERIALIA A1\n2052‐5206 ACTA CRYSTALLOGRAPHICA SECTION B A1\n1359‐6454 ACTA MATERIALIA (OXFORD) A1\n0001‐5962 ACTA MATHEMATICA A1\n0001‐6322 ACTA NEUROPATHOLOGICA A1\n2051‐5960 ACTA NEUROPATHOLOGICA COMMUNICATIONS A1\n1745‐3674 ACTA ORTHOPAEDICA (PRINT) A1\n2211‐3835 ACTA PHARMACEUTICA SINICA B A1\n1748‐1716 ACTA PHYSIOLOGICA (ONLINE) A1\n0186‐6028 ACTA SOCIOLOGICA A1\n0001‐706X ACTA TROPICA A1\n0335‐5322 ACTES DE LA RECHERCHE EN SCIENCES SOCIALES A1\n2270‐4957 ACTES SÉMIOTIQUES (EN LIGNE) A1\n0965‐2140 ADDICTION (ABINGDON. PRINT) A1\n"
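Declaring QUALIS as an ordered factor is useful later on, because the grades can then be sorted and compared by rank. A minimal sketch with made-up grades (the `grades` vector is illustrative and not taken from the PDF):

```r
# Same level order as in the script above: A1 is the first level
qualis_levels <- c("A1", "A2", "A3", "A4",
                   "B1", "B2", "B3", "B4",
                   "C", "NP")

# Hypothetical grades, for illustration only
grades <- factor(c("B2", "A1", "NP", "A3"),
                 levels = qualis_levels, ordered = TRUE)

# Sorting follows the declared level order
sorted_grades <- sort(grades)
sorted_grades

# Comparison operators are defined for ordered factors:
# which grades are A3 or better?
top_grades <- grades <= "A3"
top_grades
```

With an unordered factor the same comparison would return NA with a warning, which is why `ordered = TRUE` is set when the tibble is initialized.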
With the PDF text loaded and data structures ready, it’s time to define the rules for extracting the data we need. We’ll use regular expressions to identify patterns for ISSNs and QUALIS grades in the text.
# Define ISSN regex
# Four digits, followed by the non-ASCII hyphen used in the PDF, followed by
# three digits and a final character that is a letter or a digit
issn_pattern <- "\\d{4}‐\\d{3}[a-zA-Z0-9]"
# Define QUALIS regex
# Either A or B followed by a digit from 1 to 4 OR
# C OR
# NP
# each matched as a whole word, not in the middle of another word
qualis_pattern <- "\\b(?:[AB][1-4]|C|NP)\\b"
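As a quick sanity check, both patterns can be tried on a single line before running them on whole pages. The sketch below uses one entry copied from the first page; `sample_line`, `issn_found`, `qualis_found`, and `title_matches` are names introduced here for illustration, and the second title is hypothetical.

```r
library(stringr)

# Patterns as defined above
issn_pattern   <- "\\d{4}‐\\d{3}[a-zA-Z0-9]"
qualis_pattern <- "\\b(?:[AB][1-4]|C|NP)\\b"

# One entry copied from the first page of the PDF text
sample_line <- "0001‐706X ACTA TROPICA A1"

issn_found   <- str_extract(sample_line, issn_pattern)    # "0001‐706X"
qualis_found <- str_extract(sample_line, qualis_pattern)  # "A1"

# Caveat: a standalone "C" inside a title also matches the QUALIS pattern
# (hypothetical title, for illustration only)
title_matches <- str_extract_all("JOURNAL OF TOPIC C A2", qualis_pattern)[[1]]
title_matches  # "C" "A2"
```

The last call shows a limitation worth keeping in mind: the word boundaries prevent matches inside longer words, but not against a one-letter word in a journal title.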
Before extracting the data, we need to clean up the text. Specifically, we’ll locate the first occurrence of an ISSN and remove extraneous text before it.
# Find where the first ISSN starts on the first page
first_issn_index <- str_locate(data_text[1], issn_pattern)[1]
# Remove all content before the first ISSN
data_text[1] <- str_sub(data_text[1], start = first_issn_index)
# We can check how the first page looks now
head(data_text,1)
## [1] "0149‐1423 AAPG BULLETIN (PRINT) A1\n1069‐6563 ACADEMIC EMERGENCY MEDICINE A1\n1040‐2446 ACADEMIC MEDICINE A1\n0001‐4575 ACCIDENT ANALYSIS AND PREVENTION A1\n0951‐3574 ACCOUNTING, AUDITING & ACCOUNTABILITY JOURNAL A1\n0001‐4842 ACCOUNTS OF CHEMICAL RESEARCH A1\n0360‐0300 ACM COMPUTING SURVEYS A1\n0734‐2071 ACM TRANSACTIONS ON COMPUTER SYSTEMS A1\n1946‐6226 ACM TRANSACTIONS ON COMPUTING EDUCATION A1\n0730‐0301 ACM TRANSACTIONS ON GRAPHICS A1\n1046‐8188 ACM TRANSACTIONS ON INFORMATION SYSTEMS A1\n1556‐4681 ACM TRANSACTIONS ON KNOWLEDGE DISCOVERY FROM DATA A1\n1944‐8252 ACS APPLIED MATERIALS & INTERFACES (ONLINE) A1\n2155‐5435 ACS CATALYSIS A1\n2374‐7951 ACS CENTRAL SCIENCE (ONLINE) A1\n1554‐8929 ACS CHEMICAL BIOLOGY A1\n1554‐8937 ACS CHEMICAL BIOLOGY A1\n1948‐7193 ACS CHEMICAL NEUROSCIENCE A1\n2373‐8227 ACS INFECTIOUS DISEASES A1\n2161‐1653 ACS MACRO LETTERS A1\n1936‐0851 ACS NANO A1\n2379‐3694 ACS SENSORS A1\n2168‐0485 ACS SUSTAINABLE CHEMISTRY & ENGINEERING A1\n2161‐5063 ACS SYNTHETIC BIOLOGY A1\n0094‐5765 ACTA ASTRONAUTICA A1\n0001‐5237 ACTA ASTRONOMICA A1\n1742‐7061 ACTA BIOMATERIALIA A1\n2052‐5206 ACTA CRYSTALLOGRAPHICA SECTION B A1\n1359‐6454 ACTA MATERIALIA (OXFORD) A1\n0001‐5962 ACTA MATHEMATICA A1\n0001‐6322 ACTA NEUROPATHOLOGICA A1\n2051‐5960 ACTA NEUROPATHOLOGICA COMMUNICATIONS A1\n1745‐3674 ACTA ORTHOPAEDICA (PRINT) A1\n2211‐3835 ACTA PHARMACEUTICA SINICA B A1\n1748‐1716 ACTA PHYSIOLOGICA (ONLINE) A1\n0186‐6028 ACTA SOCIOLOGICA A1\n0001‐706X ACTA TROPICA A1\n0335‐5322 ACTES DE LA RECHERCHE EN SCIENCES SOCIALES A1\n2270‐4957 ACTES SÉMIOTIQUES (EN LIGNE) A1\n0965‐2140 ADDICTION (ABINGDON. PRINT) A1\n"
This step involves extracting the ISSN, journal names, and QUALIS grades from each page of the PDF. For this, we’ll iterate through the pages, locate relevant patterns, and use string manipulation to extract and store the data.
# Iterate across all pages
for (i in seq_along(data_text)) {
  # remove unnecessary whitespace
  data_text[i] <- str_squish(data_text[i])
  # get the location of every ISSN on the page
  issnPage <- as_tibble(str_locate_all(data_text[i], issn_pattern)[[1]])
  # get the location of every QUALIS grade on the page
  qualisPage <- as_tibble(str_locate_all(data_text[i], qualis_pattern)[[1]])
  # skip pages without any QUALIS match
  if (nrow(qualisPage) == 0) next
  # index of our current position in the page; it starts at 0 so that an
  # ISSN at position 1 (the first entry of a page) is not skipped
  indexPage <- 0
  # position where the last QUALIS grade of the page ends
  endOfPage <- max(qualisPage$end)
  # while we haven't reached the end of the page
  while (indexPage < endOfPage) {
    # determine the index of the next ISSN
    startNextIssn <- issnPage %>%
      filter(start > indexPage) %>%
      slice_min(start, n = 1) %>%
      pull(start)
    # determine the index of the end of the next QUALIS grade
    endNextQualis <- qualisPage %>%
      filter(start > startNextIssn) %>%
      slice_min(start, n = 1) %>%
      pull(end)
    # move the index to the end of that grade, as this is where we are now
    indexPage <- endNextQualis
    # extract the text between these two indexes
    currentLine <- str_sub(data_text[i], start = startNextIssn,
                           end = endNextQualis)
    # the ISSN is the first 9 characters of the current line
    currentIssn <- str_sub(currentLine, end = 9)
    # extract the QUALIS grade from the current line using regex
    currentQUALIS <- factor(str_extract(currentLine, qualis_pattern),
                            levels = qualis_levels,
                            ordered = TRUE)
    # length of the grade, used to find where the journal name ends
    qualis_length <- str_length(as.character(currentQUALIS))
    # extract the journal name
    currentJournal <- str_sub(currentLine,
                              start = 11, # Journal name always starts here
                              end = str_length(currentLine) - qualis_length - 1)
    # remove "(ONLINE)" and "(PRINT)" from the journal name, case-insensitive
    currentJournal <- str_remove_all(currentJournal,
                                     "(?i)\\s*\\((ONLINE|PRINT)\\)")
    # add this data to the tibble
    data_tibble <- data_tibble %>%
      add_row(ISSN = currentIssn,
              Journal = currentJournal,
              QUALIS = currentQUALIS)
  }
}
# Final clean-up of the journal names
data_tibble <- data_tibble %>%
  mutate(
    Journal = str_remove_all(Journal, "(?i)\\s*\\((ONLINE|PRINT)\\)"), # Remove specific tags
    Journal = case_when(
      Journal == "INTERN EMERG MED" ~ "Internal and Emergency Medicine", # Manual fix
      TRUE ~ Journal # Keep all other journal names as they are
    )
  )
# We can look at the final result by printing the tibble
data_tibble
## # A tibble: 21,564 × 3
## ISSN Journal QUALIS
## <chr> <chr> <ord>
## 1 1069‐6563 ACADEMIC EMERGENCY MEDICINE A1
## 2 1040‐2446 ACADEMIC MEDICINE A1
## 3 0001‐4575 ACCIDENT ANALYSIS AND PREVENTION A1
## 4 0951‐3574 ACCOUNTING, AUDITING & ACCOUNTABILITY JOURNAL A1
## 5 0001‐4842 ACCOUNTS OF CHEMICAL RESEARCH A1
## 6 0360‐0300 ACM COMPUTING SURVEYS A1
## 7 0734‐2071 ACM TRANSACTIONS ON COMPUTER SYSTEMS A1
## 8 1946‐6226 ACM TRANSACTIONS ON COMPUTING EDUCATION A1
## 9 0730‐0301 ACM TRANSACTIONS ON GRAPHICS A1
## 10 1046‐8188 ACM TRANSACTIONS ON INFORMATION SYSTEMS A1
## # ℹ 21,554 more rows
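The loop above tracks character positions by hand. An alternative, offered only as a sketch and not as the method used to produce the results in this report, is to split each page into lines and match every line against one anchored pattern. It assumes that each entry sits on its own line, as on the first page shown earlier. The names `pages`, `line_pattern`, `page_lines`, `matches`, and `parsed` are introduced here for illustration, and `pages` is a small stand-in for the output of pdf_text built from entries copied from page 1.

```r
library(tidyverse)

qualis_levels <- c("A1", "A2", "A3", "A4",
                   "B1", "B2", "B3", "B4",
                   "C", "NP")

# Stand-in for the pdf_text() output: one string per page
pages <- c(
  " QUALIS CAPES\n\n ISSN TITULO ESTRATO\n\n0149‐1423 AAPG BULLETIN (PRINT) A1\n1069‐6563 ACADEMIC EMERGENCY MEDICINE A1\n",
  "0001‐706X ACTA TROPICA A1\n"
)

# ISSN at the start of the line, grade at the end, title in between
line_pattern <- "^(\\d{4}‐\\d{3}[a-zA-Z0-9])\\s+(.+?)\\s+([AB][1-4]|C|NP)$"

# One element per line of text, across all pages
page_lines <- pages %>%
  str_split("\n") %>%
  unlist() %>%
  str_trim()

# Character matrix: full match in column 1, the three groups in columns 2 to 4
matches <- str_match(page_lines, line_pattern)

parsed <- tibble(ISSN = matches[, 2],
                 Journal = matches[, 3],
                 QUALIS = matches[, 4]) %>%
  filter(!is.na(ISSN)) %>%  # headers and blank lines do not match
  mutate(
    Journal = str_remove_all(Journal, "(?i)\\s*\\((ONLINE|PRINT)\\)"),
    QUALIS = factor(QUALIS, levels = qualis_levels, ordered = TRUE)
  )

parsed
```

Because the grade is anchored to the end of the line, a one-letter word such as "C" inside a title is not mistaken for a grade, and no row is added one at a time inside a loop.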
Finally, to ensure data consistency, we replace the non-standard hyphens in the ISSNs with standard ASCII hyphens.
# The hyphen ("‐") that separates the digits in the ISSN is not the ASCII
# hyphen, which causes mismatches when merging the tibble with other data.
# Replace it with the ASCII hyphen.
qualis_data <- data_tibble %>% mutate(ISSN = str_replace_all(ISSN, "‐", "-"))
# Remove every object except qualis_data from the environment
rm(list=setdiff(ls(),"qualis_data"))
# saving the data
saveRDS(qualis_data, "qualis_data.rds")
# printing the final data
qualis_data
## # A tibble: 21,564 × 3
## ISSN Journal QUALIS
## <chr> <chr> <ord>
## 1 1069-6563 ACADEMIC EMERGENCY MEDICINE A1
## 2 1040-2446 ACADEMIC MEDICINE A1
## 3 0001-4575 ACCIDENT ANALYSIS AND PREVENTION A1
## 4 0951-3574 ACCOUNTING, AUDITING & ACCOUNTABILITY JOURNAL A1
## 5 0001-4842 ACCOUNTS OF CHEMICAL RESEARCH A1
## 6 0360-0300 ACM COMPUTING SURVEYS A1
## 7 0734-2071 ACM TRANSACTIONS ON COMPUTER SYSTEMS A1
## 8 1946-6226 ACM TRANSACTIONS ON COMPUTING EDUCATION A1
## 9 0730-0301 ACM TRANSACTIONS ON GRAPHICS A1
## 10 1046-8188 ACM TRANSACTIONS ON INFORMATION SYSTEMS A1
## # ℹ 21,554 more rows
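To see why the hyphen matters, consider what happens in an exact-match join: the two characters look alike but are different, so the ISSNs never match and the grade comes back missing. In the sketch below, `pubmed_like` is a made-up one-row stand-in for the PubMed export (assumed here to use the ASCII hyphen), and `n_articles` is a hypothetical column.

```r
library(tidyverse)

# One row as it comes out of the PDF, with the non-ASCII hyphen
qualis_raw <- tibble(ISSN = "0001‐706X",
                     Journal = "ACTA TROPICA",
                     QUALIS = "A1")

# Hypothetical stand-in for the PubMed data, with the ASCII hyphen
pubmed_like <- tibble(ISSN = "0001-706X", n_articles = 3)

# Joining on the raw ISSN finds no match, so QUALIS is NA
before <- left_join(pubmed_like, qualis_raw, by = "ISSN")

# After replacing the hyphen, the join finds the journal
qualis_fixed <- qualis_raw %>%
  mutate(ISSN = str_replace_all(ISSN, "‐", "-"))
after <- left_join(pubmed_like, qualis_fixed, by = "ISSN")

before$QUALIS  # NA
after$QUALIS   # "A1"
```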
The data is now tidy and ready to be used in further applications!