Cleaning Data Obtained from a PDF File

Introduction

During my Master’s in Infectious Diseases and Tropical Medicine at the Federal University of Minas Gerais, I encountered a PDF document containing the QUALIS grades for numerous scientific publications. QUALIS is a classification system developed by CAPES (Coordenação de Aperfeiçoamento de Pessoal de Nível Superior) to evaluate the quality of academic journals. To assess the QUALIS grades of more than 30 journals in which a professor had published over the previous four years, I decided to extract the relevant data from the PDF and merge it with a .csv file obtained from PubMed. This report documents the steps taken to clean and organize the data from the PDF into a tidy format. Merging it with the PubMed data to generate the final report will be covered in a separate article.

The PDF from which this data has been extracted can be found here.

Step 1: Load Libraries

The first step is to load the required libraries: pdftools, which lets us read text from PDF files, and tidyverse, a collection of packages for data manipulation and visualization. The tidyverse includes stringr, tibble, and dplyr, which supply the str_* functions, tibbles, and pipeline verbs used throughout the script.

# Load libraries
library(pdftools)
library(tidyverse)

Step 2: Read PDF and Initialize Data Structures

Next, we need to extract the text from the PDF file. We’ll use the pdf_text function from the pdftools package to convert the file into a character vector where each page of the PDF is represented as an element in the vector. After this, we prepare for data extraction by defining the QUALIS levels and initializing an empty tibble to hold the extracted data.

# Read the PDF text into a character vector
data_text <- pdf_text("QUALIS CAPES 2024.pdf")

# Define the levels for QUALIS in order
qualis_levels <- c("A1", "A2", "A3", "A4", 
                   "B1", "B2", "B3", "B4", 
                   "C", "NP")

# Initialize an empty tibble to store results
data_tibble <- tibble(
        ISSN = character(),
        Journal = character(),
        QUALIS = factor(levels = qualis_levels, ordered = TRUE)  # Ordered factor
)

# We can see how the data currently looks by looking at the first page
head(data_text, 1)

## [1] "                                    QUALIS CAPES\n\n  ISSN                                   TITULO                 ESTRATO\n\n0149‐1423   AAPG BULLETIN (PRINT)                                 A1\n1069‐6563   ACADEMIC EMERGENCY MEDICINE                           A1\n1040‐2446   ACADEMIC MEDICINE                                     A1\n0001‐4575   ACCIDENT ANALYSIS AND PREVENTION                      A1\n0951‐3574   ACCOUNTING, AUDITING & ACCOUNTABILITY JOURNAL         A1\n0001‐4842   ACCOUNTS OF CHEMICAL RESEARCH                         A1\n0360‐0300   ACM COMPUTING SURVEYS                                 A1\n0734‐2071   ACM TRANSACTIONS ON COMPUTER SYSTEMS                  A1\n1946‐6226   ACM TRANSACTIONS ON COMPUTING EDUCATION               A1\n0730‐0301   ACM TRANSACTIONS ON GRAPHICS                          A1\n1046‐8188   ACM TRANSACTIONS ON INFORMATION SYSTEMS               A1\n1556‐4681   ACM TRANSACTIONS ON KNOWLEDGE DISCOVERY FROM DATA     A1\n1944‐8252   ACS APPLIED MATERIALS & INTERFACES (ONLINE)           A1\n2155‐5435   ACS CATALYSIS                                         A1\n2374‐7951   ACS CENTRAL SCIENCE (ONLINE)                          A1\n1554‐8929   ACS CHEMICAL BIOLOGY                                  A1\n1554‐8937   ACS CHEMICAL BIOLOGY                                  A1\n1948‐7193   ACS CHEMICAL NEUROSCIENCE                             A1\n2373‐8227   ACS INFECTIOUS DISEASES                               A1\n2161‐1653   ACS MACRO LETTERS                                     A1\n1936‐0851   ACS NANO                                              A1\n2379‐3694   ACS SENSORS                                           A1\n2168‐0485   ACS SUSTAINABLE CHEMISTRY & ENGINEERING               A1\n2161‐5063   ACS SYNTHETIC BIOLOGY                                 A1\n0094‐5765   ACTA ASTRONAUTICA                                     A1\n0001‐5237   ACTA ASTRONOMICA                                      A1\n1742‐7061   ACTA BIOMATERIALIA                                    A1\n2052‐5206   ACTA CRYSTALLOGRAPHICA SECTION B                      A1\n1359‐6454   ACTA MATERIALIA (OXFORD)                              A1\n0001‐5962   ACTA MATHEMATICA                                      A1\n0001‐6322   ACTA NEUROPATHOLOGICA                                 A1\n2051‐5960   ACTA NEUROPATHOLOGICA COMMUNICATIONS                  A1\n1745‐3674   ACTA ORTHOPAEDICA (PRINT)                             A1\n2211‐3835   ACTA PHARMACEUTICA SINICA B                           A1\n1748‐1716   ACTA PHYSIOLOGICA (ONLINE)                            A1\n0186‐6028   ACTA SOCIOLOGICA                                      A1\n0001‐706X   ACTA TROPICA                                          A1\n0335‐5322   ACTES DE LA RECHERCHE EN SCIENCES SOCIALES            A1\n2270‐4957   ACTES SÉMIOTIQUES (EN LIGNE)                          A1\n0965‐2140   ADDICTION (ABINGDON. PRINT)                           A1\n"
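
Because pdf_text returns one long string per page, with the rows separated by \n, it can help to look at a page line by line before writing any patterns. The following is a minimal sketch on a short hand-typed stand-in for a page; the `page` string is illustrative rather than read from the PDF, and the hyphen is assumed to be U+2010 HYPHEN.

```r
library(stringr)

# Hypothetical stand-in for one element of data_text (one page of the PDF)
page <- paste0(
  "                QUALIS CAPES\n\n",
  "  ISSN          TITULO          ESTRATO\n\n",
  "0149\u20101423   AAPG BULLETIN (PRINT)        A1\n",
  "1069\u20106563   ACADEMIC EMERGENCY MEDICINE  A1\n"
)

# Split the page at line breaks and drop the empty lines
page_lines <- str_split(page, "\n")[[1]]
page_lines <- page_lines[str_trim(page_lines) != ""]

# Show each remaining line with the surrounding whitespace trimmed
str_trim(page_lines)
```

Viewed this way, the title and the column header row stand out as the only lines that do not start with an ISSN.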

Step 3: Define Regular Expressions

With the PDF text loaded and data structures ready, it’s time to define the rules for extracting the data we need. We’ll use regular expressions to identify patterns for ISSNs and QUALIS grades in the text.

# Define ISSN regex
# A sequence of 4 digits, followed by the non-ASCII hyphen used in the PDF,
# followed by 3 digits, followed by a digit or a letter (the last character
# of an ISSN can be "X", as in 0001‐706X)
issn_pattern <- "\\d{4}‐\\d{3}[a-zA-Z0-9]"

# Define QUALIS regex
# Either A or B followed by a single digit from 1 to 4 OR
# C OR
# NP
# each matched only as a whole word, never in the middle of another word
qualis_pattern <- "\\b(?:[AB][1-4]|C|NP)\\b"
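
Before running the patterns over the whole document, we can try them on a single row. The sketch below uses hypothetical sample rows and writes the PDF's hyphen as an escape, assuming it is U+2010 HYPHEN. It also shows one limitation of the word-boundary approach: a title that contains a stand-alone "C" or "NP" produces an extra match.

```r
library(stringr)

# Patterns as defined above, with the hyphen written as an escape
# (assumed here to be U+2010 HYPHEN)
issn_pattern   <- "\\d{4}\u2010\\d{3}[a-zA-Z0-9]"
qualis_pattern <- "\\b(?:[AB][1-4]|C|NP)\\b"

# Hypothetical sample row in the layout shown on the first page
sample_row <- "0001\u2010706X   ACTA TROPICA                 A1"

found_issn  <- str_extract(sample_row, issn_pattern)    # ISSN, including the final "X"
found_grade <- str_extract(sample_row, qualis_pattern)  # the grade

# The word boundaries keep grades from matching inside longer words
inside_word <- str_detect("ACADEMIC MEDICINE", qualis_pattern)

# A title containing a stand-alone "C" still matches, so the first
# match in a row is not guaranteed to be the grade
all_matches <- str_extract_all(
  "0000\u20100000   SOME JOURNAL C       A2", qualis_pattern
)[[1]]

found_issn
found_grade
inside_word
all_matches
```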

Step 4: Clean and Prepare Text

Before extracting the data, we need to clean up the text. The first page opens with the document title and the column headers (ISSN, TITULO, ESTRATO), so we’ll locate the first occurrence of an ISSN on that page and remove everything before it.

# Find the position of the first ISSN on the first page
first_issn_index <- str_locate(data_text[1], issn_pattern)[1]

# Remove all content before the first ISSN
data_text[1] <- str_sub(data_text[1], start = first_issn_index)

# We can check how the first page looks now
head(data_text,1)

## [1] "0149‐1423   AAPG BULLETIN (PRINT)                                 A1\n1069‐6563   ACADEMIC EMERGENCY MEDICINE                           A1\n1040‐2446   ACADEMIC MEDICINE                                     A1\n0001‐4575   ACCIDENT ANALYSIS AND PREVENTION                      A1\n0951‐3574   ACCOUNTING, AUDITING & ACCOUNTABILITY JOURNAL         A1\n0001‐4842   ACCOUNTS OF CHEMICAL RESEARCH                         A1\n0360‐0300   ACM COMPUTING SURVEYS                                 A1\n0734‐2071   ACM TRANSACTIONS ON COMPUTER SYSTEMS                  A1\n1946‐6226   ACM TRANSACTIONS ON COMPUTING EDUCATION               A1\n0730‐0301   ACM TRANSACTIONS ON GRAPHICS                          A1\n1046‐8188   ACM TRANSACTIONS ON INFORMATION SYSTEMS               A1\n1556‐4681   ACM TRANSACTIONS ON KNOWLEDGE DISCOVERY FROM DATA     A1\n1944‐8252   ACS APPLIED MATERIALS & INTERFACES (ONLINE)           A1\n2155‐5435   ACS CATALYSIS                                         A1\n2374‐7951   ACS CENTRAL SCIENCE (ONLINE)                          A1\n1554‐8929   ACS CHEMICAL BIOLOGY                                  A1\n1554‐8937   ACS CHEMICAL BIOLOGY                                  A1\n1948‐7193   ACS CHEMICAL NEUROSCIENCE                             A1\n2373‐8227   ACS INFECTIOUS DISEASES                               A1\n2161‐1653   ACS MACRO LETTERS                                     A1\n1936‐0851   ACS NANO                                              A1\n2379‐3694   ACS SENSORS                                           A1\n2168‐0485   ACS SUSTAINABLE CHEMISTRY & ENGINEERING               A1\n2161‐5063   ACS SYNTHETIC BIOLOGY                                 A1\n0094‐5765   ACTA ASTRONAUTICA                                     A1\n0001‐5237   ACTA ASTRONOMICA                                      A1\n1742‐7061   ACTA BIOMATERIALIA                                    A1\n2052‐5206   ACTA CRYSTALLOGRAPHICA SECTION B                      A1\n1359‐6454   ACTA MATERIALIA (OXFORD)                              A1\n0001‐5962   ACTA MATHEMATICA                                      A1\n0001‐6322   ACTA NEUROPATHOLOGICA                                 A1\n2051‐5960   ACTA NEUROPATHOLOGICA COMMUNICATIONS                  A1\n1745‐3674   ACTA ORTHOPAEDICA (PRINT)                             A1\n2211‐3835   ACTA PHARMACEUTICA SINICA B                           A1\n1748‐1716   ACTA PHYSIOLOGICA (ONLINE)                            A1\n0186‐6028   ACTA SOCIOLOGICA                                      A1\n0001‐706X   ACTA TROPICA                                          A1\n0335‐5322   ACTES DE LA RECHERCHE EN SCIENCES SOCIALES            A1\n2270‐4957   ACTES SÉMIOTIQUES (EN LIGNE)                          A1\n0965‐2140   ADDICTION (ABINGDON. PRINT)                           A1\n"
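
To make the indexing in this step explicit, here is a small sketch of the same trimming on a toy string. The `page` value is hypothetical and the hyphen is again assumed to be U+2010. str_locate returns a one-row matrix with a start and an end column for the first match, which is why `[1]` picks out the start position.

```r
library(stringr)

issn_pattern <- "\\d{4}\u2010\\d{3}[a-zA-Z0-9]"

# Hypothetical first page: a title and a header row, then the data
page <- paste0(
  "   QUALIS CAPES\n\n  ISSN   TITULO   ESTRATO\n\n",
  "0149\u20101423   AAPG BULLETIN (PRINT)   A1\n"
)

# Position of the first ISSN: a matrix with "start" and "end" columns
loc <- str_locate(page, issn_pattern)
loc

# Keep everything from the first ISSN onwards
trimmed <- str_sub(page, start = loc[1])
trimmed
```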

Step 5: Extract Data from Each Page

This step extracts the ISSN, journal name, and QUALIS grade of every entry on each page of the PDF. For each page, we collapse the text into a single line of space-separated words, locate every ISSN and every QUALIS grade, and then walk through the page, treating the text that runs from one ISSN to the next grade as one record.

# Iterate across all pages
for(i in seq_along(data_text)){
        # collapse every run of whitespace (including line breaks) into a
        # single space, so each page becomes one line of text
        data_text[i] <- str_squish(data_text[i])
        
        # get the start and end positions of every ISSN on the page
        issnPage <- as_tibble(str_locate_all(data_text[i],issn_pattern)[[1]])
        
        # get the start and end positions of every QUALIS grade on the page
        qualisPage <- as_tibble(str_locate_all(data_text[i], qualis_pattern)[[1]])
        
        # creating an index to track our position in the page
        # it starts at 0 because a page that begins with an ISSN has it at
        # position 1 after str_squish(), and the filter below only keeps
        # ISSNs with start > indexPage
        indexPage <- 0
        
        # keep going until we have passed the last QUALIS grade on the page
        while(indexPage < max(qualisPage$end)){
                # determine the index of the next issn
                issnPage %>% filter(start > indexPage) %>% 
                        slice_min(start, n=1) %>% pull(start) -> startNextIssn
                
                # determine the index for the end of the next QUALIS
                qualisPage %>% filter(start > startNextIssn) %>%
                        slice_min(start, n=1) %>% pull(end) -> endNextQualis
                
                # set the index to be the same as the endNextQualis, as this
                # is where we are in the page
                
                indexPage <- endNextQualis
                
                # extract the information between these two indexes
                str_sub(data_text[i], start = startNextIssn,
                        end = endNextQualis) -> currentLine
                
                # extract the issn from the current line
                str_sub(currentLine, end = 9) -> currentIssn
                
                # extract the QUALIS from the current line using regex
                currentQUALIS <- factor(str_extract(currentLine, qualis_pattern),
                                        levels = qualis_levels,
                                        ordered = TRUE)
                
                # number of characters in the grade (1 for "C", 2 otherwise),
                # needed to find where the journal name ends
                qualis_length <- str_length(as.character(currentQUALIS))
                
                # extract the journal name
                currentJournal <- str_sub(currentLine,
                                          start = 11, # Journal name always starts here
                                          end = str_length(currentLine) - qualis_length - 1)
                
                # Clean the journal name to remove "(ONLINE)" and "(PRINT)", case-insensitive
                currentJournal <- str_remove_all(currentJournal,
                                                 "(?i)\\s*\\((ONLINE|PRINT)\\)")
                
                # add this data to the tibble
                data_tibble %>% add_row(ISSN = currentIssn,
                                        Journal = currentJournal,
                                        QUALIS = currentQUALIS) -> data_tibble
        }
}

# Apply a manual correction to one abbreviated journal title
# (the "(ONLINE)" and "(PRINT)" tags were already removed inside the loop)
data_tibble <- data_tibble %>%
  mutate(
    Journal = case_when(
      Journal == "INTERN EMERG MED" ~ "Internal and Emergency Medicine",  # Manual fix
      TRUE ~ Journal  # Keep all other journal names as they are
    )
  )

# We can look at the final result by printing the tibble

data_tibble

## # A tibble: 21,564 × 3
##    ISSN      Journal                                       QUALIS
##    <chr>     <chr>                                         <ord> 
##  1 1069‐6563 ACADEMIC EMERGENCY MEDICINE                   A1    
##  2 1040‐2446 ACADEMIC MEDICINE                             A1    
##  3 0001‐4575 ACCIDENT ANALYSIS AND PREVENTION              A1    
##  4 0951‐3574 ACCOUNTING, AUDITING & ACCOUNTABILITY JOURNAL A1    
##  5 0001‐4842 ACCOUNTS OF CHEMICAL RESEARCH                 A1    
##  6 0360‐0300 ACM COMPUTING SURVEYS                         A1    
##  7 0734‐2071 ACM TRANSACTIONS ON COMPUTER SYSTEMS          A1    
##  8 1946‐6226 ACM TRANSACTIONS ON COMPUTING EDUCATION       A1    
##  9 0730‐0301 ACM TRANSACTIONS ON GRAPHICS                  A1    
## 10 1046‐8188 ACM TRANSACTIONS ON INFORMATION SYSTEMS       A1    
## # ℹ 21,554 more rows
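
An alternative to walking through each page with an index is to split the page into lines and match each line against a single pattern. This is a sketch, not the method used above. It assumes that every record sits on one line (a title wrapped over two lines would be missed) and that the hyphen is U+2010. The `page` string and the names `row_pattern`, `parts`, and `sketch` are hypothetical. Anchoring the grade at the end of the line means a title ending in a stand-alone "C" is not mistaken for a grade.

```r
library(stringr)
library(tibble)

# Hypothetical page text in the same layout as the PDF
page <- paste0(
  "0149\u20101423   AAPG BULLETIN (PRINT)              A1\n",
  "0000\u20100000   SOME JOURNAL C                     B2\n",
  "0001\u2010706X   ACTA TROPICA                       A1\n"
)

# One record per line: ISSN at the start, grade at the very end,
# and whatever lies between them is the title
row_pattern <- paste0(
  "^(\\d{4}\u2010\\d{3}[0-9A-Za-z])\\s+",
  "(.*?)\\s+",
  "([AB][1-4]|C|NP)\\s*$"
)

page_lines <- str_split(page, "\n")[[1]]

# str_match() returns the full match plus one column per capture group
parts <- str_match(page_lines, row_pattern)
parts <- parts[!is.na(parts[, 1]), , drop = FALSE]  # drop non-matching lines

sketch <- tibble(
  ISSN    = parts[, 2],
  Journal = str_remove_all(parts[, 3], "(?i)\\s*\\((ONLINE|PRINT)\\)"),
  QUALIS  = parts[, 4]
)
sketch
```

Building the tibble once from the matrix of matches also avoids growing it row by row, which tends to be slow over tens of thousands of records.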

Step 6: Normalize ISSN Hyphens

Finally, to ensure data consistency, we replace the non-standard hyphens in the ISSNs with standard ASCII hyphens.

# I noticed that the hyphen ("‐") that separates the digits in the ISSN is not
# the ASCII hyphen ("-"), so these ISSNs would not match ASCII-hyphenated
# ISSNs from another source when the tibble is merged

# replacing it with the ASCII hyphen
data_tibble %>% mutate(ISSN = str_replace_all(ISSN, "‐","-")) -> qualis_data

# Removing all variables that don't have any purpose now
rm(list=setdiff(ls(),"qualis_data"))

# saving the data
saveRDS(qualis_data, "qualis_data.rds")

# printing the final data
qualis_data

## # A tibble: 21,564 × 3
##    ISSN      Journal                                       QUALIS
##    <chr>     <chr>                                         <ord> 
##  1 1069-6563 ACADEMIC EMERGENCY MEDICINE                   A1    
##  2 1040-2446 ACADEMIC MEDICINE                             A1    
##  3 0001-4575 ACCIDENT ANALYSIS AND PREVENTION              A1    
##  4 0951-3574 ACCOUNTING, AUDITING & ACCOUNTABILITY JOURNAL A1    
##  5 0001-4842 ACCOUNTS OF CHEMICAL RESEARCH                 A1    
##  6 0360-0300 ACM COMPUTING SURVEYS                         A1    
##  7 0734-2071 ACM TRANSACTIONS ON COMPUTER SYSTEMS          A1    
##  8 1946-6226 ACM TRANSACTIONS ON COMPUTING EDUCATION       A1    
##  9 0730-0301 ACM TRANSACTIONS ON GRAPHICS                  A1    
## 10 1046-8188 ACM TRANSACTIONS ON INFORMATION SYSTEMS       A1    
## # ℹ 21,554 more rows
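
The two hyphens look identical when printed, so it is worth confirming which character a PDF actually uses. The sketch below assumes the PDF's hyphen is U+2010 HYPHEN and uses a hypothetical `pubmed_issn` value written with the ASCII hyphen. utf8ToInt (base R) returns the code point of a character, which makes the difference visible.

```r
library(stringr)

pdf_issn    <- "0001\u20104575"  # hyphen as assumed to appear in the PDF (U+2010)
pubmed_issn <- "0001-4575"       # hypothetical value with the ASCII hyphen

# The strings print alike but do not compare equal
equal_before <- pdf_issn == pubmed_issn

# Code points of the fifth character of each string
pdf_code   <- utf8ToInt(str_sub(pdf_issn, 5, 5))
ascii_code <- utf8ToInt(str_sub(pubmed_issn, 5, 5))

# After normalising, the values match and can serve as a join key
equal_after <- str_replace_all(pdf_issn, "\u2010", "-") == pubmed_issn

c(before = equal_before, after = equal_after)
c(pdf = pdf_code, ascii = ascii_code)
```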

The data is now tidy and ready to be used in further applications!
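
As one example of such use, storing QUALIS as an ordered factor lets grades be compared directly. The sketch below uses a hypothetical three-row tibble shaped like qualis_data; the ISSNs and journal names are made up for illustration.

```r
library(dplyr)
library(tibble)

qualis_levels <- c("A1", "A2", "A3", "A4",
                   "B1", "B2", "B3", "B4",
                   "C", "NP")

# Hypothetical rows shaped like qualis_data
toy <- tibble(
  ISSN    = c("0000-0001", "0000-0002", "0000-0003"),
  Journal = c("JOURNAL ONE", "JOURNAL TWO", "JOURNAL THREE"),
  QUALIS  = factor(c("A1", "B2", "A3"), levels = qualis_levels, ordered = TRUE)
)

# Comparisons follow the level order: "A1" is the first level,
# so `<= "A4"` keeps only journals in the A stratum
a_stratum <- toy %>% filter(QUALIS <= "A4")
a_stratum

# Count journals per grade, listed in level order
toy %>% count(QUALIS)
```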