Introduction

This document describes how data extracted from PubMed is merged with, and enriched by, QUALIS classification data obtained from a PDF. The main goal is to assign a QUALIS grade to each journal in the PubMed dataset, first by matching on ISSN and, when no direct match is found, by fuzzy matching on journal titles. The report walks through each step of the process.

Step 1: Load Required Libraries

The first step is to load the necessary libraries. The tidyverse package is used for data manipulation and cleaning, while the stringdist package is employed for fuzzy string matching. We also load the pre-saved qualis_data object from an RDS file, which contains the QUALIS grades extracted from the PDF.

The process of generating qualis_data was detailed here.

# Loading required libraries
library(tidyverse)
library(stringdist)

# Load pre-saved qualis_data object
qualis_data <- readRDS("qualis_data.rds")

Step 2: Load the PubMed Data

Here, we load the dataset containing information about the researcher’s publications from PubMed. This file includes variables such as ISSN, journal title, and other metadata about the publications.

# Reading in data extracted from PUBMED
pubmed_data <- read_csv("Maria Auxiliadora Parreiras Martins.csv",
                        show_col_types = FALSE)

Step 3: Merge QUALIS Data Using ISSN

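In this step, we attempt to match and merge the QUALIS grades from qualis_data directly into pubmed_data, using the ISSN column as the key. This join captures every journal whose ISSN appears identically in both datasets. After the join, secondary titles that follow = are stripped from the Publication Title column, which prepares the titles for the fuzzy matching in Step 6.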

# Perform a left join to add the QUALIS variable to pubmed_data
pubmed_data <- pubmed_data %>%
        left_join(qualis_data %>% select(ISSN, QUALIS), by = "ISSN")

# Remove the secondary title that follows "=" in the pubmed_data journal titles
pubmed_data <- pubmed_data %>% 
        mutate(`Publication Title` = str_remove(`Publication Title`, "\\s*=.*$"))

# Display the first 36 rows to inspect the merge
pubmed_data %>% select(`Publication Title`, ISSN, QUALIS) %>% print(n = 36)
## # A tibble: 36 × 3
##    `Publication Title`                                              ISSN  QUALIS
##    <chr>                                                            <chr> <ord> 
##  1 International Journal of Cardiology                              1874… <NA>  
##  2 Journal of Public Health Dentistry                               1752… A3    
##  3 Frontiers in Pharmacology                                        1663… A1    
##  4 Tropical medicine & international health: TM & IH                1365… A1    
##  5 International Journal of Clinical Pharmacy                       2210… <NA>  
##  6 International Journal of Environmental Research and Public Heal… 1660… A2    
##  7 International Journal of Medical Informatics                     1872… <NA>  
##  8 Pharmacogenomics                                                 1744… <NA>  
##  9 PloS One                                                         1932… A1    
## 10 Journal of Medical Internet Research                             1438… A1    
## 11 Arquivos Brasileiros De Cardiologia                              1678… <NA>  
## 12 Brazilian Oral Research                                          1807… A2    
## 13 Archives of Gerontology and Geriatrics                           1872… <NA>  
## 14 Frontiers in Pharmacology                                        1663… A1    
## 15 Arquivos Brasileiros De Cardiologia                              1678… <NA>  
## 16 Oral Diseases                                                    1601… <NA>  
## 17 Research in social & administrative pharmacy: RSAP               1934… <NA>  
## 18 Revista Do Instituto De Medicina Tropical De Sao Paulo           1678… B1    
## 19 British Journal of Clinical Pharmacology                         1365… <NA>  
## 20 Journal of Pharmaceutical and Biomedical Analysis                1873… <NA>  
## 21 Internal and Emergency Medicine                                  1970… <NA>  
## 22 Exploratory Research in Clinical and Social Pharmacy             2667… <NA>  
## 23 BMC primary care                                                 2731… <NA>  
## 24 Current Medical Research and Opinion                             1473… <NA>  
## 25 Biomedicine & Pharmacotherapy                                    1950… <NA>  
## 26 European Journal of Clinical Pharmacology                        1432… <NA>  
## 27 Research in social & administrative pharmacy: RSAP               1934… <NA>  
## 28 JBI evidence synthesis                                           2689… <NA>  
## 29 Einstein (Sao Paulo, Brazil)                                     2317… <NA>  
## 30 Journal of Cranio-Maxillo-Facial Surgery: Official Publication … 1878… <NA>  
## 31 British Journal of Clinical Pharmacology                         1365… <NA>  
## 32 International Journal of Environmental Research and Public Heal… 1660… A2    
## 33 Journal of Thrombosis and Thrombolysis                           1573… <NA>  
## 34 Patient Education and Counseling                                 1873… <NA>  
## 35 JBI evidence synthesis                                           2689… <NA>  
## 36 International Journal of Clinical Pharmacy                       2210… <NA>
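
One possible reason for a missed ISSN match is formatting: stray spaces, a missing hyphen, or a lowercase check character. Another is that a journal can have separate print and electronic ISSNs, so the two sources may simply list different ones. Whether either applies to this dataset is an assumption, not something verified here. The sketch below shows a hypothetical helper, normalize_issn (not part of the original analysis), that could be applied to the ISSN column of both datasets before the join. The third input is a made-up ISSN used only for illustration.

```r
# Hypothetical helper: bring ISSNs to the canonical "NNNN-NNNC" form
normalize_issn <- function(x) {
  # Keep only digits and the check character X, in uppercase
  x <- toupper(gsub("[^0-9Xx]", "", x))
  # A valid ISSN has exactly 8 characters; anything else becomes NA
  ifelse(nchar(x) == 8,
         paste0(substr(x, 1, 4), "-", substr(x, 5, 8)),
         NA_character_)
}

issn_clean <- normalize_issn(c("1874-1754", " 18741754", "1234-567x", "n/a"))
issn_clean
```

Normalizing the format would not help when the two sources list different ISSNs for the same journal, which is why title matching is still needed as a fallback.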

Step 4: Identify Unmatched Rows

Next, we inspect rows in pubmed_data where the QUALIS variable remains NA. These rows indicate that no direct match was found using ISSN, so we will need to attempt fuzzy matching based on journal titles.

# Check the rows with NA QUALIS to identify unmatched ISSNs
pubmed_data %>% filter(is.na(QUALIS)) %>% select(`Publication Title`, ISSN, QUALIS)
## # A tibble: 26 × 3
##    `Publication Title`                                ISSN      QUALIS
##    <chr>                                              <chr>     <ord> 
##  1 International Journal of Cardiology                1874-1754 <NA>  
##  2 International Journal of Clinical Pharmacy         2210-7711 <NA>  
##  3 International Journal of Medical Informatics       1872-8243 <NA>  
##  4 Pharmacogenomics                                   1744-8042 <NA>  
##  5 Arquivos Brasileiros De Cardiologia                1678-4170 <NA>  
##  6 Archives of Gerontology and Geriatrics             1872-6976 <NA>  
##  7 Arquivos Brasileiros De Cardiologia                1678-4170 <NA>  
##  8 Oral Diseases                                      1601-0825 <NA>  
##  9 Research in social & administrative pharmacy: RSAP 1934-8150 <NA>  
## 10 British Journal of Clinical Pharmacology           1365-2125 <NA>  
## # ℹ 16 more rows
# Filter rows with NA in QUALIS for further processing
unmatched <- pubmed_data %>% filter(is.na(QUALIS))
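
Because each row is a publication, the same unmatched journal can appear several times. A minimal sketch, on made-up rows, of how the unmatched titles could be summarized so that each journal is reviewed only once. The toy data and the column name n_publications are illustrative assumptions, not part of the original analysis.

```r
library(dplyr)

# Toy stand-in for pubmed_data after the ISSN join
pubs <- tibble(
  title  = c("JBI evidence synthesis", "PloS One",
             "JBI evidence synthesis", "Oral Diseases"),
  QUALIS = c(NA, "A1", NA, NA)
)

# One row per unmatched journal, with the number of affected publications
unmatched_titles <- pubs %>%
  filter(is.na(QUALIS)) %>%
  count(title, name = "n_publications", sort = TRUE)

unmatched_titles
```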

Step 5: Normalize Text for Fuzzy Matching

Journal titles in the two datasets may differ in capitalization, formatting, or additional text (e.g., secondary titles after =), so the titles are normalized before matching:

  1. Text after = (e.g., secondary titles) is removed from the Publication Title column. This was already done in Step 3, right after the ISSN join.

  2. All titles are converted to uppercase so that matching is case-insensitive.

# Normalize text in unmatched
unmatched <- unmatched %>%
        # Convert to uppercase
        mutate(Publication_Title_Upper = toupper(`Publication Title`))

# Normalize text in qualis_data (only convert to uppercase, as other cleaning was done earlier)
qualis_data <- qualis_data %>%
        # Convert to uppercase
        mutate(Journal_Upper = toupper(Journal))
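
If uppercase conversion alone leaves too many near misses, punctuation and spacing could also be normalized. The sketch below is one way to do that in base R; normalize_title is a hypothetical helper, not the function used in this report, and it does not attempt to handle accented characters.

```r
# Hypothetical helper: stricter title normalization than toupper() alone
normalize_title <- function(x) {
  x <- sub("\\s*=.*$", "", x)        # drop a secondary title after "="
  x <- toupper(x)                    # case-insensitive comparison
  x <- gsub("[^A-Z0-9 ]+", " ", x)   # replace punctuation and symbols with a space
  trimws(gsub(" +", " ", x))         # collapse repeated spaces and trim the ends
}

clean_titles <- normalize_title(c(
  "Research in social & administrative pharmacy: RSAP",
  "Revista X = Journal X"            # made-up title with a secondary name
))
clean_titles
```

The same function would have to be applied to both datasets, otherwise the normalization itself would introduce differences between them.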

Step 6: Perform Fuzzy Matching on Journal Titles

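For rows where the ISSN did not produce a match, we use fuzzy matching to find the most similar journal title in qualis_data. String distances are computed with the "jw" method of stringdist, and the title with the smallest distance is selected for each unmatched journal. Distances range from 0 (identical strings) to 1 (no similarity). Note that with the default p = 0, as used here, method = "jw" returns the plain Jaro distance; the Winkler prefix adjustment only applies when p is set above 0.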

# Perform fuzzy matching
fuzzy_results <- unmatched %>%
        rowwise() %>%
        mutate(
                best_match = qualis_data$Journal[which.min(stringdist(Publication_Title_Upper, qualis_data$Journal_Upper, method = "jw"))],
                best_distance = min(stringdist(Publication_Title_Upper, qualis_data$Journal_Upper, method = "jw")),
                QUALIS_from_pdf = qualis_data$QUALIS[which.min(stringdist(Publication_Title_Upper, qualis_data$Journal_Upper, method = "jw"))]
        ) %>%
        ungroup()

# Inspect matches with distances
fuzzy_results %>%
        select(`Publication Title`, best_match, best_distance) %>%
        arrange(best_distance) %>% print(n = 26)
## # A tibble: 26 × 3
##    `Publication Title`                                  best_match best_distance
##    <chr>                                                <chr>              <dbl>
##  1 International Journal of Cardiology                  INTERNATI…        0     
##  2 International Journal of Clinical Pharmacy           INTERNATI…        0     
##  3 International Journal of Medical Informatics         INTERNATI…        0     
##  4 Archives of Gerontology and Geriatrics               ARCHIVES …        0     
##  5 Oral Diseases                                        ORAL DISE…        0     
##  6 British Journal of Clinical Pharmacology             BRITISH J…        0     
##  7 Journal of Pharmaceutical and Biomedical Analysis    JOURNAL O…        0     
##  8 Internal and Emergency Medicine                      Internal …        0     
##  9 Current Medical Research and Opinion                 CURRENT M…        0     
## 10 Biomedicine & Pharmacotherapy                        BIOMEDICI…        0     
## 11 European Journal of Clinical Pharmacology            EUROPEAN …        0     
## 12 British Journal of Clinical Pharmacology             BRITISH J…        0     
## 13 Journal of Thrombosis and Thrombolysis               JOURNAL O…        0     
## 14 Patient Education and Counseling                     PATIENT E…        0     
## 15 International Journal of Clinical Pharmacy           INTERNATI…        0     
## 16 Research in social & administrative pharmacy: RSAP   RESEARCH …        0.0400
## 17 Research in social & administrative pharmacy: RSAP   RESEARCH …        0.0400
## 18 Arquivos Brasileiros De Cardiologia                  ARQUIVOS …        0.0797
## 19 Arquivos Brasileiros De Cardiologia                  ARQUIVOS …        0.0797
## 20 Pharmacogenomics                                     PHARMACOG…        0.111 
## 21 Einstein (Sao Paulo, Brazil)                         EINSTEIN …        0.159 
## 22 Exploratory Research in Clinical and Social Pharmacy EXPERIMEN…        0.241 
## 23 Journal of Cranio-Maxillo-Facial Surgery: Official … JOURNAL O…        0.248 
## 24 BMC primary care                                     BMC GERIA…        0.265 
## 25 JBI evidence synthesis                               ISBT SCIE…        0.276 
## 26 JBI evidence synthesis                               ISBT SCIE…        0.276
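
The chunk above evaluates stringdist three times for every row. The sketch below, on hypothetical titles, first illustrates the effect of the p argument on a classic pair of strings and then shows one possible way to compute all distances once with stringdistmatrix and reuse them. The vectors titles and reference are made up for illustration; this is an alternative under those assumptions, not the code that produced the results in this report.

```r
library(stringdist)

# With the default p = 0, method = "jw" returns the plain Jaro distance
d_jaro <- stringdist("MARTHA", "MARHTA", method = "jw")

# Setting p > 0 (at most 0.25) rewards a shared prefix (Winkler's adjustment)
d_jw <- stringdist("MARTHA", "MARHTA", method = "jw", p = 0.1)

# Hypothetical titles: compute all pairwise distances once, then reuse them
titles    <- c("ORAL DISEASES", "BMC PRIMARY CARE")
reference <- c("ORAL DISEASES", "JOURNAL OF EXAMPLES")

# Rows correspond to titles, columns to reference
dist_mat <- stringdistmatrix(titles, reference, method = "jw")

# Index of the closest reference title for each row
best_idx <- apply(dist_mat, 1, which.min)

# Distance to, and name of, that closest title
best_distance <- dist_mat[cbind(seq_along(titles), best_idx)]
best_match    <- reference[best_idx]

data.frame(titles, best_match, best_distance)
```

Every title receives a closest match, however poor, so the distance still has to be checked before a grade is accepted.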

Step 7: Filter and Save Good Matches

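Next, we filter the fuzzy matching results to keep only matches with a high degree of confidence, here a string distance below 0.16. In the table above, this cutoff falls in the gap between the last plausible match (0.159) and the first clearly wrong one (0.241, where "Exploratory Research in Clinical and Social Pharmacy" was paired with a title beginning "EXPERIMEN…"). The good matches are displayed for inspection and are merged back into the main dataset in Step 8.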

# Filter for good matches
good_matches <- fuzzy_results %>%
        filter(best_distance < 0.16) %>%
        select(`Publication Title`, best_match, QUALIS_from_pdf) %>%
        distinct()  # Remove duplicate rows

good_matches
## # A tibble: 17 × 3
##    `Publication Title`                                best_match QUALIS_from_pdf
##    <chr>                                              <chr>      <ord>          
##  1 International Journal of Cardiology                INTERNATI… A3             
##  2 International Journal of Clinical Pharmacy         INTERNATI… A4             
##  3 International Journal of Medical Informatics       INTERNATI… A2             
##  4 Pharmacogenomics                                   PHARMACOG… A2             
##  5 Arquivos Brasileiros De Cardiologia                ARQUIVOS … B1             
##  6 Archives of Gerontology and Geriatrics             ARCHIVES … A1             
##  7 Oral Diseases                                      ORAL DISE… A2             
##  8 Research in social & administrative pharmacy: RSAP RESEARCH … A3             
##  9 British Journal of Clinical Pharmacology           BRITISH J… A1             
## 10 Journal of Pharmaceutical and Biomedical Analysis  JOURNAL O… A2             
## 11 Internal and Emergency Medicine                    Internal … A2             
## 12 Current Medical Research and Opinion               CURRENT M… A1             
## 13 Biomedicine & Pharmacotherapy                      BIOMEDICI… A2             
## 14 European Journal of Clinical Pharmacology          EUROPEAN … A3             
## 15 Einstein (Sao Paulo, Brazil)                       EINSTEIN … B1             
## 16 Journal of Thrombosis and Thrombolysis             JOURNAL O… A2             
## 17 Patient Education and Counseling                   PATIENT E… A2
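
A single cutoff treats an exact title match and a match at 0.159 the same way. One possible refinement, sketched below, is to label each match so that non-exact matches under the cutoff can be checked by eye. The function tier_match and the tier names are hypothetical; the distances are taken from the table in Step 6.

```r
# Hypothetical helper: label matches by how much scrutiny they need
tier_match <- function(distance, accept = 0.16) {
  ifelse(distance == 0, "exact",
         ifelse(distance < accept, "review", "reject"))
}

tiers <- tier_match(c(0, 0.0400, 0.159, 0.241))
tiers
```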

Step 8: Merge QUALIS_from_pdf into pubmed_data

Now that good_matches contains only unique rows and the problematic matches have been excluded, we can merge the QUALIS_from_pdf variable back into the pubmed_data dataset.

The merging is performed using the Publication Title column as the key. After the merge:

  1. Missing values (NA) in the QUALIS column of pubmed_data will be filled with corresponding values from QUALIS_from_pdf.

  2. Unnecessary columns, such as QUALIS_from_pdf, will be removed after the merge to keep the dataset clean.

The resulting pubmed_data dataset will now include the updated QUALIS values, incorporating matches from both the ISSN-based join and the fuzzy title matching process.

# Merge the variable QUALIS_from_pdf from good_matches into pubmed_data
pubmed_data <- pubmed_data %>%
        # Join on Publication Title
        left_join(
                # Selecting only the necessary variables
                good_matches %>% select(`Publication Title`,
                                        QUALIS_from_pdf),
                by = "Publication Title") %>%
        # Fill missing QUALIS with matched values
        mutate(QUALIS = coalesce(QUALIS, QUALIS_from_pdf)) %>%
        # Remove temporary column
        select(-QUALIS_from_pdf)

# Check whether any QUALIS values are still NA at this point
pubmed_data %>% select(`Publication Title`, ISSN, QUALIS) %>% filter(is.na(QUALIS))
## # A tibble: 5 × 3
##   `Publication Title`                                               ISSN  QUALIS
##   <chr>                                                             <chr> <ord> 
## 1 Exploratory Research in Clinical and Social Pharmacy              2667… <NA>  
## 2 BMC primary care                                                  2731… <NA>  
## 3 JBI evidence synthesis                                            2689… <NA>  
## 4 Journal of Cranio-Maxillo-Facial Surgery: Official Publication o… 1878… <NA>  
## 5 JBI evidence synthesis                                            2689… <NA>

Some NA values remain: I was not able to find a QUALIS grade for these publications, even after searching manually on the web, so they will remain without one.
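
The join-then-coalesce pattern used in this step can be seen in isolation on a toy example. The data below is made up, and the duplicate-key check is an added safeguard that is an assumption on my part rather than part of the original code: if a title appeared twice in the lookup table, the left join would duplicate publication rows.

```r
library(dplyr)

# Toy stand-ins for pubmed_data and good_matches
pubs <- tibble(
  title  = c("Journal A", "Journal B", "Journal C"),
  QUALIS = c("A1", NA, NA)
)
matches <- tibble(title = "Journal B", QUALIS_from_pdf = "A2")

# Guard against duplicated keys, which would multiply rows in the join
stopifnot(!anyDuplicated(matches$title))

filled <- pubs %>%
  left_join(matches, by = "title") %>%
  # Keep the existing grade; fall back to the fuzzy-matched one only when it is NA
  mutate(QUALIS = coalesce(QUALIS, QUALIS_from_pdf)) %>%
  select(-QUALIS_from_pdf)

filled
```

Journal A keeps its original grade, Journal B receives the matched one, and Journal C stays NA because it has no entry in the lookup table.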

Step 9: Save the Final Dataset

To ensure the processed pubmed_data dataset is available for the next step, we save it as a .rds file for easy access.

# Save the dataset as an RDS file
saveRDS(pubmed_data, "pubmed_data_with_qualis.rds")
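
One advantage of the RDS format over a CSV file is that column types survive the round trip, including the ordered factor used for QUALIS. A minimal sketch with a temporary file; the toy data frame and the level set are illustrative assumptions.

```r
# Toy data with an ordered factor, similar in type to the QUALIS column
grades <- factor(c("A1", "B1", NA),
                 levels = c("A1", "A2", "A3", "A4", "B1"),
                 ordered = TRUE)
toy <- data.frame(title = c("x", "y", "z"), QUALIS = grades)

# Write to a temporary file and read it back
path <- tempfile(fileext = ".rds")
saveRDS(toy, path)
restored <- readRDS(path)

# The restored object is identical, ordered factor included
identical(toy, restored)
is.ordered(restored$QUALIS)
```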

With this done, all that remains is to write the scientific report itself, which will be detailed in the next document.