This document details the process of merging and enriching data extracted from PubMed with QUALIS classification data, obtained from a PDF. The main goal is to assign QUALIS grades to journals listed in the PubMed dataset based on their ISSNs or, when no direct match is found, using fuzzy matching on journal titles. This report will guide you through each step of the process.
The first step is to load the necessary libraries. The
tidyverse package is used for data manipulation and
cleaning, while the stringdist package is employed for
fuzzy string matching. We also load the pre-saved
qualis_data object from an RDS file, which
contains the QUALIS grades extracted from the PDF.
The process of generating qualis_data is described in a separate document.
# Loading required libraries
library(tidyverse)
library(stringdist)
# Load pre-saved qualis_data object
qualis_data <- readRDS("qualis_data.rds")
Here, we load the dataset containing information about the researcher’s publications from PubMed. This file includes variables such as ISSN, journal title, and other metadata about the publications.
# Read the data extracted from PubMed
pubmed_data <- read_csv("Maria Auxiliadora Parreiras Martins.csv",
                        show_col_types = FALSE)
In this step, we attempt to directly match and merge the QUALIS
grades from the qualis_data dataset into the
pubmed_data dataset using the ISSN column as a key. This
straightforward join should capture all journals with directly matching
ISSNs.
# Perform a left join to add the QUALIS variable to pubmed_data
pubmed_data <- pubmed_data %>%
  left_join(qualis_data %>% select(ISSN, QUALIS), by = "ISSN")
# Remove secondary titles: drop everything from "=" onward in the PubMed journal titles
pubmed_data <- pubmed_data %>%
  mutate(`Publication Title` = str_remove(`Publication Title`, "\\s*=.*$"))
# Display the first 36 rows to inspect the merge
pubmed_data %>% select(`Publication Title`, ISSN, QUALIS) %>% print(n = 36)
## # A tibble: 36 × 3
## `Publication Title` ISSN QUALIS
## <chr> <chr> <ord>
## 1 International Journal of Cardiology 1874… <NA>
## 2 Journal of Public Health Dentistry 1752… A3
## 3 Frontiers in Pharmacology 1663… A1
## 4 Tropical medicine & international health: TM & IH 1365… A1
## 5 International Journal of Clinical Pharmacy 2210… <NA>
## 6 International Journal of Environmental Research and Public Heal… 1660… A2
## 7 International Journal of Medical Informatics 1872… <NA>
## 8 Pharmacogenomics 1744… <NA>
## 9 PloS One 1932… A1
## 10 Journal of Medical Internet Research 1438… A1
## 11 Arquivos Brasileiros De Cardiologia 1678… <NA>
## 12 Brazilian Oral Research 1807… A2
## 13 Archives of Gerontology and Geriatrics 1872… <NA>
## 14 Frontiers in Pharmacology 1663… A1
## 15 Arquivos Brasileiros De Cardiologia 1678… <NA>
## 16 Oral Diseases 1601… <NA>
## 17 Research in social & administrative pharmacy: RSAP 1934… <NA>
## 18 Revista Do Instituto De Medicina Tropical De Sao Paulo 1678… B1
## 19 British Journal of Clinical Pharmacology 1365… <NA>
## 20 Journal of Pharmaceutical and Biomedical Analysis 1873… <NA>
## 21 Internal and Emergency Medicine 1970… <NA>
## 22 Exploratory Research in Clinical and Social Pharmacy 2667… <NA>
## 23 BMC primary care 2731… <NA>
## 24 Current Medical Research and Opinion 1473… <NA>
## 25 Biomedicine & Pharmacotherapy 1950… <NA>
## 26 European Journal of Clinical Pharmacology 1432… <NA>
## 27 Research in social & administrative pharmacy: RSAP 1934… <NA>
## 28 JBI evidence synthesis 2689… <NA>
## 29 Einstein (Sao Paulo, Brazil) 2317… <NA>
## 30 Journal of Cranio-Maxillo-Facial Surgery: Official Publication … 1878… <NA>
## 31 British Journal of Clinical Pharmacology 1365… <NA>
## 32 International Journal of Environmental Research and Public Heal… 1660… A2
## 33 Journal of Thrombosis and Thrombolysis 1573… <NA>
## 34 Patient Education and Counseling 1873… <NA>
## 35 JBI evidence synthesis 2689… <NA>
## 36 International Journal of Clinical Pharmacy 2210… <NA>
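Before relying on an ISSN join, it can help to confirm that the lookup table has at most one row per ISSN, since duplicated keys would duplicate publications in a left join. The following is a minimal sketch on made-up data: the toy tibbles pubs and lookup, their ISSNs, and their titles are assumptions for illustration, not values from the actual files.

```r
library(dplyr)

# Hypothetical toy data standing in for pubmed_data and qualis_data
pubs <- tibble(
  `Publication Title` = c("Journal A", "Journal B", "Journal A"),
  ISSN = c("1111-1111", "2222-2222", "1111-1111")
)
lookup <- tibble(
  ISSN = c("1111-1111", "3333-3333"),
  QUALIS = c("A1", "B2")
)

# Check that each ISSN appears only once in the lookup table
dup_keys <- lookup %>% count(ISSN) %>% filter(n > 1)

# Left join keeps every publication; unmatched ISSNs get NA
joined <- pubs %>% left_join(lookup, by = "ISSN")

# Summarise how many rows were matched by ISSN
match_summary <- joined %>% count(matched = !is.na(QUALIS))
print(match_summary)
```

If dup_keys is empty, the join cannot change the number of rows in the publication table.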
Next, we inspect rows in pubmed_data where the QUALIS
variable remains NA. These rows indicate that no direct
match was found using ISSN, so we will need to attempt fuzzy matching
based on journal titles.
# Check the rows with NA QUALIS to identify unmatched ISSNs
pubmed_data %>% filter(is.na(QUALIS)) %>% select(`Publication Title`, ISSN, QUALIS)
## # A tibble: 26 × 3
## `Publication Title` ISSN QUALIS
## <chr> <chr> <ord>
## 1 International Journal of Cardiology 1874-1754 <NA>
## 2 International Journal of Clinical Pharmacy 2210-7711 <NA>
## 3 International Journal of Medical Informatics 1872-8243 <NA>
## 4 Pharmacogenomics 1744-8042 <NA>
## 5 Arquivos Brasileiros De Cardiologia 1678-4170 <NA>
## 6 Archives of Gerontology and Geriatrics 1872-6976 <NA>
## 7 Arquivos Brasileiros De Cardiologia 1678-4170 <NA>
## 8 Oral Diseases 1601-0825 <NA>
## 9 Research in social & administrative pharmacy: RSAP 1934-8150 <NA>
## 10 British Journal of Clinical Pharmacology 1365-2125 <NA>
## # ℹ 16 more rows
# Filter rows with NA in QUALIS for further processing
unmatched <- pubmed_data %>% filter(is.na(QUALIS))
Journal titles in the two datasets may differ in capitalization, formatting, or additional text (e.g., secondary titles after =). Text after = was already removed from the Publication Title column of pubmed_data right after the ISSN join, so the only normalization left is to convert the titles in both datasets to uppercase for consistent case-insensitive matching.
# Normalize text in unmatched
unmatched <- unmatched %>%
  # Convert to uppercase
  mutate(Publication_Title_Upper = toupper(`Publication Title`))
# Normalize text in qualis_data (only convert to uppercase, as other cleaning was done earlier)
qualis_data <- qualis_data %>%
  # Convert to uppercase
  mutate(Journal_Upper = toupper(Journal))
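To make the two normalization steps concrete, here is a small sketch that applies the same regular expression and toupper() call to two made-up titles. The titles are hypothetical and only illustrate the behaviour.

```r
library(stringr)

# Hypothetical titles, made up for illustration
titles <- c(
  "Revista Exemplo = Example Journal",
  "Another journal: AJ"
)

# Drop everything from the first "=" onward, including preceding whitespace
cleaned <- str_remove(titles, "\\s*=.*$")

# Uppercase so that comparisons ignore case
normalized <- toupper(cleaned)
print(normalized)
```

Titles without an = sign pass through the first step unchanged, so only the case conversion affects them.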
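For rows where the ISSN did not produce a match, we use fuzzy
matching to find the most similar journal title in qualis_data.
String distances are computed with stringdist using method = "jw",
and the closest match is selected for each unmatched journal.
Note that with the default p = 0, as used in the code below, method = "jw"
returns the plain Jaro distance; the Winkler prefix adjustment only
applies when p is set above 0.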
# Perform fuzzy matching
fuzzy_results <- unmatched %>%
  rowwise() %>%
  mutate(
    # Index of the closest QUALIS title, computed once per row
    best_idx = which.min(stringdist(Publication_Title_Upper, qualis_data$Journal_Upper, method = "jw")),
    best_match = qualis_data$Journal[best_idx],
    best_distance = stringdist(Publication_Title_Upper, qualis_data$Journal_Upper[best_idx], method = "jw"),
    QUALIS_from_pdf = qualis_data$QUALIS[best_idx]
  ) %>%
  ungroup() %>%
  # Drop the helper index
  select(-best_idx)
# Inspect matches with distances
fuzzy_results %>%
  select(`Publication Title`, best_match, best_distance) %>%
  arrange(best_distance) %>%
  print(n = 26)
## # A tibble: 26 × 3
## `Publication Title` best_match best_distance
## <chr> <chr> <dbl>
## 1 International Journal of Cardiology INTERNATI… 0
## 2 International Journal of Clinical Pharmacy INTERNATI… 0
## 3 International Journal of Medical Informatics INTERNATI… 0
## 4 Archives of Gerontology and Geriatrics ARCHIVES … 0
## 5 Oral Diseases ORAL DISE… 0
## 6 British Journal of Clinical Pharmacology BRITISH J… 0
## 7 Journal of Pharmaceutical and Biomedical Analysis JOURNAL O… 0
## 8 Internal and Emergency Medicine Internal … 0
## 9 Current Medical Research and Opinion CURRENT M… 0
## 10 Biomedicine & Pharmacotherapy BIOMEDICI… 0
## 11 European Journal of Clinical Pharmacology EUROPEAN … 0
## 12 British Journal of Clinical Pharmacology BRITISH J… 0
## 13 Journal of Thrombosis and Thrombolysis JOURNAL O… 0
## 14 Patient Education and Counseling PATIENT E… 0
## 15 International Journal of Clinical Pharmacy INTERNATI… 0
## 16 Research in social & administrative pharmacy: RSAP RESEARCH … 0.0400
## 17 Research in social & administrative pharmacy: RSAP RESEARCH … 0.0400
## 18 Arquivos Brasileiros De Cardiologia ARQUIVOS … 0.0797
## 19 Arquivos Brasileiros De Cardiologia ARQUIVOS … 0.0797
## 20 Pharmacogenomics PHARMACOG… 0.111
## 21 Einstein (Sao Paulo, Brazil) EINSTEIN … 0.159
## 22 Exploratory Research in Clinical and Social Pharmacy EXPERIMEN… 0.241
## 23 Journal of Cranio-Maxillo-Facial Surgery: Official … JOURNAL O… 0.248
## 24 BMC primary care BMC GERIA… 0.265
## 25 JBI evidence synthesis ISBT SCIE… 0.276
## 26 JBI evidence synthesis ISBT SCIE… 0.276
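The nearest-title lookup can be tried in isolation with a few strings. The sketch below uses made-up candidate titles (an assumption, not taken from qualis_data) and also compares the two variants available under method = "jw" in stringdist: with the default p = 0 the function returns the Jaro distance, and a positive p (up to 0.25) adds the Winkler prefix adjustment.

```r
library(stringdist)

# Hypothetical lookup titles, made up for illustration
candidates <- c("ORAL DISEASES", "ORAL ONCOLOGY", "PHARMACOGENOMICS")
query <- "ORAL DISEASES"

# Distance from the query to every candidate (0 means identical strings)
d <- stringdist(query, candidates, method = "jw")
best <- candidates[which.min(d)]

# Effect of the prefix factor p on a classic pair of similar strings
d_jaro <- stringdist("MARTHA", "MARHTA", method = "jw")           # p = 0: Jaro
d_jw   <- stringdist("MARTHA", "MARHTA", method = "jw", p = 0.1)  # Jaro-Winkler

print(best)
print(c(jaro = d_jaro, jaro_winkler = d_jw))
```

Because the Winkler adjustment rewards a shared prefix, it would tend to pull titles that start with the same words (such as "International Journal of ...") closer together, which is worth keeping in mind when choosing p for journal names.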
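Next, we filter the fuzzy matching results to retain only high-confidence matches, defined here as a string distance below 0.16. The table above motivates this cutoff: the matches at a distance of 0.241 and beyond include clearly different journals (for example, BMC primary care paired with a title beginning BMC GERIA), while the largest accepted distance is 0.159. These good matches are displayed for further inspection and will later be merged back into the main dataset.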
# Filter for good matches
good_matches <- fuzzy_results %>%
  filter(best_distance < 0.16) %>%
  select(`Publication Title`, best_match, QUALIS_from_pdf) %>%
  distinct() # Remove duplicate rows
good_matches
## # A tibble: 17 × 3
## `Publication Title` best_match QUALIS_from_pdf
## <chr> <chr> <ord>
## 1 International Journal of Cardiology INTERNATI… A3
## 2 International Journal of Clinical Pharmacy INTERNATI… A4
## 3 International Journal of Medical Informatics INTERNATI… A2
## 4 Pharmacogenomics PHARMACOG… A2
## 5 Arquivos Brasileiros De Cardiologia ARQUIVOS … B1
## 6 Archives of Gerontology and Geriatrics ARCHIVES … A1
## 7 Oral Diseases ORAL DISE… A2
## 8 Research in social & administrative pharmacy: RSAP RESEARCH … A3
## 9 British Journal of Clinical Pharmacology BRITISH J… A1
## 10 Journal of Pharmaceutical and Biomedical Analysis JOURNAL O… A2
## 11 Internal and Emergency Medicine Internal … A2
## 12 Current Medical Research and Opinion CURRENT M… A1
## 13 Biomedicine & Pharmacotherapy BIOMEDICI… A2
## 14 European Journal of Clinical Pharmacology EUROPEAN … A3
## 15 Einstein (Sao Paulo, Brazil) EINSTEIN … B1
## 16 Journal of Thrombosis and Thrombolysis JOURNAL O… A2
## 17 Patient Education and Counseling PATIENT E… A2
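One way to see how sensitive the result is to the cutoff is to count how many rows would be accepted at a few alternative thresholds. The sketch below re-enters the 26 best-match distances as they were printed (rounded) in the fuzzy matching output; the alternative cutoffs are arbitrary values chosen for illustration.

```r
# Best-match distances as printed in the fuzzy matching output (rounded values)
best_distance <- c(
  rep(0, 15), 0.0400, 0.0400, 0.0797, 0.0797,
  0.111, 0.159, 0.241, 0.248, 0.265, 0.276, 0.276
)

# Candidate cutoffs to compare; 0.16 is the one used in this report
cutoffs <- c(0.05, 0.10, 0.16, 0.25, 0.30)

# Number of rows that would be accepted under each cutoff
kept <- sapply(cutoffs, function(cut) sum(best_distance < cut))
sensitivity <- data.frame(cutoff = cutoffs, rows_kept = kept)
print(sensitivity)
```

At 0.16, 21 of the 26 rows are accepted, which leaves the 5 rows that remain without a grade after the merge.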
Now that we have filtered and refined the good_matches
dataset to retain only unique rows and excluded problematic matches, we
can proceed to merge the QUALIS_from_pdf variable back into
the pubmed_data dataset.
The merge uses the Publication Title column as the key. After the merge,
missing values (NA) in the QUALIS column of pubmed_data are filled with
the corresponding values from QUALIS_from_pdf, and the temporary
QUALIS_from_pdf column is removed to keep the dataset clean.
The resulting pubmed_data dataset will now include the
updated QUALIS values, incorporating matches from both the
ISSN-based join and the fuzzy title matching process.
# Merge the variable QUALIS_from_pdf from good_matches into pubmed_data
pubmed_data <- pubmed_data %>%
  # Join on Publication Title
  left_join(
    # Selecting only the necessary variables
    good_matches %>% select(`Publication Title`, QUALIS_from_pdf),
    by = "Publication Title"
  ) %>%
  # Fill missing QUALIS with matched values
  mutate(QUALIS = coalesce(QUALIS, QUALIS_from_pdf)) %>%
  # Remove temporary column
  select(-QUALIS_from_pdf)
# Check whether any NA values remain in the QUALIS column at this point
pubmed_data %>% select(`Publication Title`, ISSN, QUALIS) %>% filter(is.na(QUALIS))
## # A tibble: 5 × 3
## `Publication Title` ISSN QUALIS
## <chr> <chr> <ord>
## 1 Exploratory Research in Clinical and Social Pharmacy 2667… <NA>
## 2 BMC primary care 2731… <NA>
## 3 JBI evidence synthesis 2689… <NA>
## 4 Journal of Cranio-Maxillo-Facial Surgery: Official Publication o… 1878… <NA>
## 5 JBI evidence synthesis 2689… <NA>
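The join-then-coalesce pattern used for the merge can be reproduced on a tiny example. This is a sketch under stated assumptions: the tibbles pubs and matches are made up, and QUALIS is stored as plain character here for simplicity, whereas in this report it is an ordered factor.

```r
library(dplyr)

# Hypothetical toy data; QUALIS is plain character here for simplicity
pubs <- tibble(
  `Publication Title` = c("Journal A", "Journal B", "Journal C"),
  QUALIS = c("A1", NA, NA)
)
matches <- tibble(
  `Publication Title` = "Journal B",
  QUALIS_from_pdf = "A3"
)

filled <- pubs %>%
  left_join(matches, by = "Publication Title") %>%
  # Keep the ISSN-based grade when present, otherwise use the fuzzy match
  mutate(QUALIS = coalesce(QUALIS, QUALIS_from_pdf)) %>%
  select(-QUALIS_from_pdf)

print(filled)
```

Because coalesce() takes the first non-missing value, a grade obtained from the ISSN join is never overwritten by a fuzzy match, and titles with no match in either step stay NA.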
Some NA values remain: five rows, covering four distinct journals. I was
not able to find the QUALIS grade for these publications, even after
searching manually on the web, so they will remain without a QUALIS
grade.
To ensure the processed pubmed_data dataset is available
for the next step, we save it as a .rds file for easy
access.
# Save the dataset as an RDS file
saveRDS(pubmed_data, "pubmed_data_with_qualis.rds")
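One reason an RDS file is convenient here is that it preserves R column types, such as the ordered factor used for QUALIS, which a CSV file would not. The round trip below is a minimal sketch with a made-up data frame; the column names and the level order are assumptions for illustration only.

```r
# Hypothetical toy data frame with an ordered grade column
toy <- data.frame(
  title = c("Journal A", "Journal B"),
  QUALIS = factor(c("A1", "B1"), levels = c("B1", "A1"), ordered = TRUE)
)

# Write to a temporary file and read it back
path <- tempfile(fileext = ".rds")
saveRDS(toy, path)
restored <- readRDS(path)

# RDS preserves column types, including ordered factors
print(identical(toy, restored))
print(is.ordered(restored$QUALIS))
```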
With this done, all that remains is to write the scientific report itself, which will be detailed in the next document.