OpenAlex Coverage Analysis

Introduction

This R Markdown document analyzes the coverage of scientific publications in OpenAlex, a free and open index of scholarly works. We’ll use a decision tree model to predict whether a journal is included in OpenAlex based on various characteristics.

Load Required Libraries

First, we’ll load the necessary R libraries for our analysis.

library(rpart)
library(caret)
library(e1071)
library(dplyr)
library(tidyr)
library(pROC)
library(rpart.plot)

Data Preprocessing

We’ll start by reading and preprocessing our data.

# Set seed for reproducibility
set.seed(123)

# Read and preprocess data
rawData <- read.table("C:\\Users\\dgenk\\Documentos Locales\\ScholCommLab\\OpenAlex Coverage\\data\\globalCoverageData.txt", sep = "\t", header = TRUE, na.strings = "0")
data_for_cluster2 <- as.data.frame(rawData)

# Data preprocessing
data_for_cluster2 <- data_for_cluster2 %>% 
  mutate(across(c(numDocs2020, numDocs2021, numDocs2022, numDocs2023, GDPPerCapita, indScopus, indDOAJ, indDOI, OpenAlex, 
                  CrossRefMatchNumDocsFinal, DataCiteMatchNumDocs, mEDRAMatchNumDocs, JaLCMatchNumDocs, AiritiMatchNumDocs, 
                  DOIs, matchedDOIs, RepoSize, earliestYearPub, numDocsTotal, openPageRankDecEPAvg, openPageRankDecAvg, 
                  countryNumJournals), as.numeric)) %>%
  mutate(across(where(is.numeric), ~ replace_na(., 0))) %>%
  mutate(across(where(is.character), as.factor)) %>%
  mutate(across(c(indScopus, indDOAJ, indDOI, OpenAlex, CrossRef, Medra, JALC, Airiti, DataCite), as.factor)) %>%
  mutate(COUNTRY_CODE_3 = as.factor(COUNTRY_CODE_3))

data_for_cluster2 <- data_for_cluster2 %>% 
  mutate(CrossRef = as.factor(ifelse(CrossRefMatchNumDocsFinal > 1, 1, 0))) %>%
  rename(DOAJ = indDOAJ, Scopus = indScopus, DOI = indDOI)

# Feature selection
selection <- data_for_cluster2 %>% select(
  countryNumJournals, Scopus, DOAJ, OpenAlex,
  CrossRef, DataCiteMatchNumDocs, mEDRAMatchNumDocs,
  JaLCMatchNumDocs, AiritiMatchNumDocs, openPageRankDecEPAvg,
  openPageRankDecAvg, RepoSize, earliestYearPub, numDocsTotal, GDPPerCapita
)

data <- selection %>% 
  filter(countryNumJournals > 0) %>%
  mutate(OpenAlex = as.factor(ifelse(OpenAlex == 1, "Included", "Excluded"))) %>%
  filter(!is.na(OpenAlex))

# Display the first few rows and summary of the processed data
head(data)

##   countryNumJournals Scopus DOAJ OpenAlex CrossRef DataCiteMatchNumDocs
## 1                526      1    1 Included        1                    0
## 2                526      1    1 Included        1                    0
## 3                526      1    1 Included        1                    0
## 4                526      1    1 Included        0                   69
## 5                526      0    0 Included        1                    0
## 6                526      1    1 Included        1                    0
##   mEDRAMatchNumDocs JaLCMatchNumDocs AiritiMatchNumDocs openPageRankDecEPAvg
## 1                 0                0                  0             4.346667
## 2                 0                0                  0             2.250000
## 3                 0                0                  0             4.346667
## 4                 0                0                  0             2.866667
## 5                 0                0                  0             3.845000
## 6                 0                0                  0             2.926667
##   openPageRankDecAvg RepoSize earliestYearPub numDocsTotal GDPPerCapita
## 1               3.32      203               0            0     12973.03
## 2               0.00      203               0            0     12973.03
## 3               4.99      203               0            0     12973.03
## 4               2.36      203               0            0     12973.03
## 5               2.39      203               0            0     12973.03
## 6               0.00      203               0            0     12973.03

summary(data)

##  countryNumJournals Scopus    DOAJ          OpenAlex     CrossRef 
##  Min.   :    1      0:43537   0:38627   Excluded:13839   0:24956  
##  1st Qu.:  612      1: 3625   1: 8535   Included:33323   1:22206  
##  Median : 4374                                                    
##  Mean   :10840                                                    
##  3rd Qu.:21885                                                    
##  Max.   :21885                                                    
##  DataCiteMatchNumDocs mEDRAMatchNumDocs  JaLCMatchNumDocs  AiritiMatchNumDocs
##  Min.   :   0.000     Min.   :  0.0000   Min.   :0.0e+00   Min.   : 0.00000  
##  1st Qu.:   0.000     1st Qu.:  0.0000   1st Qu.:0.0e+00   1st Qu.: 0.00000  
##  Median :   0.000     Median :  0.0000   Median :0.0e+00   Median : 0.00000  
##  Mean   :   1.658     Mean   :  0.2208   Mean   :7.4e-04   Mean   : 0.00119  
##  3rd Qu.:   0.000     3rd Qu.:  0.0000   3rd Qu.:0.0e+00   3rd Qu.: 0.00000  
##  Max.   :7014.000     Max.   :598.0000   Max.   :3.5e+01   Max.   :56.00000  
##  openPageRankDecEPAvg openPageRankDecAvg    RepoSize      earliestYearPub
##  Min.   :0.000        Min.   :0.000      Min.   :  1.00   Min.   :   0   
##  1st Qu.:2.330        1st Qu.:0.000      1st Qu.:  2.00   1st Qu.:2016   
##  Median :3.061        Median :1.950      Median : 10.00   Median :2019   
##  Mean   :3.074        Mean   :1.832      Mean   : 42.31   Mean   :1998   
##  3rd Qu.:4.040        3rd Qu.:3.360      3rd Qu.: 32.00   3rd Qu.:2022   
##  Max.   :5.920        Max.   :7.400      Max.   :909.00   Max.   :2024   
##   numDocsTotal      GDPPerCapita   
##  Min.   :    0.0   Min.   :     0  
##  1st Qu.:   37.0   1st Qu.:  4490  
##  Median :   95.0   Median :  4490  
##  Mean   :  223.9   Mean   : 12095  
##  3rd Qu.:  220.0   3rd Qu.:  9500  
##  Max.   :35923.0   Max.   :125971
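One side effect of the preprocessing order is worth spelling out. Missing numeric values are replaced with 0 before OpenAlex is recoded, so a journal with a missing OpenAlex indicator ends up labeled "Excluded", and the later filter(!is.na(OpenAlex)) has nothing left to drop. The sketch below reproduces this on a small made-up data frame; the `toy` table and its `journal` and `numDocs` columns are hypothetical and not part of the real dataset.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data standing in for the real journal table
toy <- data.frame(
  journal  = c("A", "B", "C"),
  numDocs  = c("12", NA, "7"),  # counts stored as text, one missing
  OpenAlex = c(1, 0, NA),       # inclusion indicator, one missing
  stringsAsFactors = FALSE
)

clean <- toy %>%
  mutate(across(c(numDocs, OpenAlex), as.numeric)) %>%
  mutate(across(where(is.numeric), ~ replace_na(.x, 0))) %>%   # NA becomes 0
  mutate(across(where(is.character), as.factor)) %>%
  mutate(OpenAlex = as.factor(ifelse(OpenAlex == 1, "Included", "Excluded"))) %>%
  filter(!is.na(OpenAlex))      # drops nothing: the NA already became 0

print(clean)
```

If missing indicators should be treated as unknown instead of "Excluded", one option would be to recode OpenAlex before the replace_na() step.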

Model Training and Evaluation

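We’ll tune the complexity parameter (cp) of the decision tree with 3-fold cross-validation, then estimate performance with 5-fold cross-validation at the selected cp. Because cp is tuned once on the full dataset instead of inside each outer fold, this is a two-stage approximation of nested cross-validation, and the performance estimates may be slightly optimistic.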

# Define the outer cross-validation
outer_cv <- trainControl(
  method = "cv", 
  number = 5, 
  savePredictions = "final",
  classProbs = TRUE,
  summaryFunction = twoClassSummary
)

# Define the inner cross-validation
inner_cv <- trainControl(
  method = "cv", 
  number = 3,
  classProbs = TRUE,
  summaryFunction = twoClassSummary
)

# Tune cp with 3-fold CV on the full dataset
model_spec <- train(
  OpenAlex ~ .,
  data = data,
  method = "rpart",
  trControl = inner_cv,
  tuneLength = 10,
  metric = "ROC"
)

# Estimate performance with 5-fold CV, holding cp at the tuned value
nested_cv_results <- train(
  OpenAlex ~ .,
  data = data,
  method = "rpart",
  trControl = outer_cv,
  tuneGrid = model_spec$bestTune,
  metric = "ROC"
)

# Print the results
print(nested_cv_results)

## CART 
## 
## 47162 samples
##    14 predictor
##     2 classes: 'Excluded', 'Included' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 37729, 37729, 37729, 37730, 37731 
## Resampling results:
## 
##   ROC        Sens       Spec     
##   0.8623988  0.7677582  0.8238459
## 
## Tuning parameter 'cp' was held constant at a value of 0.002745863
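In the code above, cp is tuned once on the full dataset and then held constant during the 5-fold evaluation. A stricter nested design repeats the tuning inside every outer training fold, so the held-out fold never influences the choice of cp. The following is a minimal sketch of that idea on synthetic data; the `toy` data frame, its columns `x1`, `x2`, and `y`, and the tuning grid size are illustrative assumptions, and this is not the procedure that produced the results above.

```r
library(caret)
library(rpart)
library(pROC)

set.seed(123)

# Synthetic two-class data (hypothetical; stands in for the journal table)
n <- 600
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
toy$y <- factor(ifelse(toy$x1 + rnorm(n, sd = 0.5) > 0, "Included", "Excluded"))

# Inner loop: 3-fold CV used only to choose cp
inner_ctrl <- trainControl(method = "cv", number = 3,
                           classProbs = TRUE,
                           summaryFunction = twoClassSummary)

# Outer loop: 5 folds, each element holds the held-out row indices
outer_folds <- createFolds(toy$y, k = 5)

outer_auc <- sapply(outer_folds, function(test_idx) {
  train_set <- toy[-test_idx, ]
  test_set  <- toy[test_idx, ]

  # Tuning sees only the outer training rows
  fit <- train(y ~ ., data = train_set, method = "rpart",
               trControl = inner_ctrl, tuneLength = 5, metric = "ROC")

  # Score the untouched outer fold
  prob <- predict(fit, newdata = test_set, type = "prob")[, "Included"]
  as.numeric(auc(roc(test_set$y, prob,
                     levels = c("Excluded", "Included"),
                     direction = "<", quiet = TRUE)))
})

print(outer_auc)
print(mean(outer_auc))
```

The mean of the per-fold AUC values is then the nested estimate. On a dataset of this size the difference from the two-stage approach is likely to be small, but that is an expectation and has not been checked here.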

Model Performance Metrics

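Let’s calculate performance metrics from the pooled held-out predictions of the 5 folds. Note that precision, recall, and F1 are reported for the "Excluded" class, because confusionMatrix() treats the first factor level as the positive class by default.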

# Calculate overall metrics
predictions <- nested_cv_results$pred
conf_matrix <- confusionMatrix(predictions$pred, predictions$obs)
accuracy <- conf_matrix$overall["Accuracy"]
precision <- conf_matrix$byClass["Precision"]
recall <- conf_matrix$byClass["Recall"]
f1_score <- conf_matrix$byClass["F1"]

# Print metrics
cat("Overall accuracy:", round(accuracy, 4), "\n")

## Overall accuracy: 0.8074

cat("Overall precision:", round(precision, 4), "\n")

## Overall precision: 0.6441

cat("Overall recall:", round(recall, 4), "\n")

## Overall recall: 0.7678

cat("Overall F1 score:", round(f1_score, 4), "\n")

## Overall F1 score: 0.7005

# Calculate and plot ROC curve
roc_obj <- roc(predictions$obs, predictions[, "Included"])
auc_value <- auc(roc_obj)

plot(roc_obj, main = paste("ROC Curve (AUC =", round(auc_value, 4), ")"))

cat("Area Under the Curve (AUC):", round(auc_value, 4), "\n")

## Area Under the Curve (AUC): 0.8617
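Precision, recall, and F1 depend on which class is treated as positive. The sketch below computes them for both classes from a 2x2 confusion matrix. The cell counts are back-calculated from the accuracy, precision, and recall printed above together with the class totals in summary(data), so they are an approximation and not output from the original analysis; the helper `class_metrics` is a hypothetical name.

```r
# Counts back-calculated from the reported metrics (an approximation,
# not output from the original analysis). Rows = predicted, columns = observed.
cm <- matrix(c(10625, 3214,    # observed Excluded: predicted Excluded, Included
               5870, 27453),   # observed Included: predicted Excluded, Included
             nrow = 2,
             dimnames = list(pred = c("Excluded", "Included"),
                             obs  = c("Excluded", "Included")))

# Precision, recall and F1 for a chosen positive class
class_metrics <- function(cm, positive) {
  tp        <- cm[positive, positive]
  precision <- tp / sum(cm[positive, ])   # row sum = all predicted positive
  recall    <- tp / sum(cm[, positive])   # column sum = all observed positive
  c(precision = precision,
    recall    = recall,
    f1        = 2 * precision * recall / (precision + recall))
}

accuracy   <- sum(diag(cm)) / sum(cm)
m_excluded <- class_metrics(cm, "Excluded")
m_included <- class_metrics(cm, "Included")

print(round(accuracy, 4))
print(round(rbind(Excluded = m_excluded, Included = m_included), 4))
```

With caret, the same switch can be made by passing positive = "Included" to confusionMatrix().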

Variable Importance

Let’s examine which variables are most important in our model.

# Plot variable importance
importance <- varImp(nested_cv_results)
plot(importance)
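For rpart models, the fitted object itself also stores an importance score per variable in its `variable.importance` component. The sketch below shows how to read and rescale it, using the `kyphosis` example dataset that ships with rpart in place of the journal data. It illustrates the idea only; the values from caret’s varImp() may be computed and scaled differently.

```r
library(rpart)

# Example data bundled with rpart (not the journal dataset)
fit <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis)

# Named numeric vector, sorted from most to least important
imp <- fit$variable.importance

# Rescale so the scores sum to 100 for easier comparison
imp_pct <- round(100 * imp / sum(imp), 1)
print(imp_pct)
```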

Final Model and Decision Tree Visualization

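Finally, we’ll fit a final model on the entire dataset and visualize the decision tree. This model uses cp = 0.01 (the rpart default) instead of the tuned value of 0.002745863, which gives a smaller tree.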

# Fit final model on entire dataset
final_model <- rpart(OpenAlex ~ ., data = data, 
                     control = rpart.control(cp = 0.01))

# Plot the decision tree
rpart.plot(final_model, 
           extra = 107,
           box.palette = "Greens",
           split.cex = 0.8,
           cex = 0.9,
           uniform = TRUE,
           branch = 0.5,
           main = "Journal Characteristics Using DOIs",
           under = TRUE,
           compress = TRUE)

# Display decision rules
rules <- rpart.rules(final_model)
print(rules)

##  OpenAlex                                             
##      0.41 when CrossRef is 0 & DOAJ is 0 & Scopus is 0
##      0.79 when CrossRef is 0 & DOAJ is 1              
##      0.81 when CrossRef is 0 & DOAJ is 0 & Scopus is 1
##      0.97 when CrossRef is 1
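Read top-down, the four rules form a short cascade: check CrossRef first, then DOAJ, then Scopus. The first column is read here as the fitted probability that a journal is "Included". As a minimal sketch, the same rules can be written as a plain function; the function name `p_included` and the example `journals` table are illustrative, and the probabilities are copied from the printed rules.

```r
# Leaf probabilities copied from the printed rules above.
# Function and argument names are illustrative.
p_included <- function(CrossRef, DOAJ, Scopus) {
  ifelse(CrossRef == 1, 0.97,          # any journal flagged in CrossRef
  ifelse(DOAJ == 1,     0.79,          # no CrossRef, but in DOAJ
  ifelse(Scopus == 1,   0.81, 0.41)))  # no CrossRef/DOAJ: Scopus or neither
}

# One hypothetical journal per rule
journals <- data.frame(CrossRef = c(1, 0, 0, 0),
                       DOAJ     = c(0, 1, 0, 0),
                       Scopus   = c(0, 0, 1, 0))
journals$p <- with(journals, p_included(CrossRef, DOAJ, Scopus))
print(journals)
```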

Conclusion

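This analysis provides insights into the factors that influence a journal’s inclusion in OpenAlex. In the final decision tree, the CrossRef indicator is the first split: journals flagged as CrossRef have an estimated 0.97 probability of inclusion, compared with 0.41 for journals that are in none of CrossRef, DOAJ, or Scopus.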

Key findings:

1. Model Performance:

Accuracy: 0.8073873
Precision: 0.6441346
Recall: 0.7677578
F1 Score: 0.7005341
Area Under the Curve (AUC): 0.8616583

Note: we are currently working on including language and discipline as predictors.

Diego Chavarro and Juan Pablo Alperín

2024-10-11
