This R Markdown document analyzes how well OpenAlex, a free and open index of scholarly works, covers scientific journals. We’ll fit a decision tree to predict whether a journal is included in OpenAlex from characteristics such as indexing in Scopus and DOAJ, DOI registration, publication volume, and country-level indicators.
First, we’ll load the necessary R libraries for our analysis.
library(rpart)
library(caret)
library(e1071)
library(dplyr)
library(tidyr)
library(pROC)
library(rpart.plot)
We’ll start by reading and preprocessing our data.
# Set seed for reproducibility
set.seed(123)
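# Read the tab-separated data; na.strings = "0" reads zeros as NA
# (numeric NAs are set back to 0 in the preprocessing step below)
rawData <- read.table("C:\\Users\\dgenk\\Documentos Locales\\ScholCommLab\\OpenAlex Coverage\\data\\globalCoverageData.txt", sep = "\t", header = TRUE, na.strings = "0")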
data_for_cluster2 <- as.data.frame(rawData)
# Data preprocessing
data_for_cluster2 <- data_for_cluster2 %>%
  mutate(across(c(numDocs2020, numDocs2021, numDocs2022, numDocs2023, GDPPerCapita, indScopus, indDOAJ, indDOI, OpenAlex,
                  CrossRefMatchNumDocsFinal, DataCiteMatchNumDocs, mEDRAMatchNumDocs, JaLCMatchNumDocs, AiritiMatchNumDocs,
                  DOIs, matchedDOIs, RepoSize, earliestYearPub, numDocsTotal, openPageRankDecEPAvg, openPageRankDecAvg,
                  countryNumJournals), as.numeric)) %>%
  mutate(across(where(is.numeric), ~ replace_na(., 0))) %>%
  mutate(across(where(is.character), as.factor)) %>%
  mutate(across(c(indScopus, indDOAJ, indDOI, OpenAlex, CrossRef, Medra, JALC, Airiti, DataCite), as.factor)) %>%
  mutate(COUNTRY_CODE_3 = as.factor(COUNTRY_CODE_3))
data_for_cluster2 <- data_for_cluster2 %>%
  mutate(CrossRef = as.factor(ifelse(CrossRefMatchNumDocsFinal > 1, 1, 0))) %>%
  rename(DOAJ = indDOAJ, Scopus = indScopus, DOI = indDOI)
# Feature selection
selection <- data_for_cluster2 %>% select(
  countryNumJournals, Scopus, DOAJ, OpenAlex,
  CrossRef, DataCiteMatchNumDocs, mEDRAMatchNumDocs,
  JaLCMatchNumDocs, AiritiMatchNumDocs, openPageRankDecEPAvg,
  openPageRankDecAvg, RepoSize, earliestYearPub, numDocsTotal, GDPPerCapita
)
data <- selection %>%
  filter(countryNumJournals > 0) %>%
  mutate(OpenAlex = as.factor(ifelse(OpenAlex == 1, "Included", "Excluded"))) %>%
  filter(!is.na(OpenAlex))
# Display the first few rows and summary of the processed data
head(data)
##   countryNumJournals Scopus DOAJ OpenAlex CrossRef DataCiteMatchNumDocs
## 1                526      1    1 Included        1                    0
## 2                526      1    1 Included        1                    0
## 3                526      1    1 Included        1                    0
## 4                526      1    1 Included        0                   69
## 5                526      0    0 Included        1                    0
## 6                526      1    1 Included        1                    0
##   mEDRAMatchNumDocs JaLCMatchNumDocs AiritiMatchNumDocs openPageRankDecEPAvg
## 1                 0                0                  0             4.346667
## 2                 0                0                  0             2.250000
## 3                 0                0                  0             4.346667
## 4                 0                0                  0             2.866667
## 5                 0                0                  0             3.845000
## 6                 0                0                  0             2.926667
##   openPageRankDecAvg RepoSize earliestYearPub numDocsTotal GDPPerCapita
## 1               3.32      203               0            0     12973.03
## 2               0.00      203               0            0     12973.03
## 3               4.99      203               0            0     12973.03
## 4               2.36      203               0            0     12973.03
## 5               2.39      203               0            0     12973.03
## 6               0.00      203               0            0     12973.03
summary(data)
##  countryNumJournals Scopus    DOAJ          OpenAlex     CrossRef
##  Min.   :    1      0:43537   0:38627   Excluded:13839   0:24956
##  1st Qu.:  612      1: 3625   1: 8535   Included:33323   1:22206
##  Median : 4374
##  Mean   :10840
##  3rd Qu.:21885
##  Max.   :21885
##  DataCiteMatchNumDocs mEDRAMatchNumDocs  JaLCMatchNumDocs  AiritiMatchNumDocs
##  Min.   :   0.000     Min.   :  0.0000   Min.   :0.0e+00   Min.   : 0.00000
##  1st Qu.:   0.000     1st Qu.:  0.0000   1st Qu.:0.0e+00   1st Qu.: 0.00000
##  Median :   0.000     Median :  0.0000   Median :0.0e+00   Median : 0.00000
##  Mean   :   1.658     Mean   :  0.2208   Mean   :7.4e-04   Mean   : 0.00119
##  3rd Qu.:   0.000     3rd Qu.:  0.0000   3rd Qu.:0.0e+00   3rd Qu.: 0.00000
##  Max.   :7014.000     Max.   :598.0000   Max.   :3.5e+01   Max.   :56.00000
##  openPageRankDecEPAvg openPageRankDecAvg    RepoSize      earliestYearPub
##  Min.   :0.000        Min.   :0.000      Min.   :  1.00   Min.   :   0
##  1st Qu.:2.330        1st Qu.:0.000      1st Qu.:  2.00   1st Qu.:2016
##  Median :3.061        Median :1.950      Median : 10.00   Median :2019
##  Mean   :3.074        Mean   :1.832      Mean   : 42.31   Mean   :1998
##  3rd Qu.:4.040        3rd Qu.:3.360      3rd Qu.: 32.00   3rd Qu.:2022
##  Max.   :5.920        Max.   :7.400      Max.   :909.00   Max.   :2024
##   numDocsTotal      GDPPerCapita
##  Min.   :    0.0   Min.   :     0
##  1st Qu.:   37.0   1st Qu.:  4490
##  Median :   95.0   Median :  4490
##  Mean   :  223.9   Mean   : 12095
##  3rd Qu.:  220.0   3rd Qu.:  9500
##  Max.   :35923.0   Max.   :125971
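The recoding of OpenAlex from 0/1 to "Included"/"Excluded" in the preprocessing above matters for the modelling steps: caret expects class levels that are valid R names when class probabilities are requested, and the alphabetical level order decides which class comes first. A minimal sketch with made-up values (the vector `open_alex_raw` is hypothetical and only stands in for the real column):

```r
# Toy 0/1 indicator standing in for the OpenAlex column (values are made up)
open_alex_raw <- c(1, 0, 1, 1, 0)

# "0" and "1" are not syntactically valid R names, so make.names() alters them
valid_as_is <- make.names(c("0", "1")) == c("0", "1")
valid_as_is  # both FALSE

# Recode to descriptive labels, mirroring the ifelse() call in the pipeline
open_alex <- factor(ifelse(open_alex_raw == 1, "Included", "Excluded"))

# Levels are sorted alphabetically, so "Excluded" is the first level
levels(open_alex)
table(open_alex)
```

Because "Excluded" is the first level, functions that default to the first level as the class of interest will report their class-specific metrics for excluded journals.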
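We’ll tune and evaluate the decision tree in two stages: a 3-fold cross-validation selects the complexity parameter cp, and a 5-fold cross-validation then estimates performance at that value. Both stages use the full dataset, so this approximates nested cross-validation rather than implementing it strictly, and the performance estimates may be slightly optimistic.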
# Define the outer cross-validation
outer_cv <- trainControl(
  method = "cv",
  number = 5,
  savePredictions = "final",
  classProbs = TRUE,
  summaryFunction = twoClassSummary
)
# Define the inner cross-validation
inner_cv <- trainControl(
  method = "cv",
  number = 3,
  classProbs = TRUE,
  summaryFunction = twoClassSummary
)
# Tune cp with the inner 3-fold CV (10 candidate values, selected by ROC)
model_spec <- train(
  OpenAlex ~ .,
  data = data,
  method = "rpart",
  trControl = inner_cv,
  tuneLength = 10,
  metric = "ROC"
)
# Estimate performance with the outer 5-fold CV, holding cp at the selected value
nested_cv_results <- train(
  OpenAlex ~ .,
  data = data,
  method = "rpart",
  trControl = outer_cv,
  tuneGrid = model_spec$bestTune,
  metric = "ROC"
)
# Print the results
print(nested_cv_results)
## CART
##
## 47162 samples
## 14 predictor
## 2 classes: 'Excluded', 'Included'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 37729, 37729, 37729, 37730, 37731
## Resampling results:
##
## ROC Sens Spec
## 0.8623988 0.7677582 0.8238459
##
## Tuning parameter 'cp' was held constant at a value of 0.002745863
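In the procedure above, cp is selected using the same rows that are later used to estimate performance. A strictly nested scheme would repeat the tuning inside every outer training fold. The following is a minimal sketch of that idea, not the analysis used in this document: it runs on synthetic data, scores by accuracy instead of ROC, and the helper names (`fold_accuracy`, `make_folds`) and the `cp_grid` values are assumptions made for illustration.

```r
library(rpart)

set.seed(123)

# Synthetic two-class data standing in for the journal table (made up)
n <- 300
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
toy$y <- factor(ifelse(toy$x1 + 0.5 * toy$x2 + rnorm(n, sd = 0.5) > 0,
                       "Included", "Excluded"))

cp_grid <- c(0.001, 0.01, 0.05)  # hypothetical candidate values

# Accuracy of an rpart tree trained on `train` and scored on `test`
fold_accuracy <- function(train, test, cp) {
  fit <- rpart(y ~ ., data = train, method = "class",
               control = rpart.control(cp = cp))
  mean(predict(fit, test, type = "class") == test$y)
}

# Randomly assign each row to one of k folds
make_folds <- function(n_rows, k) sample(rep(seq_len(k), length.out = n_rows))

outer_folds <- make_folds(n, 5)
outer_acc <- numeric(5)
chosen_cp <- numeric(5)

for (i in 1:5) {
  outer_train <- toy[outer_folds != i, ]
  outer_test  <- toy[outer_folds == i, ]

  # Inner loop: choose cp using only the outer training rows
  inner_folds <- make_folds(nrow(outer_train), 3)
  inner_acc <- sapply(cp_grid, function(cp) {
    mean(sapply(1:3, function(j) {
      fold_accuracy(outer_train[inner_folds != j, ],
                    outer_train[inner_folds == j, ], cp)
    }))
  })
  chosen_cp[i] <- cp_grid[which.max(inner_acc)]

  # Outer loop: score the tuned tree on rows that played no part in tuning
  outer_acc[i] <- fold_accuracy(outer_train, outer_test, chosen_cp[i])
}

mean(outer_acc)  # performance estimate that does not reuse tuning rows
chosen_cp        # cp may differ from fold to fold
```

The selected cp can vary across outer folds, which is expected: nested cross-validation estimates the performance of the whole tuning procedure rather than of one fixed cp value.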
Let’s calculate and display various performance metrics for our model.
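# Calculate overall metrics from the held-out fold predictions
predictions <- nested_cv_results$pred
# confusionMatrix() uses the first factor level ("Excluded") as the positive class,
# so precision, recall, and F1 below describe the "Excluded" class
conf_matrix <- confusionMatrix(predictions$pred, predictions$obs)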
accuracy <- conf_matrix$overall["Accuracy"]
precision <- conf_matrix$byClass["Precision"]
recall <- conf_matrix$byClass["Recall"]
f1_score <- conf_matrix$byClass["F1"]
# Print metrics
cat("Overall accuracy:", round(accuracy, 4), "\n")
## Overall accuracy: 0.8074
cat("Overall precision:", round(precision, 4), "\n")
## Overall precision: 0.6441
cat("Overall recall:", round(recall, 4), "\n")
## Overall recall: 0.7678
cat("Overall F1 score:", round(f1_score, 4), "\n")
## Overall F1 score: 0.7005
# Calculate and plot ROC curve
roc_obj <- roc(predictions$obs, predictions[, "Included"])
auc_value <- auc(roc_obj)
plot(roc_obj, main = paste("ROC Curve (AUC =", round(auc_value, 4), ")"))
cat("Area Under the Curve (AUC):", round(auc_value, 4), "\n")
## Area Under the Curve (AUC): 0.8617
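As a quick consistency check on the metrics printed above, F1 is the harmonic mean of precision and recall. Since caret's confusionMatrix() takes the first factor level as the positive class by default, these three values describe the "Excluded" class. The sketch below only re-uses the rounded precision and recall from the output:

```r
# Rounded values copied from the printed output above
precision <- 0.6441
recall    <- 0.7678

# F1 is the harmonic mean of precision and recall
f1 <- 2 * precision * recall / (precision + recall)
round(f1, 4)  # 0.7005, matching the reported F1 score
```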
Let’s examine which variables are most important in our model.
# Plot variable importance
importance <- varImp(nested_cv_results)
plot(importance)
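For readers who want the numbers behind an importance plot, an rpart fit stores its own importance scores in the `variable.importance` component. The sketch below uses the `kyphosis` dataset that ships with rpart as a stand-in for the journal data; note that rpart's scores are not necessarily identical to the scaled values that caret's varImp() reports.

```r
library(rpart)

# kyphosis ships with rpart; it stands in for the journal data here
fit <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis, method = "class")

# Named numeric vector of importance scores, sorted largest first
imp <- sort(fit$variable.importance, decreasing = TRUE)

# Rescale so the scores sum to 100 for easier comparison
imp_pct <- round(100 * imp / sum(imp), 1)
imp_pct
```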
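Finally, we’ll fit a model on the entire dataset and visualize the decision tree. This tree is fit with cp = 0.01, the rpart default, rather than the tuned value of 0.002745863, so it may be simpler than the tree evaluated above.
# Fit final model on entire dataset (cp = 0.01, not the tuned value)
final_model <- rpart(OpenAlex ~ ., data = data,
                     control = rpart.control(cp = 0.01))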
# Plot the decision tree
rpart.plot(final_model,
           extra = 107,
           box.palette = "Greens",
           split.cex = 0.8,
           cex = 0.9,
           uniform = TRUE,
           branch = 0.5,
           main = "Journal Characteristics Using DOIs",
           under = TRUE,
           compress = TRUE)
# Display decision rules
rules <- rpart.rules(final_model)
print(rules)
## OpenAlex
## 0.41 when CrossRef is 0 & DOAJ is 0 & Scopus is 0
## 0.79 when CrossRef is 0 & DOAJ is 1
## 0.81 when CrossRef is 0 & DOAJ is 0 & Scopus is 1
## 0.97 when CrossRef is 1
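The four printed rules can be read as a small lookup: each value is taken here to be the estimated probability of the second class level, "Included", for journals reaching that leaf. The sketch below hard-codes the printed values in a hypothetical helper, `rule_probability`, purely to make the branching explicit; it is not a substitute for predict() on the fitted model.

```r
# Leaf values copied from the printed rules; the function name is hypothetical
rule_probability <- function(CrossRef, DOAJ, Scopus) {
  if (CrossRef == 1) return(0.97)  # CrossRef match: other flags are not consulted
  if (DOAJ == 1) return(0.79)      # no CrossRef match, but listed in DOAJ
  if (Scopus == 1) return(0.81)    # no CrossRef or DOAJ, but indexed in Scopus
  0.41                             # none of the three
}

rule_probability(CrossRef = 0, DOAJ = 1, Scopus = 0)  # 0.79
rule_probability(CrossRef = 0, DOAJ = 0, Scopus = 0)  # 0.41
```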
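This analysis shows which journal characteristics are associated with inclusion in OpenAlex. In the final tree, a CrossRef match is the first split: journals with one have an estimated 0.97 probability of inclusion. Without a CrossRef match, indexing in DOAJ (0.79) or Scopus (0.81) still predicts inclusion, while journals with none of the three fall to 0.41.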
Key findings:
1. Model Performance:
Accuracy: 0.8073873
Precision: 0.6441346
Recall: 0.7677578
F1 Score: 0.7005341
Area Under the Curve (AUC): 0.8616583
Language and discipline are not yet included as predictors.