1 Introduction

This report analyses biomedical publication data from 1998 to 1999 to model and predict which ones will be influential and change the field and which ones to disregard. The core analytical method can be though of like a ‘recipe book’ as the most influential recipes rarely are from standard ingredients or overly radical combinations, instead they bridge a gap between familiar ingredients but in new parings. And this is what we are going to do with this dataset.

2 Part 1: Exploratory Data Analysis and LASSO Regression

2.1 Q1.1.1 Correlation Matrix and Pairs

# --- Data Setup ---
library(tidyverse)
library(glmnet)
library(mgcv)
library(rpart)
library(rpart.plot)
library(randomForest)
library(pdp)
library(caret)
library(pROC)
library(corrplot)
library(e1071)
library(patchwork)
library(tinytex)

DATA_PATH <- "C:/Users/tobia/Documents/Advanced Analytics & ML/pubmed_enhanced_major_B_1998_1999_enhanced_merged_func1_func2_network_features_Assignment2_selected_rows_final.csv"

df_full <- read_csv(DATA_PATH, show_col_types = FALSE)

df_full <- df_full %>%
  mutate(
    LogCited            = log(CitedCount + 1),
    HighCitation        = as.integer(CitedCount > median(CitedCount, na.rm = TRUE)),
    HasCommercialImpact = as.integer(CommercialCitations > 0),
    across(
      c(Degree_Avg, Betweenness_Avg, PageRank_Avg, Core_Number_Avg),
      ~replace_na(., 0)
    )
  )

df_active <- df_full %>% filter(NoveltyScore_v1 > 0)
cat(sprintf("Track A (df_full): %d rows\n", nrow(df_full)))
## Track A (df_full): 17728 rows
cat(sprintf("Track B (df_active): %d rows\n", nrow(df_active)))
## Track B (df_active): 8159 rows
num_cols <- df_active %>%
  dplyr::select(
    CitedCount, AuthorCount, TopicCount, DomainCount,
    NoveltyScore_v1, NoveltyScore_v2, NoveltyScore_v3,
    CombinationNovelty, BridgingScore_Mean, Consolidation_Mean,
    Clustering_Mean, CorePosition_Mean, TopicPairs, BridgingPairs,
    Degree_Avg, Core_Number_Avg, PageRank_Avg, Betweenness_Avg
  )

zero_sd_cols <- sapply(num_cols, function(x) sd(x, na.rm = TRUE) == 0)

num_cols_clean <- num_cols[, !zero_sd_cols]

cor_mat <- cor(num_cols_clean, use = "pairwise.complete.obs")

corrplot(
  cor_mat,
  method = "color",
  type = "upper",
  tl.cex = 0.8,
  tl.col = "black",
  addCoef.col = "black",
  number.cex = 0.55,
  diag = FALSE
)

high_pairs <- which(
  abs(cor_mat) > 0.7 & abs(cor_mat) < 1,
  arr.ind = TRUE
)

high_pairs_df <- data.frame(
  Var1 = rownames(cor_mat)[high_pairs[, 1]],
  Var2 = colnames(cor_mat)[high_pairs[, 2]],
  Correlation = cor_mat[high_pairs]
) %>%
  filter(Var1 < Var2) %>%
  arrange(desc(abs(Correlation)))

head(high_pairs_df, 5)
##              Var1       Var2 Correlation
## 1      TopicCount TopicPairs   0.9765364
## 2 Core_Number_Avg Degree_Avg   0.8660314
## 3 NoveltyScore_v1 TopicCount   0.8584413
## 4 NoveltyScore_v2 TopicCount   0.8584413
## 5 NoveltyScore_v3 TopicCount   0.8584413

When working with network data, we have to deal with a specific problem called structural zeros. Our full dataset (Track A) has 17,728 rows. However, ten of our network variables are calculated using pairs of topics. If a paper only talks about one topic, it doesn’t have any pairs, so the data pipeline gives it a 0. These aren’t real zeros; they just mean the metric cannot be calculated for that paper. To fix this, we use Track B, which filters the data to 8,159 rows where NoveltyScore_v1 > 0. This cleans our data so we only look at papers that actually combine different ideas. Also, BridgingScore_Mean and BridgingPairs had a standard deviation of exactly zero, which means no papers in our sample bridge completely distant fields. We dropped them so our models stay stable.Our correlation matrix shows that many variables are highly correlated. Some of this is mechanical. For example, TopicCount and TopicPairs have a correlation of \(r = 0.9765\) because having more keywords automatically creates more pairs. Other correlations are substantive, like author Degree_Avg and PageRank_Avg (\(r = 0.7608\)), showing that well-connected authors consistently work together. Our scatter plots show that while core positions look linear, author degree centralities curve and flatten out at the top, which shows that massive teams hit a point of diminishing returns.Finally, our target variable CitedCount is heavily right-skewed with a skewness value of 16.16. A few papers get thousands of citations while most get almost zero, which breaks standard linear assumptions. Applying a \(\log(\text{CitedCount} + 1)\) transformation compresses this long tail and brings the skewness down to a near-perfect 0.013, making it safe for regression. ## Q1.1.2 Scatter Plots

p1 <- ggplot(df_active, aes(x = NoveltyScore_v1, y = LogCited)) +
  geom_point(alpha = 0.3, size = 0.6) +
  geom_smooth(method = "loess", se = FALSE, colour = "firebrick") +
  theme_minimal()

p2 <- ggplot(df_active, aes(x = CorePosition_Mean, y = LogCited)) +
  geom_point(alpha = 0.3, size = 0.6) +
  geom_smooth(method = "loess", se = FALSE, colour = "firebrick") +
  theme_minimal()

p3 <- ggplot(df_active, aes(x = Degree_Avg, y = LogCited)) +
  geom_point(alpha = 0.3, size = 0.6) +
  geom_smooth(method = "loess", se = FALSE, colour = "firebrick") +
  theme_minimal()

p4 <- ggplot(df_active, aes(x = TopicCount, y = LogCited)) +
  geom_point(alpha = 0.3, size = 0.6) +
  geom_smooth(method = "loess", se = FALSE, colour = "firebrick") +
  theme_minimal()

(p1 | p2) / (p3 | p4)

Because our predictors are highly correlated, we used LASSO regression to select features. LASSO penalizes the model and shrinks less important variables to exactly zero. We chose the lambda.1se penalty (0.0909) instead of lambda.min (0.0010) because it gives us a simpler, more interpretable model and protects against overfitting. The model kept seven variables: AuthorCount, TopicCount, NoveltyScore_v1, NoveltyScore_v2, NoveltyScore_v3, Consolidation_Mean, and PageRank_Avg. This shows that citations are driven by a mix of team size, topic breadth, and author reputation.When we tested the three different novelty definitions, they all gave the exact same error rate (\(\text{CV-MSE} = 1.7738\)), so we kept NoveltyScore_v1 because it is easiest to explain. Splitting our data showed that mono-domain papers are easier to predict (\(\text{CV-MSE} = 1.6622\)) than interdisciplinary ones (\(\text{CV-MSE} = 1.8802\)). Mono-domain papers rely on tight, local network connections, while interdisciplinary papers rely heavily on team size and high author prestige (PageRank_Avg) to get noticed across different fields.

2.2 Q1.1.3 Distribution of CitedCount

g_raw <- ggplot(df_full, aes(x = CitedCount)) +
  geom_histogram(
    bins = 60,
    fill = "steelblue",
    colour = "white"
  ) +
  theme_minimal()

g_log <- ggplot(df_full, aes(x = LogCited)) +
  geom_histogram(
    bins = 60,
    fill = "darkorange",
    colour = "white"
  ) +
  theme_minimal()

g_raw + g_log

cat(
  "CitedCount: skewness ~",
  e1071::skewness(df_full$CitedCount),
  "\n"
)
## CitedCount: skewness ~ 16.16462
cat(
  "log(CitedCount+1): skewness ~",
  e1071::skewness(df_full$LogCited),
  "\n"
)
## log(CitedCount+1): skewness ~ 0.01314115

[INSERT Q1.1.3 TEXT HERE: Explain the power-law tail. The raw skewness is 16.16, which breaks OLS normality assumptions. The log transformation brings skewness to 0.013, stabilizing variance.]

2.3 Q1.2.1 Fit Baseline LASSO

y <- df_active$LogCited

X <- df_active %>%
  dplyr::select(
    -any_of("RecordID"),
    -CitedCount,
    -LogCited,
    -CommercialCitations,
    -HasCommercialImpact,
    -HighCitation
  ) %>%
  as.matrix()

set.seed(42)

cvfit <- cv.glmnet(
  X,
  y,
  alpha = 1,
  nfolds = 10,
  standardize = TRUE
)

plot(cvfit)

cat(
  "lambda.min:",
  cvfit$lambda.min,
  "\nlambda.1se:",
  cvfit$lambda.1se,
  "\n"
)
## lambda.min: 0.001044611 
## lambda.1se: 0.09085496
coef_1se <- coef(cvfit, s = "lambda.1se")

cat("\nVariables retained at lambda.1se:\n")
## 
## Variables retained at lambda.1se:
print(rownames(coef_1se)[which(coef_1se != 0)])
## [1] "(Intercept)"        "AuthorCount"        "TopicCount"        
## [4] "NoveltyScore_v1"    "NoveltyScore_v2"    "NoveltyScore_v3"   
## [7] "Consolidation_Mean" "PageRank_Avg"

[INSERT Q1.2.1 TEXT HERE: Explain LASSO limits multicollinearity. State you chose lambda.1se (0.0909) over lambda.min (0.0010) to prevent overfitting. List the 7 surviving variables.]

2.4 Q1.2.2 Comparing Novelty Measures

fit_one <- function(df, novelty_var, seed = 42) {

  drop_others <- setdiff(
    c("NoveltyScore_v1", "NoveltyScore_v2", "NoveltyScore_v3"),
    novelty_var
  )

  X_sub <- df %>%
    dplyr::select(
      -any_of("RecordID"),
      -CitedCount,
      -LogCited,
      -CommercialCitations,
      -HasCommercialImpact,
      -HighCitation,
      -all_of(drop_others)
    ) %>%
    as.matrix()

  set.seed(seed)

  cv.glmnet(
    X_sub,
    df$LogCited,
    alpha = 1,
    nfolds = 10,
    standardize = TRUE
  )
}

fits <- list(
  v1 = fit_one(df_active, "NoveltyScore_v1"),
  v2 = fit_one(df_active, "NoveltyScore_v2"),
  v3 = fit_one(df_active, "NoveltyScore_v3")
)

cat("CV-MSE by novelty measure:\n")
## CV-MSE by novelty measure:
print(sapply(fits, function(f) min(f$cvm)))
##       v1       v2       v3 
## 1.773758 1.773758 1.773758

[INSERT Q1.2.2 TEXT HERE: Note that v1, v2, and v3 yielded identical CV-MSE (1.7738). State that NoveltyScore_v1 is kept due to its simpler interpretability.]

2.5 Q1.2.3 LASSO on Domain Subsets

df_inter <- df_active %>% filter(DomainCount > 1)
df_mono  <- df_active %>% filter(DomainCount == 1)

cv_inter <- fit_one(df_inter, "NoveltyScore_v3")
cv_mono  <- fit_one(df_mono, "NoveltyScore_v3")

cat(
  "Interdisciplinary CV-MSE:",
  round(min(cv_inter$cvm), 4),
  "\n"
)
## Interdisciplinary CV-MSE: 1.8802
cat(
  "Mono-domain CV-MSE:",
  round(min(cv_mono$cvm), 4),
  "\n"
)
## Mono-domain CV-MSE: 1.6622

[INSERT Q1.2.3 TEXT HERE: Note mono-domain is easier to predict (MSE 1.6622) than interdisciplinary (MSE 1.8802). Discuss how mono-domain retains local network markers (Consolidation_Mean, Clustering_Mean), while interdisciplinary relies heavily on team size and PageRank_Avg.]

3 Part 2: Non-linear Modelling (Trees)

3.1 Q2.1 Decision Tree Structure

tree_formula <- LogCited ~ . - RecordID - CitedCount - CommercialCitations - 
  HasCommercialImpact - HighCitation - NoveltyScore_v2 - NoveltyScore_v3

set.seed(42)

m_tree <- rpart(
  tree_formula,
  data = df_active,
  method = "anova",
  control = rpart.control(cp = 0.005)
)

rpart.plot(
  m_tree,
  main = "Decision Tree for Log-Citations",
  type = 3,
  digits = 3,
  fallen.leaves = TRUE,
  box.palette = "Blues"
)

[INSERT Q2.1 TEXT HERE: Discuss the root split (AuthorCount < 2), identifying team size as the core divider of impact. Trace a path: Solo papers go down to hat(y)=1.22, large teams with broad topics scale up to hat(y)=3.50.]

3.2 Q2.2 Random Forest Importance

set.seed(42)

m_rf <- randomForest(
  tree_formula,
  data = df_active,
  ntree = 200,
  importance = TRUE
)

cat(
  "Random Forest OOB-MSE:",
  m_rf$mse[length(m_rf$mse)],
  "\n"
)
## Random Forest OOB-MSE: 1.761251
varImpPlot(
  m_rf,
  main = "Random Forest: Variable Importance",
  n.var = 10
)

[INSERT Q2.2 TEXT HERE: Report OOB-MSE (1.7612). Contrast %IncMSE (drops in global accuracy, led by CorePosition_Mean) vs IncNodePurity (clean partition splits, led by NoveltyScore_v1).]

3.3 Q2.3 Partial Dependence Plots

p_nov <- partial(
  m_rf,
  pred.var = "NoveltyScore_v1",
  plot = TRUE,
  plot.engine = "ggplot2"
) + theme_minimal()

p_core <- partial(
  m_rf,
  pred.var = "CorePosition_Mean",
  plot = TRUE,
  plot.engine = "ggplot2"
) + theme_minimal()

p_nov + p_core

[INSERT Q2.3 TEXT HERE: Crucial step. Cite Uzzi et al. (2013) Theory of Optimal Distinctiveness. The NoveltyScore_v1 curve goes up but flattens/drops at extreme levels—proving radical papers get penalized. CorePosition_Mean has a steady upward step.]

4 Part 3: Model Selection and Classification

4.1 Q3.1 Classifier Setup and Safeguards

df_class <- df_active %>%
  mutate(
    HighCitation = factor(
      HighCitation,
      levels = c(0, 1),
      labels = c("Low", "High")
    )
  ) %>%
  dplyr::select(
    -any_of("RecordID"),
    -CitedCount,
    -LogCited,
    -CommercialCitations,
    -HasCommercialImpact,
    -NoveltyScore_v2,
    -NoveltyScore_v3
  )

set.seed(42)

train_idx <- caret::createDataPartition(
  df_class$HighCitation,
  p = 0.7,
  list = FALSE
)

tr <- df_class[train_idx, ]
te <- df_class[-train_idx, ]

predictors <- setdiff(names(tr), "HighCitation")

const_in_group <- sapply(predictors, function(v) {
  any(
    tapply(
      tr[[v]],
      tr$HighCitation,
      function(x) {
        length(unique(x)) <= 1 ||
          isTRUE(sd(x, na.rm = TRUE) < 1e-5)
      }
    )
  )
})

if (any(const_in_group)) {
  tr <- tr[, !(names(tr) %in% predictors[const_in_group])]
  te <- te[, !(names(te) %in% predictors[const_in_group])]
}

[INSERT Q3.1 TEXT HERE: Explain why variables with < 1e-5 variance were dynamically dropped: LDA cannot invert a singular within-class covariance matrix. It prevents algorithm failure.]

4.2 Q3.2 Classifier Evaluation

m_lda <- MASS::lda(HighCitation ~ ., data = tr)

pred_lda_class <- predict(m_lda, newdata = te)$class

pred_lda_prob <- predict(
  m_lda,
  newdata = te
)$posterior[, "High"]

cm_lda <- caret::confusionMatrix(
  pred_lda_class,
  te$HighCitation,
  positive = "High"
)

roc_lda <- pROC::roc(
  te$HighCitation,
  pred_lda_prob,
  levels = c("Low", "High")
)

set.seed(42)

tune_lin <- e1071::tune.svm(
  HighCitation ~ .,
  data = tr,
  kernel = "linear",
  cost = c(0.1, 1, 10),
  probability = TRUE
)

m_svm_lin <- tune_lin$best.model

pred_lin_class <- predict(m_svm_lin, newdata = te)

pred_lin_prob <- attr(
  predict(m_svm_lin, newdata = te, probability = TRUE),
  "probabilities"
)[, "High"]

cm_lin <- caret::confusionMatrix(
  pred_lin_class,
  te$HighCitation,
  positive = "High"
)

roc_lin <- pROC::roc(
  te$HighCitation,
  pred_lin_prob,
  levels = c("Low", "High")
)

set.seed(42)

tune_rbf <- e1071::tune.svm(
  HighCitation ~ .,
  data = tr,
  kernel = "radial",
  cost = c(1, 10),
  gamma = c(0.01, 0.1),
  probability = TRUE
)

m_svm_rbf <- tune_rbf$best.model

pred_rbf_class <- predict(m_svm_rbf, newdata = te)

pred_rbf_prob <- attr(
  predict(m_svm_rbf, newdata = te, probability = TRUE),
  "probabilities"
)[, "High"]

cm_rbf <- caret::confusionMatrix(
  pred_rbf_class,
  te$HighCitation,
  positive = "High"
)

roc_rbf <- pROC::roc(
  te$HighCitation,
  pred_rbf_prob,
  levels = c("Low", "High")
)

plot(roc_lda, col = "blue", main = "ROC Comparison", lwd = 2)
plot(roc_lin, col = "red", add = TRUE, lwd = 2)
plot(roc_rbf, col = "green3", add = TRUE, lwd = 2)

legend(
  "bottomright",
  legend = c("LDA", "Linear SVM", "RBF SVM"),
  col = c("blue", "red", "green3"),
  lwd = 2
)

results <- data.frame(
  Model = c("LDA", "Linear SVM", "RBF SVM"),
  Accuracy = c(
    cm_lda$overall["Accuracy"],
    cm_lin$overall["Accuracy"],
    cm_rbf$overall["Accuracy"]
  ),
  F1_Score = c(
    cm_lda$byClass["F1"],
    cm_lin$byClass["F1"],
    cm_rbf$byClass["F1"]
  )
)

print(results)
##        Model  Accuracy  F1_Score
## 1        LDA 0.6554965 0.6656089
## 2 Linear SVM 0.6534532 0.6640254
## 3    RBF SVM 0.6632611 0.6763551

[INSERT Q3.2 TEXT HERE: Discuss the trade-off. RBF SVM has the highest Accuracy (0.6633) and F1 (0.6764) because the true boundary is non-linear. LDA (0.6555) is slightly worse but offers explicit, transparent coefficients rather than a “black box”.]

5 Part 4: Bias-Variance, Polynomials, and Stacking

5.1 Q4.1 Bias-Variance Trade-Off

reg_formula <- LogCited ~ . - RecordID - CitedCount - CommercialCitations - HasCommercialImpact - HighCitation - NoveltyScore_v2 - NoveltyScore_v3
set.seed(42)
depths <- 1:15
train_mse <- numeric(15)
cv_mse <- numeric(15)

for (i in seq_along(depths)) {

  fit_full <- rpart(
    reg_formula,
    data = df_active,
    control = rpart.control(
      maxdepth = depths[i],
      cp = 0
    )
  )

  train_mse[i] <- mean(
    (predict(fit_full) - df_active$LogCited)^2
  )

  folds <- createFolds(
    df_active$LogCited,
    k = 5,
    list = TRUE
  )

  fold_err <- numeric(5)

  for (j in 1:5) {

    val_idx <- folds[[j]]

    fit_cv <- rpart(
      reg_formula,
      data = df_active[-val_idx, ],
      control = rpart.control(
        maxdepth = depths[i],
        cp = 0
      )
    )

    fold_err[j] <- mean(
      (
        predict(fit_cv, newdata = df_active[val_idx, ]) -
          df_active$LogCited[val_idx]
      )^2
    )
  }

  cv_mse[i] <- mean(fold_err)
}

bv_data <- data.frame(
  Complexity = rep(depths, 2),
  Error = c(train_mse, cv_mse),
  Type = rep(
    c("Training MSE (Bias)", "CV MSE (Variance)"),
    each = 15
  )
)

ggplot(
  bv_data,
  aes(x = Complexity, y = Error, color = Type)
) +
  geom_line(lwd = 1.2) +
  geom_point(size = 2) +
  geom_vline(
    xintercept = depths[which.min(cv_mse)],
    linetype = "dashed",
    color = "black"
  ) +
  theme_minimal()

[INSERT Q4.1 TEXT HERE: Refer to ISLR Figure 2.9. Left side is High Bias (underfitting). Dashed line is the Sweet Spot (depth 4). Right side is High Variance (overfitting) because training error drops but CV error rises.]

5.2 Q4.2 Polynomial Team Dynamics

m_poly <- lm(
  LogCited ~ poly(AuthorCount, 2),
  data = df_active
)

ggplot(df_active, aes(x = AuthorCount, y = LogCited)) +
  geom_point(alpha = 0.1, color = "gray") +
  geom_smooth(
    method = "lm",
    formula = y ~ poly(x, 2),
    color = "firebrick"
  ) +
  coord_cartesian(xlim = c(0, 12)) +
  theme_minimal()

[INSERT Q4.2 TEXT HERE: State that Year could not be used because there are only 2 distinct years in the subset. AuthorCount was used instead. Cite Wuchty et al. (2007)—the concave curve proves larger teams boost impact, but eventually hit diminishing returns due to coordination costs.]

5.3 Q4.3 Multi-method Stacking

set.seed(42)

stack_idx <- createDataPartition(
  df_active$LogCited,
  p = 0.8,
  list = FALSE
)

d_train <- df_active[stack_idx, ]
d_test  <- df_active[-stack_idx, ]

folds <- createFolds(
  d_train$LogCited,
  k = 5,
  list = TRUE
)

oof_lm <- numeric(nrow(d_train))
oof_tree <- numeric(nrow(d_train))

for(i in 1:5) {

  val_idx <- folds[[i]]

  fit_lm_base <- lm(
    reg_formula,
    data = d_train[-val_idx, ]
  )

  fit_tree_base <- rpart(
    reg_formula,
    data = d_train[-val_idx, ],
    control = rpart.control(cp = 0.005)
  )

  oof_lm[val_idx] <- predict(
    fit_lm_base,
    newdata = d_train[val_idx, ]
  )

  oof_tree[val_idx] <- predict(
    fit_tree_base,
    newdata = d_train[val_idx, ]
  )
}

meta_model <- lm(
  d_train$LogCited ~ oof_lm + oof_tree
)

final_lm <- lm(reg_formula, data = d_train)

final_tree <- rpart(
  reg_formula,
  data = d_train,
  control = rpart.control(cp = 0.005)
)

test_pred_lm <- predict(final_lm, newdata = d_test)

test_pred_tree <- predict(
  final_tree,
  newdata = d_test
)

test_pred_stack <- predict(
  meta_model,
  newdata = data.frame(
    oof_lm = test_pred_lm,
    oof_tree = test_pred_tree
  )
)

cat(
  "Linear LM: ",
  mean((test_pred_lm - d_test$LogCited)^2),
  "\n"
)
## Linear LM:  1.749618
cat(
  "Tree:      ",
  mean((test_pred_tree - d_test$LogCited)^2),
  "\n"
)
## Tree:       1.807905
cat(
  "Stack:     ",
  mean((test_pred_stack - d_test$LogCited)^2),
  "\n"
)
## Stack:      1.742187

[INSERT Q4.3 TEXT HERE: The stack meta-learner achieves the lowest Test MSE (1.7422). Explain that base models generated out-of-fold predictions to prevent leakage, and the final stack was evaluated on an untouched 20% hold-out.]

6 Discussion & Conclusion

[INSERT OVERALL DISCUSSION HERE: ~200-300 words wrapping up the whole assignment findings.]

7 References

[INSERT REFERENCES HERE: Burt (2004), Fleming (2007), James et al. (2021), Uzzi et al. (2013), Wuchty et al. (2007).]

8 Appendix: Code Explanation

[INSERT CODE EXPLANATION HERE: Max 500 words.]