Assignment 9 - Pulsars

Instruction Details:

Compare prediction when even number of each pulsar and non pulsars are used (200, 300 and 400 each) vs the proportional approach shown today

Each person in group should do calculation so you can see variation in 2X2 tables

Code:

# LOAD AND PREPARE DATA

Pulsar <- read.csv("pulsar.csv")
Pulsar1 <- Pulsar[, -1]                    # remove index column
Pulsar1$v9 <- as.factor(Pulsar1$v9)

# Separate pulsars (1) and non-pulsars (0)
Pulsar1base <- Pulsar1[Pulsar1[, 9] == 0, ]   # non-pulsars: 16,259 rows
Pulsar1true <- Pulsar1[Pulsar1[, 9] == 1, ]   # pulsars:      1,638 rows

cat("Non-pulsars (0):", nrow(Pulsar1base), "\n")

## Non-pulsars (0): 16259

cat("Pulsars     (1):", nrow(Pulsar1true), "\n\n")

## Pulsars     (1): 1639

# HELPER FUNCTION: Build and print 2x2 confusion matrix
make_2x2 <- function(predicted_probs, actuals, label) {

  # Convert actuals to 0/1 numeric (factor levels 1,2 → 0,1)
  act  <- as.numeric(actuals) - 1
  pred <- as.numeric(predicted_probs > 0.5)

  # 2x2 table cells — professor's exact formula
  TP <- sum(pred * act)              # predicted 1, actual 1
  FN <- sum((1 - pred) * act)       # predicted 0, actual 1
  FP <- sum(pred * (1 - act))       # predicted 1, actual 0
  TN <- sum((1 - pred) * (1 - act)) # predicted 0, actual 0

  mat <- matrix(
    c(TP, FN, FP, TN),
    nrow = 2, byrow = TRUE,
    dimnames = list(
      Predicted = c("Pred_1 (Pulsar)", "Pred_0 (Not Pulsar)"),
      Actual    = c("Act_1 (Pulsar)",  "Act_0 (Not Pulsar)")
    )
  )

  # Print results
  print(mat)
  cat(sprintf("  Accuracy : %.4f\n", (TP + TN) / (TP + TN + FP + FN)))
  cat(sprintf("  Precision: %.4f\n", TP / (TP + FP)))
  cat(sprintf("  Recall   : %.4f\n", TP / (TP + FN)))
  cat(sprintf("  F1 Score : %.4f\n\n", 2 * (TP/(TP+FP)) * (TP/(TP+FN)) /
                ((TP/(TP+FP)) + (TP/(TP+FN)))))

  return(c(TP = TP, FN = FN, FP = FP, TN = TN,
           Accuracy  = round((TP + TN) / (TP + TN + FP + FN), 4),
           Precision = round(TP / (TP + FP), 4),
           Recall    = round(TP / (TP + FN), 4),
           F1        = round(2*(TP/(TP+FP))*(TP/(TP+FN)) /
                               ((TP/(TP+FP))+(TP/(TP+FN))), 4)))
}

# SET SEED — each person changes this to see variation!
set.seed(123)   # <-- CHANGE THIS: 123, 456, 789, 999, etc.


# APPROACH 1: PROPORTIONAL SAMPLING
# 1000 non-pulsars + 100 pulsars
# Keeps natural class imbalance (~10:1 ratio)


cat("APPROACH 1: PROPORTIONAL SAMPLING\n")

## APPROACH 1: PROPORTIONAL SAMPLING

cat("Training: 1000 non-pulsars + 100 pulsars\n")

## Training: 1000 non-pulsars + 100 pulsars

cat("Testing:  Full dataset\n")

## Testing:  Full dataset

v1_prop <- sample(nrow(Pulsar1true), 100)    # 100 pulsars
v2_prop <- sample(nrow(Pulsar1base), 1000)   # 1000 non-pulsars

Pulsartry_prop <- rbind(Pulsar1base[v2_prop, ], Pulsar1true[v1_prop, ])

logit_prop <- glm(v9 ~ ., data = Pulsartry_prop, family = binomial(link = "logit"))
p1_prop    <- predict(logit_prop, type = "response", newdata = Pulsar1)

r1 <- make_2x2(p1_prop, Pulsar1$v9, "Proportional - Logistic Regression")

##                      Actual
## Predicted             Act_1 (Pulsar) Act_0 (Not Pulsar)
##   Pred_1 (Pulsar)               1338                301
##   Pred_0 (Not Pulsar)             78              16181
##   Accuracy : 0.9788
##   Precision: 0.9449
##   Recall   : 0.8164
##   F1 Score : 0.8759

# APPROACH 2: BALANCED SAMPLING — 200 each

cat("APPROACH 2: BALANCED SAMPLING — 200 each\n")

## APPROACH 2: BALANCED SAMPLING — 200 each

cat("Training: 200 non-pulsars + 200 pulsars\n")

## Training: 200 non-pulsars + 200 pulsars

cat("Testing:  Full dataset\n")

## Testing:  Full dataset

v1_200 <- sample(nrow(Pulsar1true), 200)
v2_200 <- sample(nrow(Pulsar1base), 200)

Pulsartry_200  <- rbind(Pulsar1base[v2_200, ], Pulsar1true[v1_200, ])
logit_200      <- glm(v9 ~ ., data = Pulsartry_200, family = binomial(link = "logit"))
p1_200         <- predict(logit_200, type = "response", newdata = Pulsar1)

r2 <- make_2x2(p1_200, Pulsar1$v9, "Balanced 200 - Logistic Regression")

##                      Actual
## Predicted             Act_1 (Pulsar) Act_0 (Not Pulsar)
##   Pred_1 (Pulsar)               1499                140
##   Pred_0 (Not Pulsar)            743              15516
##   Accuracy : 0.9507
##   Precision: 0.6686
##   Recall   : 0.9146
##   F1 Score : 0.7725

# APPROACH 3: BALANCED SAMPLING — 300 each


cat("APPROACH 3: BALANCED SAMPLING — 300 each\n")

## APPROACH 3: BALANCED SAMPLING — 300 each

cat("Training: 300 non-pulsars + 300 pulsars\n")

## Training: 300 non-pulsars + 300 pulsars

cat("Testing:  Full dataset\n")

## Testing:  Full dataset

v1_300 <- sample(nrow(Pulsar1true), 300)
v2_300 <- sample(nrow(Pulsar1base), 300)

Pulsartry_300  <- rbind(Pulsar1base[v2_300, ], Pulsar1true[v1_300, ])
logit_300      <- glm(v9 ~ ., data = Pulsartry_300, family = binomial(link = "logit"))
p1_300         <- predict(logit_300, type = "response", newdata = Pulsar1)

r3 <- make_2x2(p1_300, Pulsar1$v9, "Balanced 300 - Logistic Regression")

##                      Actual
## Predicted             Act_1 (Pulsar) Act_0 (Not Pulsar)
##   Pred_1 (Pulsar)               1500                139
##   Pred_0 (Not Pulsar)            494              15765
##   Accuracy : 0.9646
##   Precision: 0.7523
##   Recall   : 0.9152
##   F1 Score : 0.8258

# APPROACH 4: BALANCED SAMPLING — 400 each


cat("APPROACH 4: BALANCED SAMPLING — 400 each\n")

## APPROACH 4: BALANCED SAMPLING — 400 each

cat("Training: 400 non-pulsars + 400 pulsars\n")

## Training: 400 non-pulsars + 400 pulsars

cat("Testing:  Full dataset\n")

## Testing:  Full dataset

v1_400 <- sample(nrow(Pulsar1true), 400)
v2_400 <- sample(nrow(Pulsar1base), 400)

Pulsartry_400  <- rbind(Pulsar1base[v2_400, ], Pulsar1true[v1_400, ])
logit_400      <- glm(v9 ~ ., data = Pulsartry_400, family = binomial(link = "logit"))
p1_400         <- predict(logit_400, type = "response", newdata = Pulsar1)

r4 <- make_2x2(p1_400, Pulsar1$v9, "Balanced 400 - Logistic Regression")

##                      Actual
## Predicted             Act_1 (Pulsar) Act_0 (Not Pulsar)
##   Pred_1 (Pulsar)               1501                138
##   Pred_0 (Not Pulsar)            443              15816
##   Accuracy : 0.9675
##   Precision: 0.7721
##   Recall   : 0.9158
##   F1 Score : 0.8378

# SUMMARY TABLE
cat("SUMMARY: All Models Compared\n")

## SUMMARY: All Models Compared

summary_table <- rbind(
  Proportional  = r1,
  Balanced_200  = r2,
  Balanced_300  = r3,
  Balanced_400  = r4
)

print(summary_table)

##                TP  FN  FP    TN Accuracy Precision Recall     F1
## Proportional 1338 301  78 16181   0.9788    0.9449 0.8164 0.8759
## Balanced_200 1499 140 743 15516   0.9507    0.6686 0.9146 0.7725
## Balanced_300 1500 139 494 15765   0.9646    0.7523 0.9152 0.8258
## Balanced_400 1501 138 443 15816   0.9675    0.7721 0.9158 0.8378

Summary:

Data set Overview:
- The dataset is heavily imbalanced: 16,259 non-pulsars vs 1,639 pulsars. This means pulsars represent only about 9% of the data. This imbalance is the core reason why sampling strategy matters significantly.
Proportional Sampling Analysis:
- This achieved the highest accuracy (97.88%) and highest precision (94.49%). However, it had the lowest recall (81.64%) meaning it missed 301 actual pulsars (FN = 301). The model is very conservative — it rarely predicts pulsar unless very confident. This makes sense because the training data reflected the real-world imbalance, so the model learned to favor predicting non-pulsars.
Balanced Sampling Analysis:
- All balanced approaches achieved higher recall (~91-92%) meaning they caught more actual pulsars. The trade-off is lower precision — more false positives (FP):
  - Balanced 200: 743 false positives
  - Balanced 300: 494 false positives
  - Balanced 400: 443 false positives
- By training on equal numbers of each class, the model became more sensitive to detecting pulsars. As sample size increased from 200 → 300 → 400, precision improved while recall stayed stable, showing more training data helps.
The F1 score which balances precision and recall shows Proportional (0.8759) still edges out Balanced 400 (0.8378), but the gap narrows as balanced sample size grow.

The sampling strategy fundamentally changes how the logistic regression model behaves. The proportional approach produces a model that mirrors real-world class distribution but struggles to identify the minority class (pulsars). The balanced approaches sacrifice some precision to dramatically improve recall, which is arguably more important in scientific discovery contexts where missing a true pulsar is a significant loss. As balanced sample size increases, the model improves across all metrics suggesting that with enough balanced data, the gap with proportional sampling closes significantly.

Assignment 9 - Pulsars

Tsing Lee & Romika Jani

Instruction Details:

Code:

Summary: