Compare prediction when even number of each pulsar and non pulsars are used (200, 300 and 400 each) vs the proportional approach shown today
Each person in group should do calculation so you can see variation in 2X2 tables
# LOAD AND PREPARE DATA
Pulsar <- read.csv("pulsar.csv")
Pulsar1 <- Pulsar[, -1] # remove index column
Pulsar1$v9 <- as.factor(Pulsar1$v9)
# Separate pulsars (1) and non-pulsars (0)
Pulsar1base <- Pulsar1[Pulsar1[, 9] == 0, ] # non-pulsars: 16,259 rows
Pulsar1true <- Pulsar1[Pulsar1[, 9] == 1, ] # pulsars: 1,638 rows
cat("Non-pulsars (0):", nrow(Pulsar1base), "\n")
## Non-pulsars (0): 16259
cat("Pulsars (1):", nrow(Pulsar1true), "\n\n")
## Pulsars (1): 1639
# HELPER FUNCTION: Build and print 2x2 confusion matrix
make_2x2 <- function(predicted_probs, actuals, label) {
# Convert actuals to 0/1 numeric (factor levels 1,2 → 0,1)
act <- as.numeric(actuals) - 1
pred <- as.numeric(predicted_probs > 0.5)
# 2x2 table cells — professor's exact formula
TP <- sum(pred * act) # predicted 1, actual 1
FN <- sum((1 - pred) * act) # predicted 0, actual 1
FP <- sum(pred * (1 - act)) # predicted 1, actual 0
TN <- sum((1 - pred) * (1 - act)) # predicted 0, actual 0
mat <- matrix(
c(TP, FN, FP, TN),
nrow = 2, byrow = TRUE,
dimnames = list(
Predicted = c("Pred_1 (Pulsar)", "Pred_0 (Not Pulsar)"),
Actual = c("Act_1 (Pulsar)", "Act_0 (Not Pulsar)")
)
)
# Print results
print(mat)
cat(sprintf(" Accuracy : %.4f\n", (TP + TN) / (TP + TN + FP + FN)))
cat(sprintf(" Precision: %.4f\n", TP / (TP + FP)))
cat(sprintf(" Recall : %.4f\n", TP / (TP + FN)))
cat(sprintf(" F1 Score : %.4f\n\n", 2 * (TP/(TP+FP)) * (TP/(TP+FN)) /
((TP/(TP+FP)) + (TP/(TP+FN)))))
return(c(TP = TP, FN = FN, FP = FP, TN = TN,
Accuracy = round((TP + TN) / (TP + TN + FP + FN), 4),
Precision = round(TP / (TP + FP), 4),
Recall = round(TP / (TP + FN), 4),
F1 = round(2*(TP/(TP+FP))*(TP/(TP+FN)) /
((TP/(TP+FP))+(TP/(TP+FN))), 4)))
}
# SET SEED — each person changes this to see variation!
set.seed(123) # <-- CHANGE THIS: 123, 456, 789, 999, etc.
# APPROACH 1: PROPORTIONAL SAMPLING
# 1000 non-pulsars + 100 pulsars
# Keeps natural class imbalance (~10:1 ratio)
cat("APPROACH 1: PROPORTIONAL SAMPLING\n")
## APPROACH 1: PROPORTIONAL SAMPLING
cat("Training: 1000 non-pulsars + 100 pulsars\n")
## Training: 1000 non-pulsars + 100 pulsars
cat("Testing: Full dataset\n")
## Testing: Full dataset
v1_prop <- sample(nrow(Pulsar1true), 100) # 100 pulsars
v2_prop <- sample(nrow(Pulsar1base), 1000) # 1000 non-pulsars
Pulsartry_prop <- rbind(Pulsar1base[v2_prop, ], Pulsar1true[v1_prop, ])
logit_prop <- glm(v9 ~ ., data = Pulsartry_prop, family = binomial(link = "logit"))
p1_prop <- predict(logit_prop, type = "response", newdata = Pulsar1)
r1 <- make_2x2(p1_prop, Pulsar1$v9, "Proportional - Logistic Regression")
## Actual
## Predicted Act_1 (Pulsar) Act_0 (Not Pulsar)
## Pred_1 (Pulsar) 1338 301
## Pred_0 (Not Pulsar) 78 16181
## Accuracy : 0.9788
## Precision: 0.9449
## Recall : 0.8164
## F1 Score : 0.8759
# APPROACH 2: BALANCED SAMPLING — 200 each
cat("APPROACH 2: BALANCED SAMPLING — 200 each\n")
## APPROACH 2: BALANCED SAMPLING — 200 each
cat("Training: 200 non-pulsars + 200 pulsars\n")
## Training: 200 non-pulsars + 200 pulsars
cat("Testing: Full dataset\n")
## Testing: Full dataset
v1_200 <- sample(nrow(Pulsar1true), 200)
v2_200 <- sample(nrow(Pulsar1base), 200)
Pulsartry_200 <- rbind(Pulsar1base[v2_200, ], Pulsar1true[v1_200, ])
logit_200 <- glm(v9 ~ ., data = Pulsartry_200, family = binomial(link = "logit"))
p1_200 <- predict(logit_200, type = "response", newdata = Pulsar1)
r2 <- make_2x2(p1_200, Pulsar1$v9, "Balanced 200 - Logistic Regression")
## Actual
## Predicted Act_1 (Pulsar) Act_0 (Not Pulsar)
## Pred_1 (Pulsar) 1499 140
## Pred_0 (Not Pulsar) 743 15516
## Accuracy : 0.9507
## Precision: 0.6686
## Recall : 0.9146
## F1 Score : 0.7725
# APPROACH 3: BALANCED SAMPLING — 300 each
cat("APPROACH 3: BALANCED SAMPLING — 300 each\n")
## APPROACH 3: BALANCED SAMPLING — 300 each
cat("Training: 300 non-pulsars + 300 pulsars\n")
## Training: 300 non-pulsars + 300 pulsars
cat("Testing: Full dataset\n")
## Testing: Full dataset
v1_300 <- sample(nrow(Pulsar1true), 300)
v2_300 <- sample(nrow(Pulsar1base), 300)
Pulsartry_300 <- rbind(Pulsar1base[v2_300, ], Pulsar1true[v1_300, ])
logit_300 <- glm(v9 ~ ., data = Pulsartry_300, family = binomial(link = "logit"))
p1_300 <- predict(logit_300, type = "response", newdata = Pulsar1)
r3 <- make_2x2(p1_300, Pulsar1$v9, "Balanced 300 - Logistic Regression")
## Actual
## Predicted Act_1 (Pulsar) Act_0 (Not Pulsar)
## Pred_1 (Pulsar) 1500 139
## Pred_0 (Not Pulsar) 494 15765
## Accuracy : 0.9646
## Precision: 0.7523
## Recall : 0.9152
## F1 Score : 0.8258
# APPROACH 4: BALANCED SAMPLING — 400 each
cat("APPROACH 4: BALANCED SAMPLING — 400 each\n")
## APPROACH 4: BALANCED SAMPLING — 400 each
cat("Training: 400 non-pulsars + 400 pulsars\n")
## Training: 400 non-pulsars + 400 pulsars
cat("Testing: Full dataset\n")
## Testing: Full dataset
v1_400 <- sample(nrow(Pulsar1true), 400)
v2_400 <- sample(nrow(Pulsar1base), 400)
Pulsartry_400 <- rbind(Pulsar1base[v2_400, ], Pulsar1true[v1_400, ])
logit_400 <- glm(v9 ~ ., data = Pulsartry_400, family = binomial(link = "logit"))
p1_400 <- predict(logit_400, type = "response", newdata = Pulsar1)
r4 <- make_2x2(p1_400, Pulsar1$v9, "Balanced 400 - Logistic Regression")
## Actual
## Predicted Act_1 (Pulsar) Act_0 (Not Pulsar)
## Pred_1 (Pulsar) 1501 138
## Pred_0 (Not Pulsar) 443 15816
## Accuracy : 0.9675
## Precision: 0.7721
## Recall : 0.9158
## F1 Score : 0.8378
# SUMMARY TABLE
cat("SUMMARY: All Models Compared\n")
## SUMMARY: All Models Compared
summary_table <- rbind(
Proportional = r1,
Balanced_200 = r2,
Balanced_300 = r3,
Balanced_400 = r4
)
print(summary_table)
## TP FN FP TN Accuracy Precision Recall F1
## Proportional 1338 301 78 16181 0.9788 0.9449 0.8164 0.8759
## Balanced_200 1499 140 743 15516 0.9507 0.6686 0.9146 0.7725
## Balanced_300 1500 139 494 15765 0.9646 0.7523 0.9152 0.8258
## Balanced_400 1501 138 443 15816 0.9675 0.7721 0.9158 0.8378
Data set Overview:
Proportional Sampling Analysis:
Balanced Sampling Analysis:
All balanced approaches achieved higher recall (~91-92%) meaning they caught more actual pulsars. The trade-off is lower precision — more false positives (FP):
Balanced 200: 743 false positives
Balanced 300: 494 false positives
Balanced 400: 443 false positives
By training on equal numbers of each class, the model became more sensitive to detecting pulsars. As sample size increased from 200 → 300 → 400, precision improved while recall stayed stable, showing more training data helps.
The F1 score which balances precision and recall shows Proportional (0.8759) still edges out Balanced 400 (0.8378), but the gap narrows as balanced sample size grow.
The sampling strategy fundamentally changes how the logistic regression model behaves. The proportional approach produces a model that mirrors real-world class distribution but struggles to identify the minority class (pulsars). The balanced approaches sacrifice some precision to dramatically improve recall, which is arguably more important in scientific discovery contexts where missing a true pulsar is a significant loss. As balanced sample size increases, the model improves across all metrics suggesting that with enough balanced data, the gap with proportional sampling closes significantly.