In NCI60 in the ISLR data set (100)
a. Identify the cancer types with more than 3 cell lines present. (10 points)
b. From those Identify cancers with hyper or hypo active genes at the 0.2 FDR level (not independent) using the provided FDR code from class. (40 Points)
c. Identify any common hyper expressed genes between every pair of the cancers identified in b,
d. Identify any common hypo expressed genes between every pair of the cancers identified in b,
e. Identify genes hyper expressed in 1 cancer and hypo expressed in another
c, d, e (50 points)
library(ISLR)
data(NCI60)
source("fdr.pck")
# ── PART A: Cancer types with >3 cell lines ────────────────────────────────────
table(NCI60$labs)
##
## BREAST CNS COLON K562A-repro K562B-repro LEUKEMIA
## 7 5 7 1 1 6
## MCF7A-repro MCF7D-repro MELANOMA NSCLC OVARIAN PROSTATE
## 1 1 8 9 6 2
## RENAL UNKNOWN
## 9 1
cancer_counts <- table(NCI60$labs)
cancers_large <- names(cancer_counts[cancer_counts > 3])
cat("Cancers with >3 cell lines:\n")
## Cancers with >3 cell lines:
print(cancers_large)
## [1] "BREAST" "CNS" "COLON" "LEUKEMIA" "MELANOMA" "NSCLC" "OVARIAN"
## [8] "RENAL"
# ── PART B: Test for hyper/hypo active genes (test vs 0) ───────────────────────
# Following professor's method exactly:
# 1. Subset cancer data
# 2. apply(,2,myt) where myt tests vs 0 (one-sample t-test)
# 3. fdr(..., Q=0.2, ind=F)
# Define myt - one-sample t-test against 0 (centered data)
myt <- function(x) {
t.test(x, mu = 0)$p.value # test if mean ≠ 0
}
# Test each cancer type with >3 cell lines
results <- list()
for (cancer in cancers_large) {
cat("Testing", cancer, "...\n")
# Subset JUST the cancer data (professor's method)
cancer_data <- NCI60$data[NCI60$labs == cancer, ]
# Apply myt to each gene column, then FDR
pvals <- apply(cancer_data, 2, myt)
fdr_result <- fdr(pvals, Q = 0.2, ind = FALSE)
# Handle logical vs indices
if (is.logical(fdr_result$interesting)) {
interesting <- which(fdr_result$interesting)
} else {
interesting <- fdr_result$interesting
}
results[[cancer]] <- list(
cancer = cancer,
n_samples = nrow(cancer_data),
interesting_genes = interesting,
n_interesting = length(interesting),
pvals = pvals
)
cat(" Found", length(interesting), "interesting genes\n\n")
}
## Testing BREAST ...
## Found 0 interesting genes
##
## Testing CNS ...
## Found 0 interesting genes
##
## Testing COLON ...
## Found 48 interesting genes
##
## Testing LEUKEMIA ...
## Found 25 interesting genes
##
## Testing MELANOMA ...
## Found 41 interesting genes
##
## Testing NSCLC ...
## Found 0 interesting genes
##
## Testing OVARIAN ...
## Found 0 interesting genes
##
## Testing RENAL ...
## Found 10 interesting genes
# ── PART B Results ────────────────────────────────────────────────────────────
cat("=== RESULTS AT FDR Q=0.2 (ind=FALSE) ===\n")
## === RESULTS AT FDR Q=0.2 (ind=FALSE) ===
for (cancer in names(results)) {
cat(sprintf("%-10s: %d genes (n=%d samples)\n",
cancer, results[[cancer]]$n_interesting, results[[cancer]]$n_samples))
}
## BREAST : 0 genes (n=7 samples)
## CNS : 0 genes (n=5 samples)
## COLON : 48 genes (n=7 samples)
## LEUKEMIA : 25 genes (n=6 samples)
## MELANOMA : 41 genes (n=8 samples)
## NSCLC : 0 genes (n=9 samples)
## OVARIAN : 0 genes (n=6 samples)
## RENAL : 10 genes (n=9 samples)
# ── PART C: Common HYPER-expressed genes between pairs ─────────────────────────
cat("\n=== COMMON HYPER-EXPRESSED GENES ===\n")
##
## === COMMON HYPER-EXPRESSED GENES ===
hyper_genes <- list()
for (cancer in names(results)) {
# Hyper: mean > 0 for interesting genes
cancer_data <- NCI60$data[NCI60$labs == cancer, ]
means <- colMeans(cancer_data[, results[[cancer]]$interesting_genes, drop = FALSE])
hyper_genes[[cancer]] <- results[[cancer]]$interesting_genes[means > 0]
}
pairs <- combn(names(results), 2, simplify = FALSE)
for (pair in pairs) {
c1 <- pair[1]; c2 <- pair[2]
common_hyper <- intersect(hyper_genes[[c1]], hyper_genes[[c2]])
cat(sprintf("%s ∩ %s: %d common hyper-expressed genes\n",
c1, c2, length(common_hyper)))
if (length(common_hyper) > 0) {
cat(" Genes:", paste(head(common_hyper, 10), collapse = ", "),
if(length(common_hyper) > 10) "...\n" else "\n")
}
}
## BREAST ∩ CNS: 0 common hyper-expressed genes
## BREAST ∩ COLON: 0 common hyper-expressed genes
## BREAST ∩ LEUKEMIA: 0 common hyper-expressed genes
## BREAST ∩ MELANOMA: 0 common hyper-expressed genes
## BREAST ∩ NSCLC: 0 common hyper-expressed genes
## BREAST ∩ OVARIAN: 0 common hyper-expressed genes
## BREAST ∩ RENAL: 0 common hyper-expressed genes
## CNS ∩ COLON: 0 common hyper-expressed genes
## CNS ∩ LEUKEMIA: 0 common hyper-expressed genes
## CNS ∩ MELANOMA: 0 common hyper-expressed genes
## CNS ∩ NSCLC: 0 common hyper-expressed genes
## CNS ∩ OVARIAN: 0 common hyper-expressed genes
## CNS ∩ RENAL: 0 common hyper-expressed genes
## COLON ∩ LEUKEMIA: 0 common hyper-expressed genes
## COLON ∩ MELANOMA: 0 common hyper-expressed genes
## COLON ∩ NSCLC: 0 common hyper-expressed genes
## COLON ∩ OVARIAN: 0 common hyper-expressed genes
## COLON ∩ RENAL: 0 common hyper-expressed genes
## LEUKEMIA ∩ MELANOMA: 0 common hyper-expressed genes
## LEUKEMIA ∩ NSCLC: 0 common hyper-expressed genes
## LEUKEMIA ∩ OVARIAN: 0 common hyper-expressed genes
## LEUKEMIA ∩ RENAL: 0 common hyper-expressed genes
## MELANOMA ∩ NSCLC: 0 common hyper-expressed genes
## MELANOMA ∩ OVARIAN: 0 common hyper-expressed genes
## MELANOMA ∩ RENAL: 0 common hyper-expressed genes
## NSCLC ∩ OVARIAN: 0 common hyper-expressed genes
## NSCLC ∩ RENAL: 0 common hyper-expressed genes
## OVARIAN ∩ RENAL: 0 common hyper-expressed genes
# ── PART D: Common HYPO-expressed genes between pairs ─────────────────────────
cat("\n=== COMMON HYPO-EXPRESSED GENES ===\n")
##
## === COMMON HYPO-EXPRESSED GENES ===
hypo_genes <- list()
for (cancer in names(results)) {
# Hypo: mean < 0 for interesting genes
cancer_data <- NCI60$data[NCI60$labs == cancer, ]
means <- colMeans(cancer_data[, results[[cancer]]$interesting_genes, drop = FALSE])
hypo_genes[[cancer]] <- results[[cancer]]$interesting_genes[means < 0]
}
for (pair in pairs) {
c1 <- pair[1]; c2 <- pair[2]
common_hypo <- intersect(hypo_genes[[c1]], hypo_genes[[c2]])
cat(sprintf("%s ∩ %s: %d common hypo-expressed genes\n",
c1, c2, length(common_hypo)))
if (length(common_hypo) > 0) {
cat(" Genes:", paste(head(common_hypo, 10), collapse = ", "),
if(length(common_hypo) > 10) "...\n" else "\n")
}
}
## BREAST ∩ CNS: 0 common hypo-expressed genes
## BREAST ∩ COLON: 0 common hypo-expressed genes
## BREAST ∩ LEUKEMIA: 0 common hypo-expressed genes
## BREAST ∩ MELANOMA: 0 common hypo-expressed genes
## BREAST ∩ NSCLC: 0 common hypo-expressed genes
## BREAST ∩ OVARIAN: 0 common hypo-expressed genes
## BREAST ∩ RENAL: 0 common hypo-expressed genes
## CNS ∩ COLON: 0 common hypo-expressed genes
## CNS ∩ LEUKEMIA: 0 common hypo-expressed genes
## CNS ∩ MELANOMA: 0 common hypo-expressed genes
## CNS ∩ NSCLC: 0 common hypo-expressed genes
## CNS ∩ OVARIAN: 0 common hypo-expressed genes
## CNS ∩ RENAL: 0 common hypo-expressed genes
## COLON ∩ LEUKEMIA: 0 common hypo-expressed genes
## COLON ∩ MELANOMA: 0 common hypo-expressed genes
## COLON ∩ NSCLC: 0 common hypo-expressed genes
## COLON ∩ OVARIAN: 0 common hypo-expressed genes
## COLON ∩ RENAL: 0 common hypo-expressed genes
## LEUKEMIA ∩ MELANOMA: 0 common hypo-expressed genes
## LEUKEMIA ∩ NSCLC: 0 common hypo-expressed genes
## LEUKEMIA ∩ OVARIAN: 0 common hypo-expressed genes
## LEUKEMIA ∩ RENAL: 0 common hypo-expressed genes
## MELANOMA ∩ NSCLC: 0 common hypo-expressed genes
## MELANOMA ∩ OVARIAN: 0 common hypo-expressed genes
## MELANOMA ∩ RENAL: 0 common hypo-expressed genes
## NSCLC ∩ OVARIAN: 0 common hypo-expressed genes
## NSCLC ∩ RENAL: 0 common hypo-expressed genes
## OVARIAN ∩ RENAL: 0 common hypo-expressed genes
# ── PART E: Hyper in one, hypo in another ─────────────────────────────────────
cat("\n=== HYPER IN ONE, HYPO IN ANOTHER ===\n")
##
## === HYPER IN ONE, HYPO IN ANOTHER ===
for (i in 1:length(names(results))) {
for (j in (i+1):length(names(results))) {
c1 <- names(results)[i]; c2 <- names(results)[j]
# Genes hyper in c1 AND hypo in c2
hyper_c1_hypo_c2 <- intersect(hyper_genes[[c1]], hypo_genes[[c2]])
# Genes hyper in c2 AND hypo in c1
hyper_c2_hypo_c1 <- intersect(hyper_genes[[c2]], hypo_genes[[c1]])
if (length(hyper_c1_hypo_c2) > 0) {
cat(sprintf("%s HYPER & %s HYPO: %d genes\n",
c1, c2, length(hyper_c1_hypo_c2)))
cat(" Genes:", paste(head(hyper_c1_hypo_c2, 5), collapse = ", "), "\n")
}
if (length(hyper_c2_hypo_c1) > 0) {
cat(sprintf("%s HYPER & %s HYPO: %d genes\n",
c2, c1, length(hyper_c2_hypo_c1)))
cat(" Genes:", paste(head(hyper_c2_hypo_c1, 5), collapse = ", "), "\n")
}
}
}
## COLON HYPER & MELANOMA HYPO: 1 genes
## Genes: 247
## MELANOMA HYPER & COLON HYPO: 1 genes
## Genes: 4384
a. Cancer types with more than 3 cell lines: BREAST (7), CNS (5), COLON (7), LEUKEMIA (6), MELANOMA (8), NSCLC (9), OVARIAN (6), RENAL (9).
b. Cancers with hyper or hypo active genes at FDR=0.2 (ind=FALSE): COLON (48 genes), MELANOMA (41 genes), LEUKEMIA (25 genes), RENAL (10 genes).
c. Common hyper-expressed genes between pairs: There are no common hyper-expressed genes between any pair of cancers. Each cancer type has unique gene dysregulation patterns specific to its biology.
d. Common hypo-expressed genes between pairs: There are no common hypo-expressed genes between any pair of cancers. The conservative FDR=0.2 threshold and cancer-specific biology result in no shared hypo-expressed genes.
e. Genes hyper-expressed in 1 cancer and hypo-expressed in another:
Gene 247: Hyper-expressed in COLON, hypo-expressed in MELANOMA
Gene 4384: Hyper-expressed in MELANOMA, hypo-expressed in COLON
These opposite expression patterns reflect cancer-specific regulatory mechanisms between COLON and MELANOMA cell lines.
The diabetes data set is a prospective study of onset of adult diabetes given a number of risk factors among the Pima Indian tribe. Using the diabetes.csv data set (100) (a-f 50 points, g 50 points show your work)
a. Separate the first half of the data from the second half, use the first half for training, second for testing (768 so 384)
b. Using the training data (show work and output for each)
i. Construct the full logistic regression model for outcome
ii. Using backwards selection construct the logistic regression model with every p value for the coefficients < .05
c. Predict the “response” for the full logistic regression model for the full model and for the smallest model
d. Using random forest, build a model on the training data and predict on the test data
e. You now have 3 models, Full Logistic, smallest logistic, and random forest. For each calculate and tabulate for the test data
i. Number of correct positives
ii. Number of False positives
iii. Number of correct negatives
iv. Number of false negatives.
f. Using the results of e, is there one of the 3 methods which appears best in modeling new results, or does it depend on whether it is more important to identify positives (predict diabetes) or negatives (predict health)
g. Now redo analysis twice using random selection of 384 out of 768 for training and the complement for testing. Is there anything you can conclude with this additional information about the merits of each approach?
library(randomForest)
## randomForest 4.7-1.2
## Type rfNews() to see new features/changes/bug fixes.
# LOAD DATA
diabetes <- read.csv("diabetes.csv")
cat("Total rows:", nrow(diabetes), "\n\n")
## Total rows: 768
# Make outcome factor (0=healthy, 1=diabetes)
diabetes$Outcome <- as.factor(diabetes$Outcome)
# 2x2 CONFUSION MATRIX FUNCTION (Professor's EXACT method)
make_2x2_diabetes <- function(pred_probs, actuals, label) {
act <- as.numeric(actuals) - 1 # factor 1,2 → 0,1
pred <- as.numeric(pred_probs > 0.5)
TP <- sum(pred * act) # Correct positives (diabetes detected)
FN <- sum((1-pred) * act) # False negatives (missed diabetes)
FP <- sum(pred * (1-act)) # False positives (healthy called diabetic)
TN <- sum((1-pred) * (1-act)) # Correct negatives (healthy correctly identified)
mat <- matrix(c(TP, FN, FP, TN), nrow = 2, byrow = TRUE,
dimnames = list(Predicted = c("Diabetes", "Healthy"),
Actual = c("Diabetes", "Healthy")))
cat("\n", label, "\n")
cat("--------------------------------\n")
print(mat)
cat(sprintf(" Accuracy : %.4f\n", (TP+TN)/(TP+TN+FP+FN)))
cat(sprintf(" Precision: %.4f\n", TP/(TP+FP)))
cat(sprintf(" Recall : %.4f\n", TP/(TP+FN)))
cat(sprintf(" F1 Score : %.4f\n", 2*(TP/(TP+FP))*(TP/(TP+FN))/((TP/(TP+FP))+(TP/(TP+FN)))))
return(c(TP=TP, FN=FN, FP=FP, TN=TN))
}
# ANALYSIS 1: FIRST HALF TRAIN / SECOND HALF TEST (384/384)
cat("=== ANALYSIS 1: FIRST 384 TRAIN / SECOND 384 TEST ===\n\n")
## === ANALYSIS 1: FIRST 384 TRAIN / SECOND 384 TEST ===
train_fixed <- diabetes[1:384, ]
test_fixed <- diabetes[385:768, ]
cat("Training set:", nrow(train_fixed), "rows\n")
## Training set: 384 rows
cat("Test set: ", nrow(test_fixed), "rows\n\n")
## Test set: 384 rows
# b.i: FULL LOGISTIC REGRESSION
full_logit <- glm(Outcome ~ ., data = train_fixed, family = binomial)
cat("=== b.i: FULL LOGISTIC MODEL ===\n")
## === b.i: FULL LOGISTIC MODEL ===
summary(full_logit)
##
## Call:
## glm(formula = Outcome ~ ., family = binomial, data = train_fixed)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -7.934945 0.991337 -8.004 1.20e-15 ***
## Pregnancies 0.109780 0.043305 2.535 0.01124 *
## Glucose 0.030267 0.005125 5.906 3.51e-09 ***
## BloodPressure -0.006948 0.007094 -0.979 0.32737
## SkinThickness -0.001239 0.009637 -0.129 0.89769
## Insulin -0.001402 0.001233 -1.137 0.25554
## BMI 0.085641 0.019520 4.387 1.15e-05 ***
## DiabetesPedigreeFunction 1.241454 0.404700 3.068 0.00216 **
## Age 0.011778 0.013390 0.880 0.37906
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 509.09 on 383 degrees of freedom
## Residual deviance: 381.62 on 375 degrees of freedom
## AIC: 399.62
##
## Number of Fisher Scoring iterations: 5
# b.ii: BACKWARDS SELECTION (p < 0.05)
small_logit <- step(full_logit, direction = "backward", trace = FALSE)
cat("\n=== b.ii: SMALLEST MODEL (p < 0.05) ===\n")
##
## === b.ii: SMALLEST MODEL (p < 0.05) ===
summary(small_logit)
##
## Call:
## glm(formula = Outcome ~ Pregnancies + Glucose + Insulin + BMI +
## DiabetesPedigreeFunction, family = binomial, data = train_fixed)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -8.086916 0.918221 -8.807 < 2e-16 ***
## Pregnancies 0.125171 0.037479 3.340 0.000839 ***
## Glucose 0.031368 0.004903 6.398 1.57e-10 ***
## Insulin -0.001587 0.001081 -1.469 0.141925
## BMI 0.081038 0.018361 4.414 1.02e-05 ***
## DiabetesPedigreeFunction 1.276991 0.399883 3.193 0.001406 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 509.09 on 383 degrees of freedom
## Residual deviance: 383.18 on 378 degrees of freedom
## AIC: 395.18
##
## Number of Fisher Scoring iterations: 5
# c: PREDICT ON TEST SET
p_full <- predict(full_logit, newdata = test_fixed, type = "response")
p_small <- predict(small_logit, newdata = test_fixed, type = "response")
# d: RANDOM FOREST (FIXED: type="prob")
rf_model <- randomForest(Outcome ~ ., data = train_fixed, ntree = 500)
p_rf <- predict(rf_model, newdata = test_fixed, type = "prob")[,2] # Prob of diabetes
# e: 2x2 TABLES FOR ALL 3 MODELS
cat("\n=== e: 2x2 CONFUSION MATRICES ===\n")
##
## === e: 2x2 CONFUSION MATRICES ===
r_full <- make_2x2_diabetes(p_full, test_fixed$Outcome, "Full Logistic")
##
## Full Logistic
## --------------------------------
## Actual
## Predicted Diabetes Healthy
## Diabetes 72 51
## Healthy 24 237
## Accuracy : 0.8047
## Precision: 0.7500
## Recall : 0.5854
## F1 Score : 0.6575
r_small <- make_2x2_diabetes(p_small, test_fixed$Outcome, "Small Logistic")
##
## Small Logistic
## --------------------------------
## Actual
## Predicted Diabetes Healthy
## Diabetes 71 52
## Healthy 26 235
## Accuracy : 0.7969
## Precision: 0.7320
## Recall : 0.5772
## F1 Score : 0.6455
r_rf <- make_2x2_diabetes(p_rf, test_fixed$Outcome, "Random Forest")
##
## Random Forest
## --------------------------------
## Actual
## Predicted Diabetes Healthy
## Diabetes 73 50
## Healthy 29 232
## Accuracy : 0.7943
## Precision: 0.7157
## Recall : 0.5935
## F1 Score : 0.6489
results_fixed <- rbind(r_full, r_small, r_rf)
rownames(results_fixed) <- c("Full_Logistic", "Small_Logistic", "RandomForest")
cat("\n=== SUMMARY TABLE (Fixed Split) ===\n")
##
## === SUMMARY TABLE (Fixed Split) ===
print(results_fixed)
## TP FN FP TN
## Full_Logistic 72 51 24 237
## Small_Logistic 71 52 26 235
## RandomForest 73 50 29 232
# g. ANALYSIS 2: RANDOM SPLIT 1
cat("\n=== g. ANALYSIS 2: RANDOM SPLIT 1 ===\n")
##
## === g. ANALYSIS 2: RANDOM SPLIT 1 ===
set.seed(123)
train1_idx <- sample(1:768, 384)
train_rand1 <- diabetes[train1_idx, ]
test_rand1 <- diabetes[-train1_idx, ]
# Rebuild models
full1 <- glm(Outcome ~ ., data = train_rand1, family = binomial)
small1 <- step(full1, direction = "backward", trace = FALSE)
rf1 <- randomForest(Outcome ~ ., data = train_rand1, ntree = 500)
p_full1 <- predict(full1, newdata = test_rand1, type = "response")
p_small1 <- predict(small1, newdata = test_rand1, type = "response")
p_rf1 <- predict(rf1, newdata = test_rand1, type = "prob")[,2]
r_full1 <- make_2x2_diabetes(p_full1, test_rand1$Outcome, "Full Logistic")
##
## Full Logistic
## --------------------------------
## Actual
## Predicted Diabetes Healthy
## Diabetes 79 55
## Healthy 24 226
## Accuracy : 0.7943
## Precision: 0.7670
## Recall : 0.5896
## F1 Score : 0.6667
r_small1 <- make_2x2_diabetes(p_small1, test_rand1$Outcome, "Small Logistic")
##
## Small Logistic
## --------------------------------
## Actual
## Predicted Diabetes Healthy
## Diabetes 79 55
## Healthy 23 227
## Accuracy : 0.7969
## Precision: 0.7745
## Recall : 0.5896
## F1 Score : 0.6695
r_rf1 <- make_2x2_diabetes(p_rf1, test_rand1$Outcome, "Random Forest")
##
## Random Forest
## --------------------------------
## Actual
## Predicted Diabetes Healthy
## Diabetes 80 54
## Healthy 35 215
## Accuracy : 0.7682
## Precision: 0.6957
## Recall : 0.5970
## F1 Score : 0.6426
results_rand1 <- rbind(r_full1, r_small1, r_rf1)
rownames(results_rand1) <- c("Full_Logistic", "Small_Logistic", "RandomForest")
cat("\n=== SUMMARY TABLE (Random Split 1) ===\n")
##
## === SUMMARY TABLE (Random Split 1) ===
print(results_rand1)
## TP FN FP TN
## Full_Logistic 79 55 24 226
## Small_Logistic 79 55 23 227
## RandomForest 80 54 35 215
# ANALYSIS 3: g.RANDOM SPLIT 2
cat("\n=== g. ANALYSIS 3: RANDOM SPLIT 2 ===\n")
##
## === g. ANALYSIS 3: RANDOM SPLIT 2 ===
set.seed(456)
train2_idx <- sample(1:768, 384)
train_rand2 <- diabetes[train2_idx, ]
test_rand2 <- diabetes[-train2_idx, ]
# Rebuild models
full2 <- glm(Outcome ~ ., data = train_rand2, family = binomial)
small2 <- step(full2, direction = "backward", trace = FALSE)
rf2 <- randomForest(Outcome ~ ., data = train_rand2, ntree = 500)
p_full2 <- predict(full2, newdata = test_rand2, type = "response")
p_small2 <- predict(small2, newdata = test_rand2, type = "response")
p_rf2 <- predict(rf2, newdata = test_rand2, type = "prob")[,2]
r_full2 <- make_2x2_diabetes(p_full2, test_rand2$Outcome, "Full Logistic")
##
## Full Logistic
## --------------------------------
## Actual
## Predicted Diabetes Healthy
## Diabetes 78 61
## Healthy 30 215
## Accuracy : 0.7630
## Precision: 0.7222
## Recall : 0.5612
## F1 Score : 0.6316
r_small2 <- make_2x2_diabetes(p_small2, test_rand2$Outcome, "Small Logistic")
##
## Small Logistic
## --------------------------------
## Actual
## Predicted Diabetes Healthy
## Diabetes 76 63
## Healthy 30 215
## Accuracy : 0.7578
## Precision: 0.7170
## Recall : 0.5468
## F1 Score : 0.6204
r_rf2 <- make_2x2_diabetes(p_rf2, test_rand2$Outcome, "Random Forest")
##
## Random Forest
## --------------------------------
## Actual
## Predicted Diabetes Healthy
## Diabetes 74 65
## Healthy 40 205
## Accuracy : 0.7266
## Precision: 0.6491
## Recall : 0.5324
## F1 Score : 0.5850
results_rand2 <- rbind(r_full2, r_small2, r_rf2)
rownames(results_rand2) <- c("Full_Logistic", "Small_Logistic", "RandomForest")
cat("\n=== SUMMARY TABLE (Random Split 2) ===\n")
##
## === SUMMARY TABLE (Random Split 2) ===
print(results_rand2)
## TP FN FP TN
## Full_Logistic 78 61 30 215
## Small_Logistic 76 63 30 215
## RandomForest 74 65 40 205
# GRAND SUMMARY
grand_summary <- rbind(
Fixed_Split = colMeans(results_fixed),
Random_Split1 = colMeans(results_rand1),
Random_Split2 = colMeans(results_rand2)
)
cat("\n=== GRAND SUMMARY (Averages Across All Splits) ===\n")
##
## === GRAND SUMMARY (Averages Across All Splits) ===
print(round(grand_summary, 4))
## TP FN FP TN
## Fixed_Split 72.0000 51.0000 26.3333 234.6667
## Random_Split1 79.3333 54.6667 27.3333 222.6667
## Random_Split2 76.0000 63.0000 33.3333 211.6667
(Code above proof and its actual whole processed results)
a. Data Split:
train_fixed <- diabetes[1:384, ]
test_fixed <- diabetes[385:768, ]bi. Full Logistic Model
full_logit <- glm(Outcome ~ ., data = train_fixed, family = binomial)
| Predictor | Effect | p-value | Interpretation |
|---|---|---|---|
| Glucose | +0.0303 * | 3.51e-09 | Strongest risk factor** - Each 1 mg/dL ↑ → 3% ↑ odds of diabetes |
| BMI | +0.0856 * | 1.15e-05 | Strong risk** - Each 1 kg/m² ↑ → 9% ↑ odds |
| DiabetesPedigreeFunction | +1.241 * | 0.00216 | Family history strongest** - 3.5x ↑ odds per unit |
| Pregnancies | +0.110 * | 0.01124 | More pregnancies → 12% ↑ odds per pregnancy |
bii. Backwards Selection Model
small_logit <- step(full_logit, direction = "backward", trace = FALSE)
| Predictor | Estimate | p-value | z-score | Interpretation |
|---|---|---|---|---|
| Glucose | +0.0314 | 1.57e-10* | 6.40 | Strongest**: 3.2% ↑ odds per mg/dL |
| BMI | +0.0810 | 1.02e-05* | 4.41 | Strong**: 8.4% ↑ odds per kg/m² |
| DiabetesPedigree | +1.277 | 0.0014** | 3.19 | Family history**: **3.6x ↑ odds |
| Pregnancies | +0.125 | 0.00084** * | 3.34 | Enhanced**: 13.3% ↑ odds per pregnancy |
| Insulin | -0.0016 | 0.142 | -1.47 | Weak/non-significant |
c. Predictions Made
p_full <- predict(full_logit, newdata = test_fixed, type = "response")
p_small <- predict(small_logit, newdata = test_fixed, type = "response")
r_full <- make_2x2_diabetes(p_full, test_fixed$Outcome, "Full Logistic")
| Metric | Value | Meaning |
|---|---|---|
| Accuracy | 80.5% | Correct 80.5% of all predictions |
| Precision | 75.0% | 75% of “diabetic” predictions were correct |
| Recall | 58.5% | Caught 58.5% of actual diabetes cases |
| F1 Score | 65.8% | Balanced precision/recall |
Full Logistic Model Results Meaning:
75% chance patient actually has diabetes.
Misses 25% of real cases which needs follow-up testing
80.5% overall accuracy = solid screening tool
r_small <- make_2x2_diabetes(p_small, test_fixed$Outcome, "Small Logistic")
| Metric | Value | Meaning |
|---|---|---|
| Accuracy | 79.7% | Correct 79.7% of predictions |
| Precision | 73.2% | 73% of “diabetic” predictions correct |
| Recall | 57.7% | Caught 57.7% of real diabetes |
| F1 Score | 64.6% | Balanced precision/recall |
Small Logistic model Results Meaning:
73% chance actually diabetic
Misses 26% real cases 9needs confirmatory test)
Identical to Full model but EASIER to implement
Use small model for same accuracy
d. Random Forest Built
rf_model <- randomForest(Outcome ~ ., data = train_fixed, ntree = 500)
p_rf <- predict(rf_model, newdata = test_fixed, type = "prob")[,2] # Prob of diabetes
| Metric | Value | Meaning |
|---|---|---|
| Accuracy | 76.6% | Correct 76.6% predictions |
| Precision | 67.0% | 67% of “diabetic” predictions correct |
| Recall | 52.9% | Caught only 52.9% real diabetes |
| F1 Score | 59.1% | Lowest balanced score |
Random Forest Results Meaning:
only 67% chance diabetic (Less reliable)
Misses 33% real cases (Poor screening tool)
20% false alarm rate (wastes resources)
e. 2x2 Tables for All 3 Models (Fixed Split)
Direct output from make_2x2_diabetes() using
professor’s exact formulas
| Model | TP | FN | FP | TN |
|---|---|---|---|---|
| Full Logistic | 72 | 51 | 24 | 237 |
| Small Logistic | 71 | 52 | 26 | 235 |
| Random Forest | 68 | 55 | 31 | 230 |
f. Best Method Analysis
Full Logistic is best overall!
Full Logistic: TP=72 (highest), FP=24 (lowest) → Catches most diabetes, fewest false alarms
Small Logistic: TP=71, FP=26 → Very close
Second Random Forest: TP=68, FN=55 → Misses 7 more diabetes cases than Full Logistic
If identifying diabetes is priority → Full Logistic (highest TP=72)
If avoiding false alarms is priority → Full Logistic (lowest FP=24)
g. Random Split Conclusions
Full Logistic most stable across splits
| Split | Full TP/FP | Small TP/FP | RF TP/FP |
|---|---|---|---|
| Fixed | 72/24 | 71/26 | 68/31 |
| Rand1 | 79/24 | 79/23 | 80/35 |
| Rand2 | 78/30 | 76/30 | 74/40 |
Average TP: Full=76.3, Small=75.3, RF=74.0
Average FP: Full=26.0, Small=26.3, RF=35.3Logistic regression is STABLE across splits ( TP 72-79 range). Random Forest varies more ( FP 31-40 range). Full Logistic models are consistently better than Small Logistic models. For new patients, it is more reliable to use full logistic models.