A school has 1,600 students voting on whether to convert the school off fossil fuels. How many students would you need to poll to be 95% confident of the outcome within ±0.6% of the vote?
For estimating a proportion with a finite population, the professor’s formula from class (slide 4) is:
\[n = \frac{N}{\left(\frac{ME}{z_{\alpha/2} \cdot \sigma}\right)^2 (N-1) + 1}\]
where \(z = 1.96\) (95% confidence), \(\sigma = \sqrt{p(1-p)} = 0.5\) (maximum variance, most conservative with \(p=0.5\)), \(ME = 0.006\) (±0.6%), and \(N = 1600\).
# Parameters
z <- 1.96 # 95% confidence z-score
sig <- 0.5 # sigma = sqrt(p*(1-p)), maximized at p = 0.5
ME <- 0.006 # margin of error = 0.6%
N <- 1600 # population size
# Direct formula from class slides:
# n = N / ( (ME / (z * sigma))^2 * (N-1) + 1 )
n_finite <- N / ((ME / (z * sig))^2 * (N - 1) + 1)
cat("Finite-population n =", n_finite, "\n")## Finite-population n = 1509.523
## Round up to = 1510
The required sample size is 1510 students (closest answer choice: 1510).

Test whether earthquake occurrence in one time period is statistically independent of occurrence in the next period (same earthquake type). Test at α = 0.01.
| Earthquake (t+1) | No Earthquake (t+1) | Row Total | |
|---|---|---|---|
| Earthquake (t) | 148 | 274 | 422 |
| No Earthquake (t) | 276 | 2626 | 2902 |
| Column Total | 424 | 2900 | 3324 |
# Observed frequency matrix
obs <- matrix(c(148, 274, 276, 2626), nrow = 2, byrow = TRUE,
dimnames = list(
c("EQ at t", "No EQ at t"),
c("EQ at t+1", "No EQ at t+1")
))
cat("Observed frequencies:\n")## Observed frequencies:
## EQ at t+1 No EQ at t+1
## EQ at t 148 274
## No EQ at t 276 2626
##
## Pearson's Chi-squared test
##
## data: obs
## X-squared = 216.29, df = 1, p-value < 2.2e-16
## Expected frequencies under independence:
## EQ at t+1 No EQ at t+1
## EQ at t 53.829 368.171
## No EQ at t 370.171 2531.829
## Deviation table (Observed - Expected):
## EQ at t+1 No EQ at t+1
## EQ at t 94.171 -94.171
## No EQ at t -94.171 94.171
## Chi-square statistic: 216.2931
## Degrees of freedom: 1
## p-value: 5.820944e-49
## Critical value (α=0.01, df=1): 6.6349
cat("\nDecision:", ifelse(result$p.value < 0.01,
"REJECT H0 — earthquakes are NOT independent across periods.",
"Fail to reject H0 — no evidence of dependence."), "\n")##
## Decision: REJECT H0 — earthquakes are NOT independent across periods.
Cells with higher-than-expected occurrence (positive deviations): The diagonal cells — (EQ → EQ) and (No EQ → No EQ) — show positive deviations, meaning earthquakes tend to cluster: an earthquake period is more likely to be followed by another earthquake, and a quiet period tends to follow a quiet period.

Same test, now for earthquakes of different types following one another (α = 0.01).
| EQ (t+1) | No EQ (t+1) | Row Total | |
|---|---|---|---|
| EQ (t) | 5 | 314 | 319 |
| No EQ (t) | 314 | 2691 | 3005 |
| Column Total | 319 | 3005 | 3324 |
obs2 <- matrix(c(5, 314, 314, 2691), nrow = 2, byrow = TRUE,
dimnames = list(
c("EQ at t", "No EQ at t"),
c("EQ at t+1", "No EQ at t+1")
))
cat("Observed frequencies:\n")## Observed frequencies:
## EQ at t+1 No EQ at t+1
## EQ at t 5 314
## No EQ at t 314 2691
##
## Pearson's Chi-squared test
##
## data: obs2
## X-squared = 26.222, df = 1, p-value = 3.043e-07
## Deviation table (Observed - Expected):
## EQ at t+1 No EQ at t+1
## EQ at t -25.614 25.614
## No EQ at t 25.614 -25.614
##
## Chi-square statistic: 26.2221
## p-value: 3.04313e-07
cat("\nDecision:", ifelse(result2$p.value < 0.01,
"REJECT H0 — different-type earthquakes are NOT independent.",
"Fail to reject H0."), "\n")##
## Decision: REJECT H0 — different-type earthquakes are NOT independent.
| Cell | Same Type (Prob. 2) | Different Type (Prob. 3) |
|---|---|---|
| EQ → EQ | + (more than expected) | − (less than expected) |
| EQ → No EQ | − | + |
| No EQ → EQ | − | + |
| No EQ → No EQ | + | − |
Pattern reversal: In Problem 2 (same type), the diagonal cells are inflated — earthquakes of the same type cluster together. In Problem 3 (different types), the off-diagonal cells are inflated — a same-type earthquake is less likely to be followed by a different-type earthquake than independence would predict.
Physical interpretation: The two fault-slip types appear to be mutually inhibitory. When a fault ruptures in one direction (say, strike-slip), it relieves stress in a way that makes an immediate rupture in the other direction (dip-slip) less likely. Conversely, consecutive quiet periods remain common. This suggests the two mechanisms draw on the same regional stress budget but suppress each other in the short term.

##
## BREAST CNS COLON K562A-repro K562B-repro LEUKEMIA
## 7 5 7 1 1 6
## MCF7A-repro MCF7D-repro MELANOMA NSCLC OVARIAN PROSTATE
## 1 1 8 9 6 2
## RENAL UNKNOWN
## 9 1
cancer_counts <- table(NCI60$labs)
cancers_large <- names(cancer_counts[cancer_counts > 3])
cat("Cancers with >3 cell lines:\n")## Cancers with >3 cell lines:
## [1] "BREAST" "CNS" "COLON" "LEUKEMIA" "MELANOMA" "NSCLC" "OVARIAN"
## [8] "RENAL"
Cancers with more than 3 cell lines: BREAST (7), CNS (5), COLON (7), LEUKEMIA (6), MELANOMA (8), NSCLC (9), OVARIAN (6), RENAL (9).
# Professor's exact fdr.pck code from class
fdr <- function(v1, Q, ind = F) {
o1 <- order(v1)
pvec <- v1[o1]
m <- length(v1)
qline <- Q * c(1:m) / m
if (!ind) {
c1 <- sum(1 / (c(1:m)))
qline <- Q * c(1:m) / (m * c1)
}
plot(c(c(1:m), c(1:m)), c(qline, pvec), type = "n",
xlab = "ordering", ylab = "pvalue")
lines(c(1:m), qline)
points(c(1:m), pvec)
dv <- pvec - qline
I22 <- !is.na(dv)
I1 <- (dv[I22] < 0)
I0 <- I1
if (sum(I0) > .5) {
pmax <- max(pvec[I22][I1])
I2 <- pvec[I22] <= pmax
points(c(1:m)[I2], pvec[I2], col = "red")
out <- list(interesting = o1[I22][I2], ind = ind)
} else {
vec <- qbeta(c(.5, .95, .99, .999), 1, length(v1) + 1)
out <- list(q.5 = vec[1], q.95 = vec[2], q.99 = vec[3], q.999 = vec[4])
}
out
}
# myt: one-sample t-test against 0 (professor's exact method)
myt <- function(x) {
t.test(x, mu = 0)$p.value
}
# Test each cancer type — subset JUST that cancer's data, apply myt per gene
results <- list()
for (cancer in cancers_large) {
cat("Testing", cancer, "...\n")
cancer_data <- NCI60$data[NCI60$labs == cancer, ]
pvals <- apply(cancer_data, 2, myt)
fdr_result <- fdr(pvals, Q = 0.2, ind = FALSE)
if (is.logical(fdr_result$interesting)) {
interesting <- which(fdr_result$interesting)
} else {
interesting <- fdr_result$interesting
}
results[[cancer]] <- list(
cancer = cancer,
n_samples = nrow(cancer_data),
interesting_genes = interesting,
n_interesting = length(interesting),
pvals = pvals
)
cat(" Found", length(interesting), "interesting genes\n\n")
}## Testing BREAST ...
## Found 0 interesting genes
##
## Testing CNS ...
## Found 0 interesting genes
##
## Testing COLON ...
## Found 48 interesting genes
##
## Testing LEUKEMIA ...
## Found 25 interesting genes
##
## Testing MELANOMA ...
## Found 41 interesting genes
##
## Testing NSCLC ...
## Found 0 interesting genes
##
## Testing OVARIAN ...
## Found 0 interesting genes
##
## Testing RENAL ...
## Found 10 interesting genes
## === RESULTS AT FDR Q=0.2 (ind=FALSE) ===
for (cancer in names(results)) {
cat(sprintf("%-10s: %d genes (n=%d samples)\n",
cancer, results[[cancer]]$n_interesting, results[[cancer]]$n_samples))
}## BREAST : 0 genes (n=7 samples)
## CNS : 0 genes (n=5 samples)
## COLON : 48 genes (n=7 samples)
## LEUKEMIA : 25 genes (n=6 samples)
## MELANOMA : 41 genes (n=8 samples)
## NSCLC : 0 genes (n=9 samples)
## OVARIAN : 0 genes (n=6 samples)
## RENAL : 10 genes (n=9 samples)
Cancers with hyper/hypo active genes at FDR Q=0.2 (not independent): COLON (48 genes), MELANOMA (41 genes), LEUKEMIA (25 genes), RENAL (10 genes). BREAST, CNS, NSCLC, and OVARIAN had no significant genes at this threshold.
# Hyper = mean expression > 0 for interesting genes
hyper_genes <- list()
for (cancer in names(results)) {
cancer_data <- NCI60$data[NCI60$labs == cancer, ]
means <- colMeans(cancer_data[, results[[cancer]]$interesting_genes, drop = FALSE])
hyper_genes[[cancer]] <- results[[cancer]]$interesting_genes[means > 0]
}
cat("=== COMMON HYPER-EXPRESSED GENES ===\n")## === COMMON HYPER-EXPRESSED GENES ===
pairs <- combn(names(results), 2, simplify = FALSE)
for (pair in pairs) {
c1 <- pair[1]; c2 <- pair[2]
common_hyper <- intersect(hyper_genes[[c1]], hyper_genes[[c2]])
cat(sprintf("%s vs %s: %d common hyper-expressed genes\n", c1, c2, length(common_hyper)))
if (length(common_hyper) > 0) {
cat(" Genes:", paste(head(common_hyper, 10), collapse = ", "),
if (length(common_hyper) > 10) "...\n" else "\n")
}
}## BREAST vs CNS: 0 common hyper-expressed genes
## BREAST vs COLON: 0 common hyper-expressed genes
## BREAST vs LEUKEMIA: 0 common hyper-expressed genes
## BREAST vs MELANOMA: 0 common hyper-expressed genes
## BREAST vs NSCLC: 0 common hyper-expressed genes
## BREAST vs OVARIAN: 0 common hyper-expressed genes
## BREAST vs RENAL: 0 common hyper-expressed genes
## CNS vs COLON: 0 common hyper-expressed genes
## CNS vs LEUKEMIA: 0 common hyper-expressed genes
## CNS vs MELANOMA: 0 common hyper-expressed genes
## CNS vs NSCLC: 0 common hyper-expressed genes
## CNS vs OVARIAN: 0 common hyper-expressed genes
## CNS vs RENAL: 0 common hyper-expressed genes
## COLON vs LEUKEMIA: 0 common hyper-expressed genes
## COLON vs MELANOMA: 0 common hyper-expressed genes
## COLON vs NSCLC: 0 common hyper-expressed genes
## COLON vs OVARIAN: 0 common hyper-expressed genes
## COLON vs RENAL: 0 common hyper-expressed genes
## LEUKEMIA vs MELANOMA: 0 common hyper-expressed genes
## LEUKEMIA vs NSCLC: 0 common hyper-expressed genes
## LEUKEMIA vs OVARIAN: 0 common hyper-expressed genes
## LEUKEMIA vs RENAL: 0 common hyper-expressed genes
## MELANOMA vs NSCLC: 0 common hyper-expressed genes
## MELANOMA vs OVARIAN: 0 common hyper-expressed genes
## MELANOMA vs RENAL: 0 common hyper-expressed genes
## NSCLC vs OVARIAN: 0 common hyper-expressed genes
## NSCLC vs RENAL: 0 common hyper-expressed genes
## OVARIAN vs RENAL: 0 common hyper-expressed genes
Result: There are no common hyper-expressed genes between any pair of cancers. Each cancer type has unique gene dysregulation patterns specific to its biology.
# Hypo = mean expression < 0 for interesting genes
hypo_genes <- list()
for (cancer in names(results)) {
cancer_data <- NCI60$data[NCI60$labs == cancer, ]
means <- colMeans(cancer_data[, results[[cancer]]$interesting_genes, drop = FALSE])
hypo_genes[[cancer]] <- results[[cancer]]$interesting_genes[means < 0]
}
cat("=== COMMON HYPO-EXPRESSED GENES ===\n")## === COMMON HYPO-EXPRESSED GENES ===
for (pair in pairs) {
c1 <- pair[1]; c2 <- pair[2]
common_hypo <- intersect(hypo_genes[[c1]], hypo_genes[[c2]])
cat(sprintf("%s vs %s: %d common hypo-expressed genes\n", c1, c2, length(common_hypo)))
if (length(common_hypo) > 0) {
cat(" Genes:", paste(head(common_hypo, 10), collapse = ", "),
if (length(common_hypo) > 10) "...\n" else "\n")
}
}## BREAST vs CNS: 0 common hypo-expressed genes
## BREAST vs COLON: 0 common hypo-expressed genes
## BREAST vs LEUKEMIA: 0 common hypo-expressed genes
## BREAST vs MELANOMA: 0 common hypo-expressed genes
## BREAST vs NSCLC: 0 common hypo-expressed genes
## BREAST vs OVARIAN: 0 common hypo-expressed genes
## BREAST vs RENAL: 0 common hypo-expressed genes
## CNS vs COLON: 0 common hypo-expressed genes
## CNS vs LEUKEMIA: 0 common hypo-expressed genes
## CNS vs MELANOMA: 0 common hypo-expressed genes
## CNS vs NSCLC: 0 common hypo-expressed genes
## CNS vs OVARIAN: 0 common hypo-expressed genes
## CNS vs RENAL: 0 common hypo-expressed genes
## COLON vs LEUKEMIA: 0 common hypo-expressed genes
## COLON vs MELANOMA: 0 common hypo-expressed genes
## COLON vs NSCLC: 0 common hypo-expressed genes
## COLON vs OVARIAN: 0 common hypo-expressed genes
## COLON vs RENAL: 0 common hypo-expressed genes
## LEUKEMIA vs MELANOMA: 0 common hypo-expressed genes
## LEUKEMIA vs NSCLC: 0 common hypo-expressed genes
## LEUKEMIA vs OVARIAN: 0 common hypo-expressed genes
## LEUKEMIA vs RENAL: 0 common hypo-expressed genes
## MELANOMA vs NSCLC: 0 common hypo-expressed genes
## MELANOMA vs OVARIAN: 0 common hypo-expressed genes
## MELANOMA vs RENAL: 0 common hypo-expressed genes
## NSCLC vs OVARIAN: 0 common hypo-expressed genes
## NSCLC vs RENAL: 0 common hypo-expressed genes
## OVARIAN vs RENAL: 0 common hypo-expressed genes
Result: There are no common hypo-expressed genes between any pair of cancers. The conservative FDR=0.2 threshold and cancer-specific biology result in no shared hypo-expressed genes.
## === HYPER IN ONE, HYPO IN ANOTHER ===
for (i in 1:length(names(results))) {
for (j in (i + 1):length(names(results))) {
c1 <- names(results)[i]; c2 <- names(results)[j]
hyper_c1_hypo_c2 <- intersect(hyper_genes[[c1]], hypo_genes[[c2]])
hyper_c2_hypo_c1 <- intersect(hyper_genes[[c2]], hypo_genes[[c1]])
if (length(hyper_c1_hypo_c2) > 0) {
cat(sprintf("%s HYPER & %s HYPO: %d genes\n", c1, c2, length(hyper_c1_hypo_c2)))
cat(" Genes:", paste(head(hyper_c1_hypo_c2, 5), collapse = ", "), "\n")
}
if (length(hyper_c2_hypo_c1) > 0) {
cat(sprintf("%s HYPER & %s HYPO: %d genes\n", c2, c1, length(hyper_c2_hypo_c1)))
cat(" Genes:", paste(head(hyper_c2_hypo_c1, 5), collapse = ", "), "\n")
}
}
}## COLON HYPER & MELANOMA HYPO: 1 genes
## Genes: 247
## MELANOMA HYPER & COLON HYPO: 1 genes
## Genes: 4384
Result: Gene 247 is hyper-expressed in COLON and hypo-expressed in MELANOMA. Gene 4384 is hyper-expressed in MELANOMA and hypo-expressed in COLON. These opposite expression patterns reflect cancer-specific regulatory mechanisms between COLON and MELANOMA cell lines.
library(randomForest)
library(readr)
diabetes <- read_csv("diabetes.csv")
cat("Total rows:", nrow(diabetes), "\n\n")## Total rows: 768
# Make outcome factor (0=healthy, 1=diabetes)
diabetes$Outcome <- as.factor(diabetes$Outcome)
# 2x2 Confusion matrix function (professor's exact method)
make_2x2_diabetes <- function(pred_probs, actuals, label) {
act <- as.numeric(actuals) - 1 # factor 1,2 -> 0,1
pred <- as.numeric(pred_probs > 0.5)
TP <- sum(pred * act) # Correct positives (diabetes detected)
FN <- sum((1 - pred) * act) # False negatives (missed diabetes)
FP <- sum(pred * (1 - act)) # False positives (healthy called diabetic)
TN <- sum((1 - pred) * (1 - act)) # Correct negatives (healthy correctly identified)
mat <- matrix(c(TP, FN, FP, TN), nrow = 2, byrow = TRUE,
dimnames = list(Predicted = c("Diabetes", "Healthy"),
Actual = c("Diabetes", "Healthy")))
cat("\n", label, "\n")
cat("--------------------------------\n")
print(mat)
cat(sprintf(" Accuracy : %.4f\n", (TP + TN) / (TP + TN + FP + FN)))
cat(sprintf(" Precision: %.4f\n", TP / (TP + FP)))
cat(sprintf(" Recall : %.4f\n", TP / (TP + FN)))
return(c(TP = TP, FN = FN, FP = FP, TN = TN))
}## === ANALYSIS 1: FIRST 384 TRAIN / SECOND 384 TEST ===
train_fixed <- diabetes[1:384, ]
test_fixed <- diabetes[385:768, ]
cat("Training set:", nrow(train_fixed), "rows\n")## Training set: 384 rows
## Test set: 384 rows
## === b.i: FULL LOGISTIC MODEL ===
##
## Call:
## glm(formula = Outcome ~ ., family = binomial, data = train_fixed)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -7.934945 0.991337 -8.004 1.20e-15 ***
## Pregnancies 0.109780 0.043305 2.535 0.01124 *
## Glucose 0.030267 0.005125 5.906 3.51e-09 ***
## BloodPressure -0.006948 0.007094 -0.979 0.32737
## SkinThickness -0.001239 0.009637 -0.129 0.89769
## Insulin -0.001402 0.001233 -1.137 0.25554
## BMI 0.085641 0.019520 4.387 1.15e-05 ***
## DiabetesPedigreeFunction 1.241454 0.404700 3.068 0.00216 **
## Age 0.011778 0.013390 0.880 0.37906
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 509.09 on 383 degrees of freedom
## Residual deviance: 381.62 on 375 degrees of freedom
## AIC: 399.62
##
## Number of Fisher Scoring iterations: 5
## === b.ii: SMALLEST MODEL (backward selection) ===
##
## Call:
## glm(formula = Outcome ~ Pregnancies + Glucose + Insulin + BMI +
## DiabetesPedigreeFunction, family = binomial, data = train_fixed)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -8.086916 0.918221 -8.807 < 2e-16 ***
## Pregnancies 0.125171 0.037479 3.340 0.000839 ***
## Glucose 0.031368 0.004903 6.398 1.57e-10 ***
## Insulin -0.001587 0.001081 -1.469 0.141925
## BMI 0.081038 0.018361 4.414 1.02e-05 ***
## DiabetesPedigreeFunction 1.276991 0.399883 3.193 0.001406 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 509.09 on 383 degrees of freedom
## Residual deviance: 383.18 on 378 degrees of freedom
## AIC: 395.18
##
## Number of Fisher Scoring iterations: 5
p_full <- predict(full_logit, newdata = test_fixed, type = "response")
p_small <- predict(small_logit, newdata = test_fixed, type = "response")
cat("Full model — first 10 predicted probabilities:\n")## Full model — first 10 predicted probabilities:
## 1 2 3 4 5 6 7 8 9 10
## 0.113 0.096 0.395 0.479 0.489 0.273 0.144 0.854 0.105 0.152
##
## Small model — first 10 predicted probabilities:
## 1 2 3 4 5 6 7 8 9 10
## 0.123 0.096 0.410 0.506 0.444 0.283 0.122 0.868 0.110 0.154
rf_model <- randomForest(Outcome ~ ., data = train_fixed, ntree = 500)
p_rf <- predict(rf_model, newdata = test_fixed, type = "prob")[, 2]
cat("Random Forest — first 10 predicted probabilities:\n")## Random Forest — first 10 predicted probabilities:
## 1 2 3 4 5 6 7 8 9 10
## 0.148 0.086 0.428 0.410 0.506 0.444 0.420 0.862 0.098 0.094
## === e: 2x2 CONFUSION MATRICES ===
##
## Full Logistic
## --------------------------------
## Actual
## Predicted Diabetes Healthy
## Diabetes 72 51
## Healthy 24 237
## Accuracy : 0.8047
## Precision: 0.7500
## Recall : 0.5854
##
## Small Logistic
## --------------------------------
## Actual
## Predicted Diabetes Healthy
## Diabetes 71 52
## Healthy 26 235
## Accuracy : 0.7969
## Precision: 0.7320
## Recall : 0.5772
##
## Random Forest
## --------------------------------
## Actual
## Predicted Diabetes Healthy
## Diabetes 72 51
## Healthy 31 230
## Accuracy : 0.7865
## Precision: 0.6990
## Recall : 0.5854
results_fixed <- rbind(r_full, r_small, r_rf)
rownames(results_fixed) <- c("Full_Logistic", "Small_Logistic", "RandomForest")
cat("\n=== SUMMARY TABLE (Fixed Split) ===\n")##
## === SUMMARY TABLE (Fixed Split) ===
## TP FN FP TN
## Full_Logistic 72 51 24 237
## Small_Logistic 71 52 26 235
## RandomForest 72 51 31 230
Full Logistic is best overall. Across the fixed split: Full Logistic has the highest correct positives (TP) and lowest false positives (FP), meaning it catches the most diabetes cases while raising the fewest false alarms.
## === g. ANALYSIS 2: RANDOM SPLIT 1 ===
set.seed(123)
train1_idx <- sample(1:768, 384)
train_rand1 <- diabetes[train1_idx, ]
test_rand1 <- diabetes[-train1_idx, ]
full1 <- glm(Outcome ~ ., data = train_rand1, family = binomial)
small1 <- step(full1, direction = "backward", trace = FALSE)
rf1 <- randomForest(Outcome ~ ., data = train_rand1, ntree = 500)
p_full1 <- predict(full1, newdata = test_rand1, type = "response")
p_small1 <- predict(small1, newdata = test_rand1, type = "response")
p_rf1 <- predict(rf1, newdata = test_rand1, type = "prob")[, 2]
r_full1 <- make_2x2_diabetes(p_full1, test_rand1$Outcome, "Full Logistic")##
## Full Logistic
## --------------------------------
## Actual
## Predicted Diabetes Healthy
## Diabetes 79 55
## Healthy 24 226
## Accuracy : 0.7943
## Precision: 0.7670
## Recall : 0.5896
##
## Small Logistic
## --------------------------------
## Actual
## Predicted Diabetes Healthy
## Diabetes 79 55
## Healthy 23 227
## Accuracy : 0.7969
## Precision: 0.7745
## Recall : 0.5896
##
## Random Forest
## --------------------------------
## Actual
## Predicted Diabetes Healthy
## Diabetes 80 54
## Healthy 35 215
## Accuracy : 0.7682
## Precision: 0.6957
## Recall : 0.5970
results_rand1 <- rbind(r_full1, r_small1, r_rf1)
rownames(results_rand1) <- c("Full_Logistic", "Small_Logistic", "RandomForest")
cat("\n=== SUMMARY TABLE (Random Split 1) ===\n")##
## === SUMMARY TABLE (Random Split 1) ===
## TP FN FP TN
## Full_Logistic 79 55 24 226
## Small_Logistic 79 55 23 227
## RandomForest 80 54 35 215
##
## === g. ANALYSIS 3: RANDOM SPLIT 2 ===
set.seed(456)
train2_idx <- sample(1:768, 384)
train_rand2 <- diabetes[train2_idx, ]
test_rand2 <- diabetes[-train2_idx, ]
full2 <- glm(Outcome ~ ., data = train_rand2, family = binomial)
small2 <- step(full2, direction = "backward", trace = FALSE)
rf2 <- randomForest(Outcome ~ ., data = train_rand2, ntree = 500)
p_full2 <- predict(full2, newdata = test_rand2, type = "response")
p_small2 <- predict(small2, newdata = test_rand2, type = "response")
p_rf2 <- predict(rf2, newdata = test_rand2, type = "prob")[, 2]
r_full2 <- make_2x2_diabetes(p_full2, test_rand2$Outcome, "Full Logistic")##
## Full Logistic
## --------------------------------
## Actual
## Predicted Diabetes Healthy
## Diabetes 78 61
## Healthy 30 215
## Accuracy : 0.7630
## Precision: 0.7222
## Recall : 0.5612
##
## Small Logistic
## --------------------------------
## Actual
## Predicted Diabetes Healthy
## Diabetes 76 63
## Healthy 30 215
## Accuracy : 0.7578
## Precision: 0.7170
## Recall : 0.5468
##
## Random Forest
## --------------------------------
## Actual
## Predicted Diabetes Healthy
## Diabetes 74 65
## Healthy 40 205
## Accuracy : 0.7266
## Precision: 0.6491
## Recall : 0.5324
results_rand2 <- rbind(r_full2, r_small2, r_rf2)
rownames(results_rand2) <- c("Full_Logistic", "Small_Logistic", "RandomForest")
cat("\n=== SUMMARY TABLE (Random Split 2) ===\n")##
## === SUMMARY TABLE (Random Split 2) ===
## TP FN FP TN
## Full_Logistic 78 61 30 215
## Small_Logistic 76 63 30 215
## RandomForest 74 65 40 205
Conclusions from repeated splits:
Logistic regression is stable across all three splits (TP range 71–79), while Random Forest varies more (FP range 29–40). The Full Logistic model is consistently better than or equal to the Small Logistic model. For new patients, the Full Logistic model is the most reliable approach. Random Forest’s higher false positive rate across splits suggests it overfits slightly on this small dataset.