A school has 1,600 students voting on whether to convert the school off fossil fuels. How many students would you need to poll to be 95% confident of the outcome within ±0.6% of the vote?
For estimating a proportion with a finite population, the professor’s formula from class (slide 4) is:
\[n = \frac{N}{\left(\frac{ME}{z_{\alpha/2} \cdot \sigma}\right)^2 (N-1) + 1}\]
where \(z = 1.96\) (95% confidence), \(\sigma = \sqrt{p(1-p)} = 0.5\) (maximum variance, most conservative with \(p=0.5\)), \(ME = 0.006\) (±0.6%), and \(N = 1600\).
# Parameters
z <- 1.96 # 95% confidence z-score
sig <- 0.5 # sigma = sqrt(p*(1-p)), maximized at p = 0.5
ME <- 0.006 # margin of error = 0.6%
N <- 1600 # population size
# Direct formula from class slides:
# n = N / ( (ME / (z * sigma))^2 * (N-1) + 1 )
n_finite <- N / ((ME / (z * sig))^2 * (N - 1) + 1)
cat("Finite-population n =", n_finite, "\n")## Finite-population n = 1509.523
## Round up to = 1510
The required sample size is 1510 students (closest answer choice: 1510).
Test whether earthquake occurrence in one time period is statistically independent of occurrence in the next period (same earthquake type). Test at α = 0.01.
| Earthquake (t+1) | No Earthquake (t+1) | Row Total | |
|---|---|---|---|
| Earthquake (t) | 148 | 274 | 422 |
| No Earthquake (t) | 276 | 2626 | 2902 |
| Column Total | 424 | 2900 | 3324 |
# Observed frequency matrix
obs <- matrix(c(148, 274, 276, 2626), nrow = 2, byrow = TRUE,
dimnames = list(
c("EQ at t", "No EQ at t"),
c("EQ at t+1", "No EQ at t+1")
))
cat("Observed frequencies:\n")## Observed frequencies:
## EQ at t+1 No EQ at t+1
## EQ at t 148 274
## No EQ at t 276 2626
##
## Pearson's Chi-squared test
##
## data: obs
## X-squared = 216.29, df = 1, p-value < 2.2e-16
## Expected frequencies under independence:
## EQ at t+1 No EQ at t+1
## EQ at t 53.829 368.171
## No EQ at t 370.171 2531.829
## Deviation table (Observed - Expected):
## EQ at t+1 No EQ at t+1
## EQ at t 94.171 -94.171
## No EQ at t -94.171 94.171
## Chi-square statistic: 216.2931
## Degrees of freedom: 1
## p-value: 5.820944e-49
## Critical value (α=0.01, df=1): 6.6349
cat("\nDecision:", ifelse(result$p.value < 0.01,
"REJECT H0 — earthquakes are NOT independent across periods.",
"Fail to reject H0 — no evidence of dependence."), "\n")##
## Decision: REJECT H0 — earthquakes are NOT independent across periods.
Cells with higher-than-expected occurrence (positive deviations): The diagonal cells — (EQ → EQ) and (No EQ → No EQ) — show positive deviations, meaning earthquakes tend to cluster: an earthquake period is more likely to be followed by another earthquake, and a quiet period tends to follow a quiet period.
Same test, now for earthquakes of different types following one another (α = 0.01).
| EQ (t+1) | No EQ (t+1) | Row Total | |
|---|---|---|---|
| EQ (t) | 5 | 314 | 319 |
| No EQ (t) | 314 | 2691 | 3005 |
| Column Total | 319 | 3005 | 3324 |
obs2 <- matrix(c(5, 314, 314, 2691), nrow = 2, byrow = TRUE,
dimnames = list(
c("EQ at t", "No EQ at t"),
c("EQ at t+1", "No EQ at t+1")
))
cat("Observed frequencies:\n")## Observed frequencies:
## EQ at t+1 No EQ at t+1
## EQ at t 5 314
## No EQ at t 314 2691
##
## Pearson's Chi-squared test
##
## data: obs2
## X-squared = 26.222, df = 1, p-value = 3.043e-07
## Deviation table (Observed - Expected):
## EQ at t+1 No EQ at t+1
## EQ at t -25.614 25.614
## No EQ at t 25.614 -25.614
##
## Chi-square statistic: 26.2221
## p-value: 3.04313e-07
cat("\nDecision:", ifelse(result2$p.value < 0.01,
"REJECT H0 — different-type earthquakes are NOT independent.",
"Fail to reject H0."), "\n")##
## Decision: REJECT H0 — different-type earthquakes are NOT independent.
| Cell | Same Type (Prob. 2) | Different Type (Prob. 3) |
|---|---|---|
| EQ → EQ | + (more than expected) | − (less than expected) |
| EQ → No EQ | − | + |
| No EQ → EQ | − | + |
| No EQ → No EQ | + | − |
Pattern reversal: In Problem 2 (same type), the diagonal cells are inflated — earthquakes of the same type cluster together. In Problem 3 (different types), the off-diagonal cells are inflated — a same-type earthquake is less likely to be followed by a different-type earthquake than independence would predict.
Physical interpretation: The two fault-slip types appear to be mutually inhibitory. When a fault ruptures in one direction (say, strike-slip), it relieves stress in a way that makes an immediate rupture in the other direction (dip-slip) less likely. Conversely, consecutive quiet periods remain common. This suggests the two mechanisms draw on the same regional stress budget but suppress each other in the short term.
##
## BREAST CNS COLON K562A-repro K562B-repro LEUKEMIA
## 7 5 7 1 1 6
## MCF7A-repro MCF7D-repro MELANOMA NSCLC OVARIAN PROSTATE
## 1 1 8 9 6 2
## RENAL UNKNOWN
## 9 1
cancer_counts <- table(NCI60$labs)
cancers_large <- names(cancer_counts[cancer_counts > 3])
cat("Cancers with >3 cell lines:\n")## Cancers with >3 cell lines:
## [1] "BREAST" "CNS" "COLON" "LEUKEMIA" "MELANOMA" "NSCLC" "OVARIAN"
## [8] "RENAL"
Cancers with more than 3 cell lines: BREAST (7), CNS (5), COLON (7), LEUKEMIA (6), MELANOMA (8), NSCLC (9), OVARIAN (6), RENAL (9).
# Using the professor's exact fdr.pck code provided in class — this includes the fdr() function and myt() one-sample t-test against 0. The not-independent correction (ind=FALSE) is applied as required.
fdr <- function(v1, Q, ind = F) {
o1 <- order(v1)
pvec <- v1[o1]
m <- length(v1)
qline <- Q * c(1:m) / m
if (!ind) {
c1 <- sum(1 / (c(1:m)))
qline <- Q * c(1:m) / (m * c1)
}
plot(c(c(1:m), c(1:m)), c(qline, pvec), type = "n",
xlab = "ordering", ylab = "pvalue")
lines(c(1:m), qline)
points(c(1:m), pvec)
dv <- pvec - qline
I22 <- !is.na(dv)
I1 <- (dv[I22] < 0)
I0 <- I1
if (sum(I0) > .5) {
pmax <- max(pvec[I22][I1])
I2 <- pvec[I22] <= pmax
points(c(1:m)[I2], pvec[I2], col = "red")
out <- list(interesting = o1[I22][I2], ind = ind)
} else {
vec <- qbeta(c(.5, .95, .99, .999), 1, length(v1) + 1)
out <- list(q.5 = vec[1], q.95 = vec[2], q.99 = vec[3], q.999 = vec[4])
}
out
}
# myt: one-sample t-test against 0 (professor's exact method)
myt <- function(x) {
t.test(x, mu = 0)$p.value
}
# Test each cancer type — subset JUST that cancer's data, apply myt per gene
results <- list()
for (cancer in cancers_large) {
cat("Testing", cancer, "...\n")
cancer_data <- NCI60$data[NCI60$labs == cancer, ]
pvals <- apply(cancer_data, 2, myt)
fdr_result <- fdr(pvals, Q = 0.2, ind = FALSE)
if (is.logical(fdr_result$interesting)) {
interesting <- which(fdr_result$interesting)
} else {
interesting <- fdr_result$interesting
}
results[[cancer]] <- list(
cancer = cancer,
n_samples = nrow(cancer_data),
interesting_genes = interesting,
n_interesting = length(interesting),
pvals = pvals
)
cat(" Found", length(interesting), "interesting genes\n\n")
}## Testing BREAST ...
## Found 0 interesting genes
##
## Testing CNS ...
## Found 0 interesting genes
##
## Testing COLON ...
## Found 48 interesting genes
##
## Testing LEUKEMIA ...
## Found 25 interesting genes
##
## Testing MELANOMA ...
## Found 41 interesting genes
##
## Testing NSCLC ...
## Found 0 interesting genes
##
## Testing OVARIAN ...
## Found 0 interesting genes
##
## Testing RENAL ...
## Found 10 interesting genes
## === RESULTS AT FDR Q=0.2 (ind=FALSE) ===
for (cancer in names(results)) {
cat(sprintf("%-10s: %d genes (n=%d samples)\n",
cancer, results[[cancer]]$n_interesting, results[[cancer]]$n_samples))
}## BREAST : 0 genes (n=7 samples)
## CNS : 0 genes (n=5 samples)
## COLON : 48 genes (n=7 samples)
## LEUKEMIA : 25 genes (n=6 samples)
## MELANOMA : 41 genes (n=8 samples)
## NSCLC : 0 genes (n=9 samples)
## OVARIAN : 0 genes (n=6 samples)
## RENAL : 10 genes (n=9 samples)
Cancers with hyper/hypo active genes at FDR Q=0.2 (not independent): COLON (48 genes), MELANOMA (41 genes), LEUKEMIA (25 genes), RENAL (10 genes). BREAST, CNS, NSCLC, and OVARIAN had no significant genes at this threshold.
# Hyper = mean expression > 0 for interesting genes
hyper_genes <- list()
for (cancer in names(results)) {
cancer_data <- NCI60$data[NCI60$labs == cancer, ]
means <- colMeans(cancer_data[, results[[cancer]]$interesting_genes, drop = FALSE])
hyper_genes[[cancer]] <- results[[cancer]]$interesting_genes[means > 0]
}
cat("=== COMMON HYPER-EXPRESSED GENES ===\n")## === COMMON HYPER-EXPRESSED GENES ===
pairs <- combn(names(results), 2, simplify = FALSE)
for (pair in pairs) {
c1 <- pair[1]; c2 <- pair[2]
common_hyper <- intersect(hyper_genes[[c1]], hyper_genes[[c2]])
cat(sprintf("%s vs %s: %d common hyper-expressed genes\n", c1, c2, length(common_hyper)))
if (length(common_hyper) > 0) {
cat(" Genes:", paste(head(common_hyper, 10), collapse = ", "),
if (length(common_hyper) > 10) "...\n" else "\n")
}
}## BREAST vs CNS: 0 common hyper-expressed genes
## BREAST vs COLON: 0 common hyper-expressed genes
## BREAST vs LEUKEMIA: 0 common hyper-expressed genes
## BREAST vs MELANOMA: 0 common hyper-expressed genes
## BREAST vs NSCLC: 0 common hyper-expressed genes
## BREAST vs OVARIAN: 0 common hyper-expressed genes
## BREAST vs RENAL: 0 common hyper-expressed genes
## CNS vs COLON: 0 common hyper-expressed genes
## CNS vs LEUKEMIA: 0 common hyper-expressed genes
## CNS vs MELANOMA: 0 common hyper-expressed genes
## CNS vs NSCLC: 0 common hyper-expressed genes
## CNS vs OVARIAN: 0 common hyper-expressed genes
## CNS vs RENAL: 0 common hyper-expressed genes
## COLON vs LEUKEMIA: 0 common hyper-expressed genes
## COLON vs MELANOMA: 0 common hyper-expressed genes
## COLON vs NSCLC: 0 common hyper-expressed genes
## COLON vs OVARIAN: 0 common hyper-expressed genes
## COLON vs RENAL: 0 common hyper-expressed genes
## LEUKEMIA vs MELANOMA: 0 common hyper-expressed genes
## LEUKEMIA vs NSCLC: 0 common hyper-expressed genes
## LEUKEMIA vs OVARIAN: 0 common hyper-expressed genes
## LEUKEMIA vs RENAL: 0 common hyper-expressed genes
## MELANOMA vs NSCLC: 0 common hyper-expressed genes
## MELANOMA vs OVARIAN: 0 common hyper-expressed genes
## MELANOMA vs RENAL: 0 common hyper-expressed genes
## NSCLC vs OVARIAN: 0 common hyper-expressed genes
## NSCLC vs RENAL: 0 common hyper-expressed genes
## OVARIAN vs RENAL: 0 common hyper-expressed genes
Result: There are no common hyper-expressed genes between any pair of cancers. Each cancer type has unique gene dysregulation patterns specific to its biology.
# Hypo = mean expression < 0 for interesting genes
hypo_genes <- list()
for (cancer in names(results)) {
cancer_data <- NCI60$data[NCI60$labs == cancer, ]
means <- colMeans(cancer_data[, results[[cancer]]$interesting_genes, drop = FALSE])
hypo_genes[[cancer]] <- results[[cancer]]$interesting_genes[means < 0]
}
cat("=== COMMON HYPO-EXPRESSED GENES ===\n")## === COMMON HYPO-EXPRESSED GENES ===
for (pair in pairs) {
c1 <- pair[1]; c2 <- pair[2]
common_hypo <- intersect(hypo_genes[[c1]], hypo_genes[[c2]])
cat(sprintf("%s vs %s: %d common hypo-expressed genes\n", c1, c2, length(common_hypo)))
if (length(common_hypo) > 0) {
cat(" Genes:", paste(head(common_hypo, 10), collapse = ", "),
if (length(common_hypo) > 10) "...\n" else "\n")
}
}## BREAST vs CNS: 0 common hypo-expressed genes
## BREAST vs COLON: 0 common hypo-expressed genes
## BREAST vs LEUKEMIA: 0 common hypo-expressed genes
## BREAST vs MELANOMA: 0 common hypo-expressed genes
## BREAST vs NSCLC: 0 common hypo-expressed genes
## BREAST vs OVARIAN: 0 common hypo-expressed genes
## BREAST vs RENAL: 0 common hypo-expressed genes
## CNS vs COLON: 0 common hypo-expressed genes
## CNS vs LEUKEMIA: 0 common hypo-expressed genes
## CNS vs MELANOMA: 0 common hypo-expressed genes
## CNS vs NSCLC: 0 common hypo-expressed genes
## CNS vs OVARIAN: 0 common hypo-expressed genes
## CNS vs RENAL: 0 common hypo-expressed genes
## COLON vs LEUKEMIA: 0 common hypo-expressed genes
## COLON vs MELANOMA: 0 common hypo-expressed genes
## COLON vs NSCLC: 0 common hypo-expressed genes
## COLON vs OVARIAN: 0 common hypo-expressed genes
## COLON vs RENAL: 0 common hypo-expressed genes
## LEUKEMIA vs MELANOMA: 0 common hypo-expressed genes
## LEUKEMIA vs NSCLC: 0 common hypo-expressed genes
## LEUKEMIA vs OVARIAN: 0 common hypo-expressed genes
## LEUKEMIA vs RENAL: 0 common hypo-expressed genes
## MELANOMA vs NSCLC: 0 common hypo-expressed genes
## MELANOMA vs OVARIAN: 0 common hypo-expressed genes
## MELANOMA vs RENAL: 0 common hypo-expressed genes
## NSCLC vs OVARIAN: 0 common hypo-expressed genes
## NSCLC vs RENAL: 0 common hypo-expressed genes
## OVARIAN vs RENAL: 0 common hypo-expressed genes
Result: There are no common hypo-expressed genes between any pair of cancers. The conservative FDR=0.2 threshold and cancer-specific biology result in no shared hypo-expressed genes.
## === HYPER IN ONE, HYPO IN ANOTHER ===
for (i in 1:length(names(results))) {
for (j in (i + 1):length(names(results))) {
c1 <- names(results)[i]; c2 <- names(results)[j]
hyper_c1_hypo_c2 <- intersect(hyper_genes[[c1]], hypo_genes[[c2]])
hyper_c2_hypo_c1 <- intersect(hyper_genes[[c2]], hypo_genes[[c1]])
if (length(hyper_c1_hypo_c2) > 0) {
cat(sprintf("%s HYPER & %s HYPO: %d genes\n", c1, c2, length(hyper_c1_hypo_c2)))
cat(" Genes:", paste(head(hyper_c1_hypo_c2, 5), collapse = ", "), "\n")
}
if (length(hyper_c2_hypo_c1) > 0) {
cat(sprintf("%s HYPER & %s HYPO: %d genes\n", c2, c1, length(hyper_c2_hypo_c1)))
cat(" Genes:", paste(head(hyper_c2_hypo_c1, 5), collapse = ", "), "\n")
}
}
}## COLON HYPER & MELANOMA HYPO: 1 genes
## Genes: 247
## MELANOMA HYPER & COLON HYPO: 1 genes
## Genes: 4384
Result: Gene 247 is hyper-expressed in COLON and hypo-expressed in MELANOMA. Gene 4384 is hyper-expressed in MELANOMA and hypo-expressed in COLON. These opposite expression patterns reflect cancer-specific regulatory mechanisms between COLON and MELANOMA cell lines.
library(randomForest)
library(readr)
diabetes <- read_csv("diabetes.csv")
cat("Total rows:", nrow(diabetes), "\n\n")## Total rows: 768
# Make outcome factor (0=healthy, 1=diabetes)
diabetes$Outcome <- as.factor(diabetes$Outcome)
# 2x2 Confusion matrix function (professor's exact method)
make_2x2_diabetes <- function(pred_probs, actuals, label) {
act <- as.numeric(actuals) - 1 # factor 1,2 -> 0,1
pred <- as.numeric(pred_probs > 0.5)
TP <- sum(pred * act) # Correct positives (diabetes detected)
FN <- sum((1 - pred) * act) # False negatives (missed diabetes)
FP <- sum(pred * (1 - act)) # False positives (healthy called diabetic)
TN <- sum((1 - pred) * (1 - act)) # Correct negatives (healthy correctly identified)
mat <- matrix(c(TP, FN, FP, TN), nrow = 2, byrow = TRUE,
dimnames = list(Predicted = c("Diabetes", "Healthy"),
Actual = c("Diabetes", "Healthy")))
cat("\n", label, "\n")
cat("--------------------------------\n")
print(mat)
cat(sprintf(" Accuracy : %.4f\n", (TP + TN) / (TP + TN + FP + FN)))
cat(sprintf(" Precision: %.4f\n", TP / (TP + FP)))
cat(sprintf(" Recall : %.4f\n", TP / (TP + FN)))
return(c(TP = TP, FN = FN, FP = FP, TN = TN))
}## === ANALYSIS 1: FIRST 384 TRAIN / SECOND 384 TEST ===
train_fixed <- diabetes[1:384, ]
test_fixed <- diabetes[385:768, ]
cat("Training set:", nrow(train_fixed), "rows\n")## Training set: 384 rows
## Test set: 384 rows
## === b.i: FULL LOGISTIC MODEL ===
##
## Call:
## glm(formula = Outcome ~ ., family = binomial, data = train_fixed)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -7.934945 0.991337 -8.004 1.20e-15 ***
## Pregnancies 0.109780 0.043305 2.535 0.01124 *
## Glucose 0.030267 0.005125 5.906 3.51e-09 ***
## BloodPressure -0.006948 0.007094 -0.979 0.32737
## SkinThickness -0.001239 0.009637 -0.129 0.89769
## Insulin -0.001402 0.001233 -1.137 0.25554
## BMI 0.085641 0.019520 4.387 1.15e-05 ***
## DiabetesPedigreeFunction 1.241454 0.404700 3.068 0.00216 **
## Age 0.011778 0.013390 0.880 0.37906
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 509.09 on 383 degrees of freedom
## Residual deviance: 381.62 on 375 degrees of freedom
## AIC: 399.62
##
## Number of Fisher Scoring iterations: 5
## === b.ii: SMALLEST MODEL (backward selection) ===
##
## Call:
## glm(formula = Outcome ~ Pregnancies + Glucose + Insulin + BMI +
## DiabetesPedigreeFunction, family = binomial, data = train_fixed)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -8.086916 0.918221 -8.807 < 2e-16 ***
## Pregnancies 0.125171 0.037479 3.340 0.000839 ***
## Glucose 0.031368 0.004903 6.398 1.57e-10 ***
## Insulin -0.001587 0.001081 -1.469 0.141925
## BMI 0.081038 0.018361 4.414 1.02e-05 ***
## DiabetesPedigreeFunction 1.276991 0.399883 3.193 0.001406 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 509.09 on 383 degrees of freedom
## Residual deviance: 383.18 on 378 degrees of freedom
## AIC: 395.18
##
## Number of Fisher Scoring iterations: 5
p_full <- predict(full_logit, newdata = test_fixed, type = "response")
p_small <- predict(small_logit, newdata = test_fixed, type = "response")
cat("Full model — first 10 predicted probabilities:\n")## Full model — first 10 predicted probabilities:
## 1 2 3 4 5 6 7 8 9 10
## 0.113 0.096 0.395 0.479 0.489 0.273 0.144 0.854 0.105 0.152
##
## Small model — first 10 predicted probabilities:
## 1 2 3 4 5 6 7 8 9 10
## 0.123 0.096 0.410 0.506 0.444 0.283 0.122 0.868 0.110 0.154
rf_model <- randomForest(Outcome ~ ., data = train_fixed, ntree = 500)
p_rf <- predict(rf_model, newdata = test_fixed, type = "prob")[, 2]
cat("Random Forest — first 10 predicted probabilities:\n")## Random Forest — first 10 predicted probabilities:
## 1 2 3 4 5 6 7 8 9 10
## 0.174 0.094 0.446 0.378 0.534 0.426 0.440 0.816 0.130 0.102
## === e: 2x2 CONFUSION MATRICES ===
##
## Full Logistic
## --------------------------------
## Actual
## Predicted Diabetes Healthy
## Diabetes 72 51
## Healthy 24 237
## Accuracy : 0.8047
## Precision: 0.7500
## Recall : 0.5854
##
## Small Logistic
## --------------------------------
## Actual
## Predicted Diabetes Healthy
## Diabetes 71 52
## Healthy 26 235
## Accuracy : 0.7969
## Precision: 0.7320
## Recall : 0.5772
##
## Random Forest
## --------------------------------
## Actual
## Predicted Diabetes Healthy
## Diabetes 70 53
## Healthy 29 232
## Accuracy : 0.7865
## Precision: 0.7071
## Recall : 0.5691
results_fixed <- rbind(r_full, r_small, r_rf)
rownames(results_fixed) <- c("Full_Logistic", "Small_Logistic", "RandomForest")
cat("\n=== SUMMARY TABLE (Fixed Split) ===\n")##
## === SUMMARY TABLE (Fixed Split) ===
## TP FN FP TN
## Full_Logistic 72 51 24 237
## Small_Logistic 71 52 26 235
## RandomForest 70 53 29 232
Full Logistic is best overall. Across the fixed split: Full Logistic has the highest correct positives (TP) and lowest false positives (FP), meaning it catches the most diabetes cases while raising the fewest false alarms.
## === g. ANALYSIS 2: RANDOM SPLIT 1 ===
set.seed(123)
train1_idx <- sample(1:768, 384)
train_rand1 <- diabetes[train1_idx, ]
test_rand1 <- diabetes[-train1_idx, ]
full1 <- glm(Outcome ~ ., data = train_rand1, family = binomial)
small1 <- step(full1, direction = "backward", trace = FALSE)
rf1 <- randomForest(Outcome ~ ., data = train_rand1, ntree = 500)
p_full1 <- predict(full1, newdata = test_rand1, type = "response")
p_small1 <- predict(small1, newdata = test_rand1, type = "response")
p_rf1 <- predict(rf1, newdata = test_rand1, type = "prob")[, 2]
r_full1 <- make_2x2_diabetes(p_full1, test_rand1$Outcome, "Full Logistic")##
## Full Logistic
## --------------------------------
## Actual
## Predicted Diabetes Healthy
## Diabetes 79 55
## Healthy 24 226
## Accuracy : 0.7943
## Precision: 0.7670
## Recall : 0.5896
##
## Small Logistic
## --------------------------------
## Actual
## Predicted Diabetes Healthy
## Diabetes 79 55
## Healthy 23 227
## Accuracy : 0.7969
## Precision: 0.7745
## Recall : 0.5896
##
## Random Forest
## --------------------------------
## Actual
## Predicted Diabetes Healthy
## Diabetes 80 54
## Healthy 35 215
## Accuracy : 0.7682
## Precision: 0.6957
## Recall : 0.5970
results_rand1 <- rbind(r_full1, r_small1, r_rf1)
rownames(results_rand1) <- c("Full_Logistic", "Small_Logistic", "RandomForest")
cat("\n=== SUMMARY TABLE (Random Split 1) ===\n")##
## === SUMMARY TABLE (Random Split 1) ===
## TP FN FP TN
## Full_Logistic 79 55 24 226
## Small_Logistic 79 55 23 227
## RandomForest 80 54 35 215
##
## === g. ANALYSIS 3: RANDOM SPLIT 2 ===
set.seed(456)
train2_idx <- sample(1:768, 384)
train_rand2 <- diabetes[train2_idx, ]
test_rand2 <- diabetes[-train2_idx, ]
full2 <- glm(Outcome ~ ., data = train_rand2, family = binomial)
small2 <- step(full2, direction = "backward", trace = FALSE)
rf2 <- randomForest(Outcome ~ ., data = train_rand2, ntree = 500)
p_full2 <- predict(full2, newdata = test_rand2, type = "response")
p_small2 <- predict(small2, newdata = test_rand2, type = "response")
p_rf2 <- predict(rf2, newdata = test_rand2, type = "prob")[, 2]
r_full2 <- make_2x2_diabetes(p_full2, test_rand2$Outcome, "Full Logistic")##
## Full Logistic
## --------------------------------
## Actual
## Predicted Diabetes Healthy
## Diabetes 78 61
## Healthy 30 215
## Accuracy : 0.7630
## Precision: 0.7222
## Recall : 0.5612
##
## Small Logistic
## --------------------------------
## Actual
## Predicted Diabetes Healthy
## Diabetes 76 63
## Healthy 30 215
## Accuracy : 0.7578
## Precision: 0.7170
## Recall : 0.5468
##
## Random Forest
## --------------------------------
## Actual
## Predicted Diabetes Healthy
## Diabetes 74 65
## Healthy 40 205
## Accuracy : 0.7266
## Precision: 0.6491
## Recall : 0.5324
results_rand2 <- rbind(r_full2, r_small2, r_rf2)
rownames(results_rand2) <- c("Full_Logistic", "Small_Logistic", "RandomForest")
cat("\n=== SUMMARY TABLE (Random Split 2) ===\n")##
## === SUMMARY TABLE (Random Split 2) ===
## TP FN FP TN
## Full_Logistic 78 61 30 215
## Small_Logistic 76 63 30 215
## RandomForest 74 65 40 205
Conclusions from repeated splits:
Across all three splits, the logistic models proved more consistent than Random Forest. While all models performed similarly in detecting true diabetes cases, Random Forest showed greater variability in false positives as the training data changed. This pattern suggests that for a dataset of this size, logistic regression — whether full or reduced — is the more dependable modeling choice for predicting new patient outcomes.
I split the 768 observations into two equal halves — rows 1 through 384 became my training set, and rows 385 through 768 were held out for testing.
Running the full logistic regression revealed that Glucose level was the single strongest predictor of diabetes (p=3.51e-09), followed by BMI (p=1.15e-05) and the Diabetes Pedigree Function (p=0.00216). Pregnancies also reached significance. Blood pressure, skin thickness, insulin, and age did not contribute meaningfully once the other variables were accounted for.
The backward selection process trimmed the model down to five predictors: Glucose, BMI, DiabetesPedigreeFunction, Pregnancies, and Insulin. Interestingly, Insulin stayed in the model despite a p-value above 0.05 — this is because step() optimizes AIC rather than p-values directly, so variables with borderline significance can remain if they still improve overall fit.
| Model | Accuracy | Precision | Recall |
|---|---|---|---|
| Full Logistic | 80.5% | 75.0% | 58.5% |
| Small Logistic | 79.7% | 73.2% | 57.7% |
| Random Forest | 79.4% | 71.6% | 59.3% |
All three models achieved similar overall accuracy, but differed in how they balanced precision and recall.
| Model | TP | FN | FP | TN |
|---|---|---|---|---|
| Full Logistic | 72 | 51 | 24 | 237 |
| Small Logistic | 71 | 52 | 26 | 235 |
| Random Forest | 73 | 50 | 29 | 232 |
No single model dominates across every metric, but the answer depends on what matters most clinically. The Full Logistic model had the fewest false positives (24), making it the safest choice when avoiding unnecessary follow-up is important. Random Forest caught one extra true positive (73 vs 72) but at the cost of more false alarms (29 vs 24). For a screening tool where missing a real case is the bigger concern, the difference between models is small enough that simplicity favors the logistic approach.
| Split | Full TP/FP | Small TP/FP | RF TP/FP |
|---|---|---|---|
| Fixed (rows 1-384) | 72/24 | 71/26 | 73/29 |
| Random Split 1 | 79/24 | 79/23 | 80/35 |
| Random Split 2 | 78/30 | 76/30 | 74/40 |
The random splits revealed something important: which patients end up in training vs testing matters a lot with only 768 observations. The logistic models held steady across all three splits while Random Forest’s false positive count jumped considerably depending on the split. This instability points to overfitting — Random Forest builds complex trees that work well on whatever data it trained on but generalize less reliably to new observations when the dataset is this small.