Problem 1 — Sample Size for School Vote (25 Points)

A school has 1,600 students voting on whether to convert the school off fossil fuels. How many students would you need to poll to be 95% confident of the outcome within ±0.6% of the vote?

Solution (by hand / formula)

For estimating a proportion with a finite population, the professor’s formula from class (slide 4) is:

\[n = \frac{N}{\left(\frac{ME}{z_{\alpha/2} \cdot \sigma}\right)^2 (N-1) + 1}\]

where \(z = 1.96\) (95% confidence), \(\sigma = \sqrt{p(1-p)} = 0.5\) (maximum variance, most conservative with \(p=0.5\)), \(ME = 0.006\) (±0.6%), and \(N = 1600\).

# Parameters
z   <- 1.96    # 95% confidence z-score
sig <- 0.5     # sigma = sqrt(p*(1-p)), maximized at p = 0.5
ME  <- 0.006   # margin of error = 0.6%
N   <- 1600    # population size

# Direct formula from class slides:
# n = N / ( (ME / (z * sigma))^2 * (N-1) + 1 )
n_finite <- N / ((ME / (z * sig))^2 * (N - 1) + 1)
cat("Finite-population n  =", n_finite, "\n")
## Finite-population n  = 1509.523
cat("Round up to          =", ceiling(n_finite), "\n")
## Round up to          = 1510
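
As a quick sanity check (my own addition, not from the slides), the same number falls out of the textbook two-step route: compute the infinite-population size \(n_0 = (z\sigma/ME)^2\) first, then shrink it with the finite-population correction.

# Sanity check (added): infinite-population n0, then the finite-population
# correction (FPC); algebraically identical to the slide formula above.
n0    <- (z * sig / ME)^2        # about 26677.8 without the FPC
n_fpc <- n0 * N / (n0 + N - 1)   # n = n0*N / (n0 + N - 1)
cat("n0 (infinite population) =", round(n0, 1), "\n")
cat("n with FPC               =", round(n_fpc, 3), "\n")  # should print 1509.523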

Answer

The computed value of 1509.52 rounds up to a required sample size of 1510 students, matching the closest answer choice.

Hand Calculation


Problem 2 — Earthquake Independence: Same Type (20 + 15 Points)

Test whether earthquake occurrence in one time period is statistically independent of occurrence in the next period (same earthquake type). Test at α = 0.01.

Observed Table

                    Earthquake (t+1)   No Earthquake (t+1)   Row Total
Earthquake (t)                   148                   274         422
No Earthquake (t)                276                  2626        2902
Column Total                     424                  2900        3324

Chi-Square Test

# Observed frequency matrix
obs <- matrix(c(148, 274, 276, 2626), nrow = 2, byrow = TRUE,
              dimnames = list(
                c("EQ at t", "No EQ at t"),
                c("EQ at t+1", "No EQ at t+1")
              ))

cat("Observed frequencies:\n")
## Observed frequencies:
print(obs)
##            EQ at t+1 No EQ at t+1
## EQ at t          148          274
## No EQ at t       276         2626
# Chi-square test
result <- chisq.test(obs, correct = FALSE)
print(result)
## 
##  Pearson's Chi-squared test
## 
## data:  obs
## X-squared = 216.29, df = 1, p-value < 2.2e-16

Expected Frequencies

cat("Expected frequencies under independence:\n")
## Expected frequencies under independence:
print(round(result$expected, 3))
##            EQ at t+1 No EQ at t+1
## EQ at t       53.829      368.171
## No EQ at t   370.171     2531.829

Deviation Table (Observed − Expected)

deviations <- obs - result$expected
cat("Deviation table (Observed - Expected):\n")
## Deviation table (Observed - Expected):
print(round(deviations, 3))
##            EQ at t+1 No EQ at t+1
## EQ at t       94.171      -94.171
## No EQ at t   -94.171       94.171

Interpretation

cat("Chi-square statistic:", round(result$statistic, 4), "\n")
## Chi-square statistic: 216.2931
cat("Degrees of freedom:  ", result$parameter, "\n")
## Degrees of freedom:   1
cat("p-value:             ", format(result$p.value, scientific = TRUE), "\n")
## p-value:              5.820944e-49
cat("Critical value (α=0.01, df=1):", round(qchisq(0.99, df=1), 4), "\n")
## Critical value (α=0.01, df=1): 6.6349
cat("\nDecision:", ifelse(result$p.value < 0.01, 
    "REJECT H0 — earthquakes are NOT independent across periods.",
    "Fail to reject H0 — no evidence of dependence."), "\n")
## 
## Decision: REJECT H0 — earthquakes are NOT independent across periods.

Cells with higher-than-expected occurrence (positive deviations): The diagonal cells — (EQ → EQ) and (No EQ → No EQ) — show positive deviations, meaning earthquakes tend to cluster: an earthquake period is more likely to be followed by another earthquake, and a quiet period tends to follow a quiet period.
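
For completeness, a minimal by-hand check (my own sketch, reusing the obs matrix from above) reproduces the statistic directly from the marginal totals:

# Hand check (added): expected counts from the marginals, then Pearson's X^2.
E  <- outer(rowSums(obs), colSums(obs)) / sum(obs)  # E_ij = row_i * col_j / n
X2 <- sum((obs - E)^2 / E)
cat("Hand-computed X^2 =", round(X2, 4), "\n")      # should match 216.2931
cat("p-value           =", pchisq(X2, df = 1, lower.tail = FALSE), "\n")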

Hand Calculation


Problem 3 — Earthquake Independence: Different Types (20 + 10 + 10 Points)

Same test, now for earthquakes of different types following one another (α = 0.01).

Observed Table

               EQ (t+1)   No EQ (t+1)   Row Total
EQ (t)                5           314         319
No EQ (t)           314          2691        3005
Column Total        319          3005        3324

Chi-Square Test

obs2 <- matrix(c(5, 314, 314, 2691), nrow = 2, byrow = TRUE,
               dimnames = list(
                 c("EQ at t", "No EQ at t"),
                 c("EQ at t+1", "No EQ at t+1")
               ))

cat("Observed frequencies:\n")
## Observed frequencies:
print(obs2)
##            EQ at t+1 No EQ at t+1
## EQ at t            5          314
## No EQ at t       314         2691
result2 <- chisq.test(obs2, correct = FALSE)
print(result2)
## 
##  Pearson's Chi-squared test
## 
## data:  obs2
## X-squared = 26.222, df = 1, p-value = 3.043e-07

Deviation Table

deviations2 <- obs2 - result2$expected
cat("Deviation table (Observed - Expected):\n")
## Deviation table (Observed - Expected):
print(round(deviations2, 3))
##            EQ at t+1 No EQ at t+1
## EQ at t      -25.614       25.614
## No EQ at t    25.614      -25.614
cat("\nChi-square statistic:", round(result2$statistic, 4), "\n")
## 
## Chi-square statistic: 26.2221
cat("p-value:             ", format(result2$p.value, scientific = TRUE), "\n")
## p-value:              3.04313e-07
cat("\nDecision:", ifelse(result2$p.value < 0.01,
    "REJECT H0 — different-type earthquakes are NOT independent.",
    "Fail to reject H0."), "\n")
## 
## Decision: REJECT H0 — different-type earthquakes are NOT independent.

Comparison of Deviation Patterns

Cell             Same Type (Prob. 2)      Different Type (Prob. 3)
EQ → EQ          + (more than expected)   − (less than expected)
EQ → No EQ       −                        +
No EQ → EQ       −                        +
No EQ → No EQ    +                        −

Pattern reversal: In Problem 2 (same type), the diagonal cells are inflated, so earthquakes of the same type cluster together. In Problem 3 (different types), the off-diagonal cells are inflated: an earthquake of one type is less likely to be immediately followed by an earthquake of the other type than independence would predict.
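
To make the reversal concrete, the Pearson residuals \((O - E)/\sqrt{E}\) stored by chisq.test carry the sign pattern directly (a small check of my own, reusing result and result2 from above):

# Added check: signed, scale-free residuals for both tables.
cat("Same type (Problem 2):\n")
print(round(result$residuals, 2))    # positive on the diagonal
cat("Different types (Problem 3):\n")
print(round(result2$residuals, 2))   # positive off the diagonal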

Physical interpretation: The two fault-slip types appear to be mutually inhibitory. When a fault ruptures in one direction (say, strike-slip), it relieves stress in a way that makes an immediate rupture in the other direction (dip-slip) less likely. Conversely, consecutive quiet periods remain common. This suggests the two mechanisms draw on the same regional stress budget but suppress each other in the short term.

Hand Calculation


Problem 4 — NCI60 Gene Expression Analysis (100 Points)

library(ISLR)
data(NCI60)

Part (a) — Cancer Types with More Than 3 Cell Lines (10 Points)

table(NCI60$labs)
## 
##      BREAST         CNS       COLON K562A-repro K562B-repro    LEUKEMIA 
##           7           5           7           1           1           6 
## MCF7A-repro MCF7D-repro    MELANOMA       NSCLC     OVARIAN    PROSTATE 
##           1           1           8           9           6           2 
##       RENAL     UNKNOWN 
##           9           1
cancer_counts <- table(NCI60$labs)
cancers_large <- names(cancer_counts[cancer_counts > 3])
cat("Cancers with >3 cell lines:\n")
## Cancers with >3 cell lines:
print(cancers_large)
## [1] "BREAST"   "CNS"      "COLON"    "LEUKEMIA" "MELANOMA" "NSCLC"    "OVARIAN" 
## [8] "RENAL"

Cancers with more than 3 cell lines: BREAST (7), CNS (5), COLON (7), LEUKEMIA (6), MELANOMA (8), NSCLC (9), OVARIAN (6), RENAL (9).

Part (b) — FDR Analysis for Hyper/Hypo Active Genes (40 Points)

# Using the professor's exact fdr.pck code provided in class: the fdr()
# function and myt(), a one-sample t-test against 0. The not-independent
# correction (ind = FALSE) is applied as required.
fdr <- function(v1, Q, ind = F) {
  o1   <- order(v1)
  pvec <- v1[o1]
  m     <- length(v1)
  qline <- Q * c(1:m) / m
  if (!ind) {
    c1    <- sum(1 / (c(1:m)))
    qline <- Q * c(1:m) / (m * c1)
  }
  plot(c(c(1:m), c(1:m)), c(qline, pvec), type = "n",
       xlab = "ordering", ylab = "pvalue")
  lines(c(1:m), qline)
  points(c(1:m), pvec)
  dv  <- pvec - qline
  I22 <- !is.na(dv)
  I1  <- (dv[I22] < 0)
  I0  <- I1
  if (sum(I0) > .5) {
    pmax <- max(pvec[I22][I1])
    I2   <- pvec[I22] <= pmax
    points(c(1:m)[I2], pvec[I2], col = "red")
    out <- list(interesting = o1[I22][I2], ind = ind)
  } else {
    vec <- qbeta(c(.5, .95, .99, .999), 1, length(v1) + 1)
    out <- list(q.5 = vec[1], q.95 = vec[2], q.99 = vec[3], q.999 = vec[4])
  }
  out
}

# myt: one-sample t-test against 0 (professor's exact method)
myt <- function(x) {
  t.test(x, mu = 0)$p.value
}

# Test each cancer type — subset JUST that cancer's data, apply myt per gene
results <- list()
for (cancer in cancers_large) {
  cat("Testing", cancer, "...\n")
  
  cancer_data <- NCI60$data[NCI60$labs == cancer, ]
  pvals <- apply(cancer_data, 2, myt)
  fdr_result <- fdr(pvals, Q = 0.2, ind = FALSE)
  
  if (is.logical(fdr_result$interesting)) {
    interesting <- which(fdr_result$interesting)
  } else {
    interesting <- fdr_result$interesting
  }
  
  results[[cancer]] <- list(
    cancer            = cancer,
    n_samples         = nrow(cancer_data),
    interesting_genes = interesting,
    n_interesting     = length(interesting),
    pvals             = pvals
  )
  cat("  Found", length(interesting), "interesting genes\n\n")
}
## Testing BREAST ...

##   Found 0 interesting genes
## 
## Testing CNS ...

##   Found 0 interesting genes
## 
## Testing COLON ...

##   Found 48 interesting genes
## 
## Testing LEUKEMIA ...

##   Found 25 interesting genes
## 
## Testing MELANOMA ...

##   Found 41 interesting genes
## 
## Testing NSCLC ...

##   Found 0 interesting genes
## 
## Testing OVARIAN ...

##   Found 0 interesting genes
## 
## Testing RENAL ...

##   Found 10 interesting genes
# Summary of results
cat("=== RESULTS AT FDR Q=0.2 (ind=FALSE) ===\n")
## === RESULTS AT FDR Q=0.2 (ind=FALSE) ===
for (cancer in names(results)) {
  cat(sprintf("%-10s: %d genes (n=%d samples)\n",
              cancer, results[[cancer]]$n_interesting, results[[cancer]]$n_samples))
}
## BREAST    : 0 genes (n=7 samples)
## CNS       : 0 genes (n=5 samples)
## COLON     : 48 genes (n=7 samples)
## LEUKEMIA  : 25 genes (n=6 samples)
## MELANOMA  : 41 genes (n=8 samples)
## NSCLC     : 0 genes (n=9 samples)
## OVARIAN   : 0 genes (n=6 samples)
## RENAL     : 10 genes (n=9 samples)

Cancers with hyper/hypo active genes at FDR Q=0.2 (not independent): COLON (48 genes), MELANOMA (41 genes), LEUKEMIA (25 genes), RENAL (10 genes). BREAST, CNS, NSCLC, and OVARIAN had no significant genes at this threshold.
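
As a cross-check (my own addition, not part of the assignment), the ind = FALSE branch of fdr() applies the sum(1/i) penalty of the Benjamini–Yekutieli step-up procedure, which base R exposes as p.adjust(method = "BY"); counting adjusted p-values at or below Q = 0.2 should reproduce the gene counts above.

# Cross-check (added): BY-adjusted p-values at Q = 0.2.
for (cancer in names(results)) {
  n_by <- sum(p.adjust(results[[cancer]]$pvals, method = "BY") <= 0.2)
  cat(sprintf("%-10s: %d genes via BY-adjusted p-values\n", cancer, n_by))
}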

Part (c) — Common Hyper-Expressed Genes Between Each Pair (50 Points)

# Hyper = mean expression > 0 for interesting genes
hyper_genes <- list()
for (cancer in names(results)) {
  cancer_data <- NCI60$data[NCI60$labs == cancer, ]
  means <- colMeans(cancer_data[, results[[cancer]]$interesting_genes, drop = FALSE])
  hyper_genes[[cancer]] <- results[[cancer]]$interesting_genes[means > 0]
}

cat("=== COMMON HYPER-EXPRESSED GENES ===\n")
## === COMMON HYPER-EXPRESSED GENES ===
pairs <- combn(names(results), 2, simplify = FALSE)
for (pair in pairs) {
  c1 <- pair[1]; c2 <- pair[2]
  common_hyper <- intersect(hyper_genes[[c1]], hyper_genes[[c2]])
  cat(sprintf("%s vs %s: %d common hyper-expressed genes\n", c1, c2, length(common_hyper)))
  if (length(common_hyper) > 0) {
    cat("  Genes:", paste(head(common_hyper, 10), collapse = ", "),
        if (length(common_hyper) > 10) "...\n" else "\n")
  }
}
## BREAST vs CNS: 0 common hyper-expressed genes
## BREAST vs COLON: 0 common hyper-expressed genes
## BREAST vs LEUKEMIA: 0 common hyper-expressed genes
## BREAST vs MELANOMA: 0 common hyper-expressed genes
## BREAST vs NSCLC: 0 common hyper-expressed genes
## BREAST vs OVARIAN: 0 common hyper-expressed genes
## BREAST vs RENAL: 0 common hyper-expressed genes
## CNS vs COLON: 0 common hyper-expressed genes
## CNS vs LEUKEMIA: 0 common hyper-expressed genes
## CNS vs MELANOMA: 0 common hyper-expressed genes
## CNS vs NSCLC: 0 common hyper-expressed genes
## CNS vs OVARIAN: 0 common hyper-expressed genes
## CNS vs RENAL: 0 common hyper-expressed genes
## COLON vs LEUKEMIA: 0 common hyper-expressed genes
## COLON vs MELANOMA: 0 common hyper-expressed genes
## COLON vs NSCLC: 0 common hyper-expressed genes
## COLON vs OVARIAN: 0 common hyper-expressed genes
## COLON vs RENAL: 0 common hyper-expressed genes
## LEUKEMIA vs MELANOMA: 0 common hyper-expressed genes
## LEUKEMIA vs NSCLC: 0 common hyper-expressed genes
## LEUKEMIA vs OVARIAN: 0 common hyper-expressed genes
## LEUKEMIA vs RENAL: 0 common hyper-expressed genes
## MELANOMA vs NSCLC: 0 common hyper-expressed genes
## MELANOMA vs OVARIAN: 0 common hyper-expressed genes
## MELANOMA vs RENAL: 0 common hyper-expressed genes
## NSCLC vs OVARIAN: 0 common hyper-expressed genes
## NSCLC vs RENAL: 0 common hyper-expressed genes
## OVARIAN vs RENAL: 0 common hyper-expressed genes

Result: There are no common hyper-expressed genes between any pair of cancers at this threshold. Each cancer type's set of dysregulated genes appears specific to its own biology.
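
A compact alternative (a sketch of my own, not required output) computes every pairwise overlap at once; the diagonal holds each cancer's own hyper-gene count.

# Added sketch: symmetric matrix of pairwise hyper-gene overlap counts.
overlap_hyper <- sapply(hyper_genes, function(a)
  sapply(hyper_genes, function(b) length(intersect(a, b))))
print(overlap_hyper)   # off-diagonal entries should all be 0 here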

Part (d) — Common Hypo-Expressed Genes Between Each Pair

# Hypo = mean expression < 0 for interesting genes
hypo_genes <- list()
for (cancer in names(results)) {
  cancer_data <- NCI60$data[NCI60$labs == cancer, ]
  means <- colMeans(cancer_data[, results[[cancer]]$interesting_genes, drop = FALSE])
  hypo_genes[[cancer]] <- results[[cancer]]$interesting_genes[means < 0]
}

cat("=== COMMON HYPO-EXPRESSED GENES ===\n")
## === COMMON HYPO-EXPRESSED GENES ===
for (pair in pairs) {
  c1 <- pair[1]; c2 <- pair[2]
  common_hypo <- intersect(hypo_genes[[c1]], hypo_genes[[c2]])
  cat(sprintf("%s vs %s: %d common hypo-expressed genes\n", c1, c2, length(common_hypo)))
  if (length(common_hypo) > 0) {
    cat("  Genes:", paste(head(common_hypo, 10), collapse = ", "),
        if (length(common_hypo) > 10) "...\n" else "\n")
  }
}
## BREAST vs CNS: 0 common hypo-expressed genes
## BREAST vs COLON: 0 common hypo-expressed genes
## BREAST vs LEUKEMIA: 0 common hypo-expressed genes
## BREAST vs MELANOMA: 0 common hypo-expressed genes
## BREAST vs NSCLC: 0 common hypo-expressed genes
## BREAST vs OVARIAN: 0 common hypo-expressed genes
## BREAST vs RENAL: 0 common hypo-expressed genes
## CNS vs COLON: 0 common hypo-expressed genes
## CNS vs LEUKEMIA: 0 common hypo-expressed genes
## CNS vs MELANOMA: 0 common hypo-expressed genes
## CNS vs NSCLC: 0 common hypo-expressed genes
## CNS vs OVARIAN: 0 common hypo-expressed genes
## CNS vs RENAL: 0 common hypo-expressed genes
## COLON vs LEUKEMIA: 0 common hypo-expressed genes
## COLON vs MELANOMA: 0 common hypo-expressed genes
## COLON vs NSCLC: 0 common hypo-expressed genes
## COLON vs OVARIAN: 0 common hypo-expressed genes
## COLON vs RENAL: 0 common hypo-expressed genes
## LEUKEMIA vs MELANOMA: 0 common hypo-expressed genes
## LEUKEMIA vs NSCLC: 0 common hypo-expressed genes
## LEUKEMIA vs OVARIAN: 0 common hypo-expressed genes
## LEUKEMIA vs RENAL: 0 common hypo-expressed genes
## MELANOMA vs NSCLC: 0 common hypo-expressed genes
## MELANOMA vs OVARIAN: 0 common hypo-expressed genes
## MELANOMA vs RENAL: 0 common hypo-expressed genes
## NSCLC vs OVARIAN: 0 common hypo-expressed genes
## NSCLC vs RENAL: 0 common hypo-expressed genes
## OVARIAN vs RENAL: 0 common hypo-expressed genes

Result: There are likewise no common hypo-expressed genes between any pair of cancers. The conservative not-independent (Benjamini–Yekutieli) correction at Q = 0.2, combined with cancer-specific biology, leaves no shared hypo-expressed genes.

Part (e) — Genes Hyper in One Cancer, Hypo in Another

cat("=== HYPER IN ONE, HYPO IN ANOTHER ===\n")
## === HYPER IN ONE, HYPO IN ANOTHER ===
# Loop over unordered pairs only; capping the outer index at length - 1
# avoids the reversed (i + 1):length sequence when i reaches the last cancer.
cancers <- names(results)
for (i in 1:(length(cancers) - 1)) {
  for (j in (i + 1):length(cancers)) {
    c1 <- cancers[i]; c2 <- cancers[j]
    
    hyper_c1_hypo_c2 <- intersect(hyper_genes[[c1]], hypo_genes[[c2]])
    hyper_c2_hypo_c1 <- intersect(hyper_genes[[c2]], hypo_genes[[c1]])
    
    if (length(hyper_c1_hypo_c2) > 0) {
      cat(sprintf("%s HYPER & %s HYPO: %d genes\n", c1, c2, length(hyper_c1_hypo_c2)))
      cat("  Genes:", paste(head(hyper_c1_hypo_c2, 5), collapse = ", "), "\n")
    }
    if (length(hyper_c2_hypo_c1) > 0) {
      cat(sprintf("%s HYPER & %s HYPO: %d genes\n", c2, c1, length(hyper_c2_hypo_c1)))
      cat("  Genes:", paste(head(hyper_c2_hypo_c1, 5), collapse = ", "), "\n")
    }
  }
}
## COLON HYPER & MELANOMA HYPO: 1 genes
##   Genes: 247 
## MELANOMA HYPER & COLON HYPO: 1 genes
##   Genes: 4384

Result: Gene 247 is hyper-expressed in COLON and hypo-expressed in MELANOMA; gene 4384 shows the reverse, hyper-expressed in MELANOMA and hypo-expressed in COLON. These opposite expression patterns point to cancer-specific regulation of these genes in the COLON and MELANOMA cell lines.
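
To see the reversal in the raw data, one can pull the mean expression of the two genes in each cancer (my own illustration, using the gene column indices reported above):

# Added illustration: opposite-signed means expected for genes 247 and 4384
# (positive in the hyper-expressed cancer, negative in the hypo-expressed one).
for (g in c(247, 4384)) {
  cat(sprintf("Gene %d: COLON mean = %.3f, MELANOMA mean = %.3f\n", g,
              mean(NCI60$data[NCI60$labs == "COLON", g]),
              mean(NCI60$data[NCI60$labs == "MELANOMA", g])))
}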


Problem 5 — Diabetes Prediction Models (100 Points)

library(randomForest)
library(readr)

diabetes <- read_csv("diabetes.csv")
cat("Total rows:", nrow(diabetes), "\n\n")
## Total rows: 768
# Make outcome factor (0=healthy, 1=diabetes)
diabetes$Outcome <- as.factor(diabetes$Outcome)

# 2x2 Confusion matrix function (professor's exact method)
make_2x2_diabetes <- function(pred_probs, actuals, label) {
  act  <- as.numeric(actuals) - 1       # factor 1,2 -> 0,1
  pred <- as.numeric(pred_probs > 0.5)
  
  TP <- sum(pred * act)              # Correct positives (diabetes detected)
  FN <- sum((1 - pred) * act)        # False negatives (missed diabetes)
  FP <- sum(pred * (1 - act))        # False positives (healthy called diabetic)
  TN <- sum((1 - pred) * (1 - act))  # Correct negatives (healthy correctly identified)
  
  mat <- matrix(c(TP, FN, FP, TN), nrow = 2, byrow = TRUE,
                dimnames = list(Predicted = c("Diabetes", "Healthy"),
                                Actual    = c("Diabetes", "Healthy")))
  cat("\n", label, "\n")
  cat("--------------------------------\n")
  print(mat)
  cat(sprintf("  Accuracy : %.4f\n", (TP + TN) / (TP + TN + FP + FN)))
  cat(sprintf("  Precision: %.4f\n", TP / (TP + FP)))
  cat(sprintf("  Recall   : %.4f\n", TP / (TP + FN)))
  
  return(c(TP = TP, FN = FN, FP = FP, TN = TN))
}

Part (a) — Train/Test Split: First 384 vs. Last 384

cat("=== ANALYSIS 1: FIRST 384 TRAIN / SECOND 384 TEST ===\n\n")
## === ANALYSIS 1: FIRST 384 TRAIN / SECOND 384 TEST ===
train_fixed <- diabetes[1:384, ]
test_fixed  <- diabetes[385:768, ]

cat("Training set:", nrow(train_fixed), "rows\n")
## Training set: 384 rows
cat("Test set:    ", nrow(test_fixed),  "rows\n")
## Test set:     384 rows
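
Since this split follows row order rather than random assignment, it is worth confirming that the two halves have comparable outcome rates (a check of my own, not part of the assignment):

# Added check: class balance in each half of the non-random split.
cat("Train outcome proportions:\n")
print(round(prop.table(table(train_fixed$Outcome)), 3))
cat("Test outcome proportions:\n")
print(round(prop.table(table(test_fixed$Outcome)), 3))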

Part (b-i) — Full Logistic Regression Model

cat("=== b.i: FULL LOGISTIC MODEL ===\n")
## === b.i: FULL LOGISTIC MODEL ===
full_logit <- glm(Outcome ~ ., data = train_fixed, family = binomial)
summary(full_logit)
## 
## Call:
## glm(formula = Outcome ~ ., family = binomial, data = train_fixed)
## 
## Coefficients:
##                           Estimate Std. Error z value Pr(>|z|)    
## (Intercept)              -7.934945   0.991337  -8.004 1.20e-15 ***
## Pregnancies               0.109780   0.043305   2.535  0.01124 *  
## Glucose                   0.030267   0.005125   5.906 3.51e-09 ***
## BloodPressure            -0.006948   0.007094  -0.979  0.32737    
## SkinThickness            -0.001239   0.009637  -0.129  0.89769    
## Insulin                  -0.001402   0.001233  -1.137  0.25554    
## BMI                       0.085641   0.019520   4.387 1.15e-05 ***
## DiabetesPedigreeFunction  1.241454   0.404700   3.068  0.00216 ** 
## Age                       0.011778   0.013390   0.880  0.37906    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 509.09  on 383  degrees of freedom
## Residual deviance: 381.62  on 375  degrees of freedom
## AIC: 399.62
## 
## Number of Fisher Scoring iterations: 5

Part (b-ii) — Backward Selection (p < 0.05)

cat("=== b.ii: SMALLEST MODEL (backward selection) ===\n")
## === b.ii: SMALLEST MODEL (backward selection) ===
small_logit <- step(full_logit, direction = "backward", trace = FALSE)
summary(small_logit)
## 
## Call:
## glm(formula = Outcome ~ Pregnancies + Glucose + Insulin + BMI + 
##     DiabetesPedigreeFunction, family = binomial, data = train_fixed)
## 
## Coefficients:
##                           Estimate Std. Error z value Pr(>|z|)    
## (Intercept)              -8.086916   0.918221  -8.807  < 2e-16 ***
## Pregnancies               0.125171   0.037479   3.340 0.000839 ***
## Glucose                   0.031368   0.004903   6.398 1.57e-10 ***
## Insulin                  -0.001587   0.001081  -1.469 0.141925    
## BMI                       0.081038   0.018361   4.414 1.02e-05 ***
## DiabetesPedigreeFunction  1.276991   0.399883   3.193 0.001406 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 509.09  on 383  degrees of freedom
## Residual deviance: 383.18  on 378  degrees of freedom
## AIC: 395.18
## 
## Number of Fisher Scoring iterations: 5

Part (c) — Predict Response on Test Data

p_full  <- predict(full_logit,  newdata = test_fixed, type = "response")
p_small <- predict(small_logit, newdata = test_fixed, type = "response")

cat("Full model — first 10 predicted probabilities:\n")
## Full model — first 10 predicted probabilities:
print(round(head(p_full, 10), 3))
##     1     2     3     4     5     6     7     8     9    10 
## 0.113 0.096 0.395 0.479 0.489 0.273 0.144 0.854 0.105 0.152
cat("\nSmall model — first 10 predicted probabilities:\n")
## 
## Small model — first 10 predicted probabilities:
print(round(head(p_small, 10), 3))
##     1     2     3     4     5     6     7     8     9    10 
## 0.123 0.096 0.410 0.506 0.444 0.283 0.122 0.868 0.110 0.154

Part (d) — Random Forest Model

rf_model <- randomForest(Outcome ~ ., data = train_fixed, ntree = 500)
p_rf     <- predict(rf_model, newdata = test_fixed, type = "prob")[, 2]

cat("Random Forest — first 10 predicted probabilities:\n")
## Random Forest — first 10 predicted probabilities:
print(round(head(p_rf, 10), 3))
##     1     2     3     4     5     6     7     8     9    10 
## 0.174 0.094 0.446 0.378 0.534 0.426 0.440 0.816 0.130 0.102

Part (e) — 2x2 Confusion Matrices for All 3 Models

cat("=== e: 2x2 CONFUSION MATRICES ===\n")
## === e: 2x2 CONFUSION MATRICES ===
r_full  <- make_2x2_diabetes(p_full,  test_fixed$Outcome, "Full Logistic")
## 
##  Full Logistic 
## --------------------------------
##           Actual
## Predicted  Diabetes Healthy
##   Diabetes       72      51
##   Healthy        24     237
##   Accuracy : 0.8047
##   Precision: 0.7500
##   Recall   : 0.5854
r_small <- make_2x2_diabetes(p_small, test_fixed$Outcome, "Small Logistic")
## 
##  Small Logistic 
## --------------------------------
##           Actual
## Predicted  Diabetes Healthy
##   Diabetes       71      52
##   Healthy        26     235
##   Accuracy : 0.7969
##   Precision: 0.7320
##   Recall   : 0.5772
r_rf    <- make_2x2_diabetes(p_rf,    test_fixed$Outcome, "Random Forest")
## 
##  Random Forest 
## --------------------------------
##           Actual
## Predicted  Diabetes Healthy
##   Diabetes       70      53
##   Healthy        29     232
##   Accuracy : 0.7865
##   Precision: 0.7071
##   Recall   : 0.5691
results_fixed <- rbind(r_full, r_small, r_rf)
rownames(results_fixed) <- c("Full_Logistic", "Small_Logistic", "RandomForest")
cat("\n=== SUMMARY TABLE (Fixed Split) ===\n")
## 
## === SUMMARY TABLE (Fixed Split) ===
print(results_fixed)
##                TP FN FP  TN
## Full_Logistic  72 51 24 237
## Small_Logistic 71 52 26 235
## RandomForest   70 53 29 232

Part (f) — Which Model Performs Best?

Full Logistic performs best overall. On the fixed split it has the highest true-positive count (TP) and the lowest false-positive count (FP), meaning it catches the most diabetes cases while raising the fewest false alarms.

  • If identifying diabetes is the priority → Full Logistic (highest TP)
  • If avoiding false alarms is the priority → Full Logistic (lowest FP)
  • Random Forest underperforms here likely due to the small training set (384 rows); Random Forest typically needs larger datasets for its bagging advantage to emerge.
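
The three headline metrics can also be derived straight from the stored counts rather than re-read from the printouts (a small sketch of my own, reusing results_fixed):

# Added sketch: accuracy/precision/recall from the TP/FN/FP/TN summary table.
metrics <- t(apply(results_fixed, 1, function(r) {
  TP <- r["TP"]; FN <- r["FN"]; FP <- r["FP"]; TN <- r["TN"]
  c(Accuracy  = unname((TP + TN) / (TP + FN + FP + TN)),
    Precision = unname(TP / (TP + FP)),
    Recall    = unname(TP / (TP + FN)))
}))
print(round(metrics, 4))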

Part (g) — Two Random Splits

# Random Split 1
cat("=== g. ANALYSIS 2: RANDOM SPLIT 1 ===\n")
## === g. ANALYSIS 2: RANDOM SPLIT 1 ===
set.seed(123)
train1_idx  <- sample(1:768, 384)
train_rand1 <- diabetes[train1_idx, ]
test_rand1  <- diabetes[-train1_idx, ]

full1  <- glm(Outcome ~ ., data = train_rand1, family = binomial)
small1 <- step(full1, direction = "backward", trace = FALSE)
rf1    <- randomForest(Outcome ~ ., data = train_rand1, ntree = 500)

p_full1  <- predict(full1,  newdata = test_rand1, type = "response")
p_small1 <- predict(small1, newdata = test_rand1, type = "response")
p_rf1    <- predict(rf1,    newdata = test_rand1, type = "prob")[, 2]

r_full1  <- make_2x2_diabetes(p_full1,  test_rand1$Outcome, "Full Logistic")
## 
##  Full Logistic 
## --------------------------------
##           Actual
## Predicted  Diabetes Healthy
##   Diabetes       79      55
##   Healthy        24     226
##   Accuracy : 0.7943
##   Precision: 0.7670
##   Recall   : 0.5896
r_small1 <- make_2x2_diabetes(p_small1, test_rand1$Outcome, "Small Logistic")
## 
##  Small Logistic 
## --------------------------------
##           Actual
## Predicted  Diabetes Healthy
##   Diabetes       79      55
##   Healthy        23     227
##   Accuracy : 0.7969
##   Precision: 0.7745
##   Recall   : 0.5896
r_rf1    <- make_2x2_diabetes(p_rf1,    test_rand1$Outcome, "Random Forest")
## 
##  Random Forest 
## --------------------------------
##           Actual
## Predicted  Diabetes Healthy
##   Diabetes       80      54
##   Healthy        35     215
##   Accuracy : 0.7682
##   Precision: 0.6957
##   Recall   : 0.5970
results_rand1 <- rbind(r_full1, r_small1, r_rf1)
rownames(results_rand1) <- c("Full_Logistic", "Small_Logistic", "RandomForest")
cat("\n=== SUMMARY TABLE (Random Split 1) ===\n")
## 
## === SUMMARY TABLE (Random Split 1) ===
print(results_rand1)
##                TP FN FP  TN
## Full_Logistic  79 55 24 226
## Small_Logistic 79 55 23 227
## RandomForest   80 54 35 215
# Random Split 2
cat("\n=== g. ANALYSIS 3: RANDOM SPLIT 2 ===\n")
## 
## === g. ANALYSIS 3: RANDOM SPLIT 2 ===
set.seed(456)
train2_idx  <- sample(1:768, 384)
train_rand2 <- diabetes[train2_idx, ]
test_rand2  <- diabetes[-train2_idx, ]

full2  <- glm(Outcome ~ ., data = train_rand2, family = binomial)
small2 <- step(full2, direction = "backward", trace = FALSE)
rf2    <- randomForest(Outcome ~ ., data = train_rand2, ntree = 500)

p_full2  <- predict(full2,  newdata = test_rand2, type = "response")
p_small2 <- predict(small2, newdata = test_rand2, type = "response")
p_rf2    <- predict(rf2,    newdata = test_rand2, type = "prob")[, 2]

r_full2  <- make_2x2_diabetes(p_full2,  test_rand2$Outcome, "Full Logistic")
## 
##  Full Logistic 
## --------------------------------
##           Actual
## Predicted  Diabetes Healthy
##   Diabetes       78      61
##   Healthy        30     215
##   Accuracy : 0.7630
##   Precision: 0.7222
##   Recall   : 0.5612
r_small2 <- make_2x2_diabetes(p_small2, test_rand2$Outcome, "Small Logistic")
## 
##  Small Logistic 
## --------------------------------
##           Actual
## Predicted  Diabetes Healthy
##   Diabetes       76      63
##   Healthy        30     215
##   Accuracy : 0.7578
##   Precision: 0.7170
##   Recall   : 0.5468
r_rf2    <- make_2x2_diabetes(p_rf2,    test_rand2$Outcome, "Random Forest")
## 
##  Random Forest 
## --------------------------------
##           Actual
## Predicted  Diabetes Healthy
##   Diabetes       74      65
##   Healthy        40     205
##   Accuracy : 0.7266
##   Precision: 0.6491
##   Recall   : 0.5324
results_rand2 <- rbind(r_full2, r_small2, r_rf2)
rownames(results_rand2) <- c("Full_Logistic", "Small_Logistic", "RandomForest")
cat("\n=== SUMMARY TABLE (Random Split 2) ===\n")
## 
## === SUMMARY TABLE (Random Split 2) ===
print(results_rand2)
##                TP FN FP  TN
## Full_Logistic  78 61 30 215
## Small_Logistic 76 63 30 215
## RandomForest   74 65 40 205

Conclusions from repeated splits:

Across all three splits, the logistic models proved more consistent than Random Forest. While all models performed similarly in detecting true diabetes cases, Random Forest showed greater variability in false positives as the training data changed. This pattern suggests that for a dataset of this size, logistic regression — whether full or reduced — is the more dependable modeling choice for predicting new patient outcomes.
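
To quantify that stability claim (my own addition), the false-positive counts from the three summary tables can be lined up side by side:

# Added sketch: FP counts per model across the three splits.
fp_by_split <- cbind(Fixed = results_fixed[, "FP"],
                     Rand1 = results_rand1[, "FP"],
                     Rand2 = results_rand2[, "FP"])
print(fp_by_split)
cat("FP spread (max - min) per model:\n")
print(apply(fp_by_split, 1, function(x) diff(range(x))))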


Results Analysis Summary

Part (a) — Data Split

I split the 768 observations into two equal halves — rows 1 through 384 became my training set, and rows 385 through 768 were held out for testing.

Part (b-i) — Full Logistic Model Findings

Running the full logistic regression revealed that Glucose level was the single strongest predictor of diabetes (p=3.51e-09), followed by BMI (p=1.15e-05) and the Diabetes Pedigree Function (p=0.00216). Pregnancies also reached significance. Blood pressure, skin thickness, insulin, and age did not contribute meaningfully once the other variables were accounted for.

Part (b-ii) — Backward Selection Findings

The backward selection process trimmed the model down to five predictors: Glucose, BMI, DiabetesPedigreeFunction, Pregnancies, and Insulin. Interestingly, Insulin stayed in the model despite a p-value above 0.05 — this is because step() optimizes AIC rather than p-values directly, so variables with borderline significance can remain if they still improve overall fit.
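
For contrast, a strictly p-value-driven backward pass (a hypothetical drop_p helper of my own, not the professor's method; it assumes every predictor is a single numeric term so coefficient names match term names) would keep dropping the least significant predictor until everything clears 0.05, which would also eliminate Insulin:

# Added sketch: p-value-based backward elimination (hypothetical helper).
drop_p <- function(fit, alpha = 0.05) {
  repeat {
    p <- coef(summary(fit))[-1, "Pr(>|z|)"]   # skip the intercept
    if (max(p) < alpha) return(fit)
    worst <- names(which.max(p))              # least significant predictor
    fit <- update(fit, as.formula(paste(". ~ . -", worst)))
  }
}
# Usage: summary(drop_p(small_logit))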

Part (c & d) — Model Performance on Test Data

Model            Accuracy   Precision   Recall
Full Logistic       80.5%       75.0%    58.5%
Small Logistic      79.7%       73.2%    57.7%
Random Forest       78.7%       70.7%    56.9%

All three models achieved similar overall accuracy, but differed in how they balanced precision and recall.

Part (e) — 2x2 Table Summary (Fixed Split)

Model            TP   FN   FP    TN
Full Logistic    72   51   24   237
Small Logistic   71   52   26   235
Random Forest    70   53   29   232

Part (f) — Best Model Discussion

No single model dominates across every metric, and the answer depends on what matters most clinically. The Full Logistic model had both the most true positives (72) and the fewest false positives (24), making it the safest choice when avoiding unnecessary follow-up is important. Random Forest trailed slightly on both counts (70 true positives, 29 false positives). For a screening tool where missing a real case is the bigger concern, the differences between models are small enough that simplicity favors the logistic approach.

Part (g) — Insights from Random Splits

Split                 Full TP/FP   Small TP/FP   RF TP/FP
Fixed (rows 1-384)         72/24         71/26      70/29
Random Split 1             79/24         79/23      80/35
Random Split 2             78/30         76/30      74/40

The random splits revealed something important: with only 768 observations, which patients land in training versus testing matters a great deal. The logistic models held steady across all three splits, while Random Forest's false-positive count jumped considerably from split to split. This instability points to overfitting: Random Forest builds complex trees that fit whatever data they were trained on but generalize less reliably to new observations when the dataset is this small.