setwd(“~/Downloads”) library(randomForest) library(readr)

diabetes <- read_csv(“diabetes.csv”)

Problem 1 — Sample Size for School Vote (25 Points)

A school has 1,600 students voting on whether to convert the school off fossil fuels. How many students would you need to poll to be 95% confident of the outcome within ±0.6% of the vote?

Solution (by hand / formula)

For estimating a proportion with a finite population, the professor’s formula from class (slide 4) is:

\[n = \frac{N}{\left(\frac{ME}{z_{\alpha/2} \cdot \sigma}\right)^2 (N-1) + 1}\]

where \(z = 1.96\) (95% confidence), \(\sigma = \sqrt{p(1-p)} = 0.5\) (maximum variance, most conservative with \(p=0.5\)), \(ME = 0.006\) (±0.6%), and \(N = 1600\).

# Parameters
z   <- 1.96    # 95% confidence z-score
sig <- 0.5     # sigma = sqrt(p*(1-p)), maximized at p = 0.5
ME  <- 0.006   # margin of error = 0.6%
N   <- 1600    # population size

# Direct formula from class slides:
# n = N / ( (ME / (z * sigma))^2 * (N-1) + 1 )
n_finite <- N / ((ME / (z * sig))^2 * (N - 1) + 1)
cat("Finite-population n  =", n_finite, "\n")
## Finite-population n  = 1509.523
cat("Round up to          =", ceiling(n_finite), "\n")
## Round up to          = 1510

Answer

The required sample size is 1510 students (closest answer choice: 1510).


Problem 2 — Earthquake Independence: Same Type (20 + 15 Points)

Test whether earthquake occurrence in one time period is statistically independent of occurrence in the next period (same earthquake type). Test at α = 0.01.

Observed Table

Earthquake (t+1) No Earthquake (t+1) Row Total
Earthquake (t) 148 274 422
No Earthquake (t) 276 2626 2902
Column Total 424 2900 3324

Chi-Square Test

# Observed frequency matrix
obs <- matrix(c(148, 274, 276, 2626), nrow = 2, byrow = TRUE,
              dimnames = list(
                c("EQ at t", "No EQ at t"),
                c("EQ at t+1", "No EQ at t+1")
              ))

cat("Observed frequencies:\n")
## Observed frequencies:
print(obs)
##            EQ at t+1 No EQ at t+1
## EQ at t          148          274
## No EQ at t       276         2626
# Chi-square test
result <- chisq.test(obs, correct = FALSE)
print(result)
## 
##  Pearson's Chi-squared test
## 
## data:  obs
## X-squared = 216.29, df = 1, p-value < 2.2e-16

Expected Frequencies

cat("Expected frequencies under independence:\n")
## Expected frequencies under independence:
print(round(result$expected, 3))
##            EQ at t+1 No EQ at t+1
## EQ at t       53.829      368.171
## No EQ at t   370.171     2531.829

Deviation Table (Observed − Expected)

deviations <- obs - result$expected
cat("Deviation table (Observed - Expected):\n")
## Deviation table (Observed - Expected):
print(round(deviations, 3))
##            EQ at t+1 No EQ at t+1
## EQ at t       94.171      -94.171
## No EQ at t   -94.171       94.171

Interpretation

cat("Chi-square statistic:", round(result$statistic, 4), "\n")
## Chi-square statistic: 216.2931
cat("Degrees of freedom:  ", result$parameter, "\n")
## Degrees of freedom:   1
cat("p-value:             ", format(result$p.value, scientific = TRUE), "\n")
## p-value:              5.820944e-49
cat("Critical value (α=0.01, df=1):", round(qchisq(0.99, df=1), 4), "\n")
## Critical value (α=0.01, df=1): 6.6349
cat("\nDecision:", ifelse(result$p.value < 0.01, 
    "REJECT H0 — earthquakes are NOT independent across periods.",
    "Fail to reject H0 — no evidence of dependence."), "\n")
## 
## Decision: REJECT H0 — earthquakes are NOT independent across periods.

Cells with higher-than-expected occurrence (positive deviations): The diagonal cells — (EQ → EQ) and (No EQ → No EQ) — show positive deviations, meaning earthquakes tend to cluster: an earthquake period is more likely to be followed by another earthquake, and a quiet period tends to follow a quiet period.


Problem 3 — Earthquake Independence: Different Types (20 + 10 + 10 Points)

Same test, now for earthquakes of different types following one another (α = 0.01).

Observed Table

EQ (t+1) No EQ (t+1) Row Total
EQ (t) 5 314 319
No EQ (t) 314 2691 3005
Column Total 319 3005 3324

Chi-Square Test

obs2 <- matrix(c(5, 314, 314, 2691), nrow = 2, byrow = TRUE,
               dimnames = list(
                 c("EQ at t", "No EQ at t"),
                 c("EQ at t+1", "No EQ at t+1")
               ))

cat("Observed frequencies:\n")
## Observed frequencies:
print(obs2)
##            EQ at t+1 No EQ at t+1
## EQ at t            5          314
## No EQ at t       314         2691
result2 <- chisq.test(obs2, correct = FALSE)
print(result2)
## 
##  Pearson's Chi-squared test
## 
## data:  obs2
## X-squared = 26.222, df = 1, p-value = 3.043e-07

Deviation Table

deviations2 <- obs2 - result2$expected
cat("Deviation table (Observed - Expected):\n")
## Deviation table (Observed - Expected):
print(round(deviations2, 3))
##            EQ at t+1 No EQ at t+1
## EQ at t      -25.614       25.614
## No EQ at t    25.614      -25.614
cat("\nChi-square statistic:", round(result2$statistic, 4), "\n")
## 
## Chi-square statistic: 26.2221
cat("p-value:             ", format(result2$p.value, scientific = TRUE), "\n")
## p-value:              3.04313e-07
cat("\nDecision:", ifelse(result2$p.value < 0.01,
    "REJECT H0 — different-type earthquakes are NOT independent.",
    "Fail to reject H0."), "\n")
## 
## Decision: REJECT H0 — different-type earthquakes are NOT independent.

Comparison of Deviation Patterns

Cell Same Type (Prob. 2) Different Type (Prob. 3)
EQ → EQ + (more than expected) (less than expected)
EQ → No EQ +
No EQ → EQ +
No EQ → No EQ +

Pattern reversal: In Problem 2 (same type), the diagonal cells are inflated — earthquakes of the same type cluster together. In Problem 3 (different types), the off-diagonal cells are inflated — a same-type earthquake is less likely to be followed by a different-type earthquake than independence would predict.

Physical interpretation: The two fault-slip types appear to be mutually inhibitory. When a fault ruptures in one direction (say, strike-slip), it relieves stress in a way that makes an immediate rupture in the other direction (dip-slip) less likely. Conversely, consecutive quiet periods remain common. This suggests the two mechanisms draw on the same regional stress budget but suppress each other in the short term.


Problem 4 — NCI60 Gene Expression Analysis (100 Points)

# Load required libraries
library(ISLR)
library(dplyr)

# Load the NCI60 dataset
data(NCI60)
nci_data   <- NCI60$data    # 64 cell lines x 6830 genes
nci_labels <- NCI60$labs    # cancer type labels

cat("Dimensions of NCI60 data:", dim(nci_data), "\n")
## Dimensions of NCI60 data: 64 6830
cat("Cancer types present:\n")
## Cancer types present:
print(table(nci_labels))
## nci_labels
##      BREAST         CNS       COLON K562A-repro K562B-repro    LEUKEMIA 
##           7           5           7           1           1           6 
## MCF7A-repro MCF7D-repro    MELANOMA       NSCLC     OVARIAN    PROSTATE 
##           1           1           8           9           6           2 
##       RENAL     UNKNOWN 
##           9           1

Part (a) — Cancer Types with More Than 3 Cell Lines (10 Points)

type_counts <- table(nci_labels)
selected_types <- names(type_counts[type_counts > 3])

cat("Cancer types with more than 3 cell lines:\n")
## Cancer types with more than 3 cell lines:
print(type_counts[type_counts > 3])
## nci_labels
##   BREAST      CNS    COLON LEUKEMIA MELANOMA    NSCLC  OVARIAN    RENAL 
##        7        5        7        6        8        9        6        9
cat("\nSelected types:", paste(selected_types, collapse = ", "), "\n")
## 
## Selected types: BREAST, CNS, COLON, LEUKEMIA, MELANOMA, NSCLC, OVARIAN, RENAL

Part (b) — FDR Analysis for Hyper/Hypo Active Genes (40 Points)

# FDR function (from class) — NOT independent version (slide 27)
# Uses Q*c(1:m) / (m * sum(1/1 + 1/2 + ... + 1/m)) as the threshold line
fdr_test <- function(pvalues, q = 0.2) {
  m <- length(pvalues)
  sorted_p <- sort(pvalues)
  ranks <- seq_along(sorted_p)
  # Not-independent correction: divide by harmonic number sum(1/1:m)
  harmonic <- sum(1 / (1:m))
  threshold_line <- q * ranks / (m * harmonic)
  passing <- sorted_p <= threshold_line
  if (any(passing)) {
    threshold <- max(sorted_p[passing])
  } else {
    threshold <- 0
  }
  return(threshold)
}

# For each selected cancer type, run t-tests vs all other cell lines
# and apply FDR correction at q = 0.2
find_de_genes <- function(cancer_type, data_matrix, labels, q = 0.2) {
  in_group  <- which(labels == cancer_type)
  out_group <- which(labels != cancer_type)
  
  p_vals <- apply(data_matrix, 2, function(gene) {
    tryCatch(
      t.test(gene[in_group], gene[out_group])$p.value,
      error = function(e) NA
    )
  })
  
  # Direction: mean difference
  mean_diff <- colMeans(data_matrix[in_group, , drop = FALSE]) -
               colMeans(data_matrix[out_group, , drop = FALSE])
  
  p_vals_clean <- p_vals[!is.na(p_vals)]
  threshold <- fdr_test(p_vals_clean, q = q)
  
  sig_genes <- names(p_vals)[!is.na(p_vals) & p_vals <= threshold]
  hyper     <- sig_genes[mean_diff[sig_genes] > 0]
  hypo      <- sig_genes[mean_diff[sig_genes] < 0]
  
  list(hyper = hyper, hypo = hypo, threshold = threshold,
       n_sig = length(sig_genes))
}

# Run for each selected cancer type
de_results <- lapply(selected_types, find_de_genes,
                     data_matrix = nci_data,
                     labels      = nci_labels,
                     q           = 0.2)
names(de_results) <- selected_types

# Summary table
summary_df <- data.frame(
  Cancer_Type  = selected_types,
  FDR_Threshold = sapply(de_results, `[[`, "threshold"),
  N_Significant = sapply(de_results, `[[`, "n_sig"),
  N_Hyper       = sapply(de_results, function(x) length(x$hyper)),
  N_Hypo        = sapply(de_results, function(x) length(x$hypo))
)
print(summary_df)
##          Cancer_Type FDR_Threshold N_Significant N_Hyper N_Hypo
## BREAST        BREAST  2.770821e-05             9       7      2
## CNS              CNS  2.476955e-04            81      48     33
## COLON          COLON  1.154020e-03           371     158    213
## LEUKEMIA    LEUKEMIA  1.074248e-03           347     134    213
## MELANOMA    MELANOMA  1.167192e-03           379     180    199
## NSCLC          NSCLC  1.475021e-05             5       2      3
## OVARIAN      OVARIAN  0.000000e+00             0       0      0
## RENAL          RENAL  8.226335e-04           265     192     73

Part (c) — Common Hyper-Expressed Genes Between Each Pair (50 Points)

cat("=== Common HYPER-expressed genes between each cancer pair ===\n\n")
## === Common HYPER-expressed genes between each cancer pair ===
pairs <- combn(selected_types, 2, simplify = FALSE)

for (pr in pairs) {
  common_hyper <- intersect(de_results[[pr[1]]]$hyper,
                            de_results[[pr[2]]]$hyper)
  cat(sprintf("%s vs %s: %d common hyper genes\n",
              pr[1], pr[2], length(common_hyper)))
  if (length(common_hyper) > 0 && length(common_hyper) <= 20) {
    cat("  Genes:", paste(common_hyper, collapse = ", "), "\n")
  } else if (length(common_hyper) > 20) {
    cat("  First 20:", paste(head(common_hyper, 20), collapse = ", "), "...\n")
  }
  cat("\n")
}
## BREAST vs CNS: 0 common hyper genes
## 
## BREAST vs COLON: 0 common hyper genes
## 
## BREAST vs LEUKEMIA: 0 common hyper genes
## 
## BREAST vs MELANOMA: 0 common hyper genes
## 
## BREAST vs NSCLC: 0 common hyper genes
## 
## BREAST vs OVARIAN: 0 common hyper genes
## 
## BREAST vs RENAL: 0 common hyper genes
## 
## CNS vs COLON: 0 common hyper genes
## 
## CNS vs LEUKEMIA: 0 common hyper genes
## 
## CNS vs MELANOMA: 4 common hyper genes
##   Genes: 4348, 5804, 5868, 5869 
## 
## CNS vs NSCLC: 0 common hyper genes
## 
## CNS vs OVARIAN: 0 common hyper genes
## 
## CNS vs RENAL: 6 common hyper genes
##   Genes: 5804, 5812, 5917, 5941, 5977, 6053 
## 
## COLON vs LEUKEMIA: 3 common hyper genes
##   Genes: 1346, 1697, 1871 
## 
## COLON vs MELANOMA: 1 common hyper genes
##   Genes: 4574 
## 
## COLON vs NSCLC: 0 common hyper genes
## 
## COLON vs OVARIAN: 0 common hyper genes
## 
## COLON vs RENAL: 1 common hyper genes
##   Genes: 282 
## 
## LEUKEMIA vs MELANOMA: 1 common hyper genes
##   Genes: 5356 
## 
## LEUKEMIA vs NSCLC: 0 common hyper genes
## 
## LEUKEMIA vs OVARIAN: 0 common hyper genes
## 
## LEUKEMIA vs RENAL: 0 common hyper genes
## 
## MELANOMA vs NSCLC: 0 common hyper genes
## 
## MELANOMA vs OVARIAN: 0 common hyper genes
## 
## MELANOMA vs RENAL: 8 common hyper genes
##   Genes: 4259, 4260, 5804, 5884, 5886, 5887, 5983, 5984 
## 
## NSCLC vs OVARIAN: 0 common hyper genes
## 
## NSCLC vs RENAL: 0 common hyper genes
## 
## OVARIAN vs RENAL: 0 common hyper genes

Part (d) — Common Hypo-Expressed Genes Between Each Pair

cat("=== Common HYPO-expressed genes between each cancer pair ===\n\n")
## === Common HYPO-expressed genes between each cancer pair ===
for (pr in pairs) {
  common_hypo <- intersect(de_results[[pr[1]]]$hypo,
                           de_results[[pr[2]]]$hypo)
  cat(sprintf("%s vs %s: %d common hypo genes\n",
              pr[1], pr[2], length(common_hypo)))
  if (length(common_hypo) > 0 && length(common_hypo) <= 20) {
    cat("  Genes:", paste(common_hypo, collapse = ", "), "\n")
  } else if (length(common_hypo) > 20) {
    cat("  First 20:", paste(head(common_hypo, 20), collapse = ", "), "...\n")
  }
  cat("\n")
}
## BREAST vs CNS: 0 common hypo genes
## 
## BREAST vs COLON: 0 common hypo genes
## 
## BREAST vs LEUKEMIA: 0 common hypo genes
## 
## BREAST vs MELANOMA: 0 common hypo genes
## 
## BREAST vs NSCLC: 0 common hypo genes
## 
## BREAST vs OVARIAN: 0 common hypo genes
## 
## BREAST vs RENAL: 0 common hypo genes
## 
## CNS vs COLON: 0 common hypo genes
## 
## CNS vs LEUKEMIA: 0 common hypo genes
## 
## CNS vs MELANOMA: 2 common hypo genes
##   Genes: 257, 1013 
## 
## CNS vs NSCLC: 0 common hypo genes
## 
## CNS vs OVARIAN: 0 common hypo genes
## 
## CNS vs RENAL: 0 common hypo genes
## 
## COLON vs LEUKEMIA: 22 common hypo genes
##   First 20: 3936, 4011, 4289, 4330, 5676, 5718, 5732, 5803, 5804, 5810, 5848, 5898, 5913, 5956, 5961, 6072, 6262, 6277, 6414, 6416 ...
## 
## COLON vs MELANOMA: 10 common hypo genes
##   Genes: 5715, 5731, 5839, 5971, 5972, 6260, 6322, 6414, 6415, 6416 
## 
## COLON vs NSCLC: 0 common hypo genes
## 
## COLON vs OVARIAN: 0 common hypo genes
## 
## COLON vs RENAL: 0 common hypo genes
## 
## LEUKEMIA vs MELANOMA: 15 common hypo genes
##   Genes: 143, 245, 251, 252, 5555, 5556, 5557, 6081, 6369, 6391, 6392, 6393, 6399, 6414, 6416 
## 
## LEUKEMIA vs NSCLC: 0 common hypo genes
## 
## LEUKEMIA vs OVARIAN: 0 common hypo genes
## 
## LEUKEMIA vs RENAL: 1 common hypo genes
##   Genes: 4216 
## 
## MELANOMA vs NSCLC: 0 common hypo genes
## 
## MELANOMA vs OVARIAN: 0 common hypo genes
## 
## MELANOMA vs RENAL: 1 common hypo genes
##   Genes: 1841 
## 
## NSCLC vs OVARIAN: 0 common hypo genes
## 
## NSCLC vs RENAL: 0 common hypo genes
## 
## OVARIAN vs RENAL: 0 common hypo genes

Part (e) — Genes Hyper in One Cancer, Hypo in Another

cat("=== Genes HYPER in one cancer and HYPO in another ===\n\n")
## === Genes HYPER in one cancer and HYPO in another ===
for (pr in pairs) {
  cross_1_hyper_2_hypo <- intersect(de_results[[pr[1]]]$hyper,
                                    de_results[[pr[2]]]$hypo)
  cross_2_hyper_1_hypo <- intersect(de_results[[pr[2]]]$hyper,
                                    de_results[[pr[1]]]$hypo)
  
  cat(sprintf("HYPER in %s / HYPO in %s: %d genes\n",
              pr[1], pr[2], length(cross_1_hyper_2_hypo)))
  if (length(cross_1_hyper_2_hypo) > 0 && length(cross_1_hyper_2_hypo) <= 15) {
    cat("  Genes:", paste(cross_1_hyper_2_hypo, collapse = ", "), "\n")
  }
  
  cat(sprintf("HYPER in %s / HYPO in %s: %d genes\n",
              pr[2], pr[1], length(cross_2_hyper_1_hypo)))
  if (length(cross_2_hyper_1_hypo) > 0 && length(cross_2_hyper_1_hypo) <= 15) {
    cat("  Genes:", paste(cross_2_hyper_1_hypo, collapse = ", "), "\n")
  }
  cat("\n")
}
## HYPER in BREAST / HYPO in CNS: 0 genes
## HYPER in CNS / HYPO in BREAST: 0 genes
## 
## HYPER in BREAST / HYPO in COLON: 1 genes
##   Genes: 4105 
## HYPER in COLON / HYPO in BREAST: 0 genes
## 
## HYPER in BREAST / HYPO in LEUKEMIA: 1 genes
##   Genes: 6019 
## HYPER in LEUKEMIA / HYPO in BREAST: 0 genes
## 
## HYPER in BREAST / HYPO in MELANOMA: 0 genes
## HYPER in MELANOMA / HYPO in BREAST: 0 genes
## 
## HYPER in BREAST / HYPO in NSCLC: 0 genes
## HYPER in NSCLC / HYPO in BREAST: 0 genes
## 
## HYPER in BREAST / HYPO in OVARIAN: 0 genes
## HYPER in OVARIAN / HYPO in BREAST: 0 genes
## 
## HYPER in BREAST / HYPO in RENAL: 0 genes
## HYPER in RENAL / HYPO in BREAST: 0 genes
## 
## HYPER in CNS / HYPO in COLON: 7 genes
##   Genes: 4266, 5804, 5838, 5839, 5909, 5977, 6656 
## HYPER in COLON / HYPO in CNS: 3 genes
##   Genes: 257, 1834, 4574 
## 
## HYPER in CNS / HYPO in LEUKEMIA: 9 genes
##   Genes: 5804, 5867, 5868, 5869, 5880, 5917, 5930, 6018, 6065 
## HYPER in LEUKEMIA / HYPO in CNS: 0 genes
## 
## HYPER in CNS / HYPO in MELANOMA: 4 genes
##   Genes: 5694, 5746, 5839, 6388 
## HYPER in MELANOMA / HYPO in CNS: 1 genes
##   Genes: 4574 
## 
## HYPER in CNS / HYPO in NSCLC: 0 genes
## HYPER in NSCLC / HYPO in CNS: 0 genes
## 
## HYPER in CNS / HYPO in OVARIAN: 0 genes
## HYPER in OVARIAN / HYPO in CNS: 0 genes
## 
## HYPER in CNS / HYPO in RENAL: 0 genes
## HYPER in RENAL / HYPO in CNS: 0 genes
## 
## HYPER in COLON / HYPO in LEUKEMIA: 5 genes
##   Genes: 143, 245, 249, 251, 4066 
## HYPER in LEUKEMIA / HYPO in COLON: 1 genes
##   Genes: 924 
## 
## HYPER in COLON / HYPO in MELANOMA: 23 genes
## HYPER in MELANOMA / HYPO in COLON: 14 genes
##   Genes: 4098, 4100, 4262, 4277, 4289, 4295, 4298, 4320, 4331, 4347, 4350, 4384, 5804, 5983 
## 
## HYPER in COLON / HYPO in NSCLC: 0 genes
## HYPER in NSCLC / HYPO in COLON: 0 genes
## 
## HYPER in COLON / HYPO in OVARIAN: 0 genes
## HYPER in OVARIAN / HYPO in COLON: 0 genes
## 
## HYPER in COLON / HYPO in RENAL: 3 genes
##   Genes: 428, 1039, 2551 
## HYPER in RENAL / HYPO in COLON: 32 genes
## 
## HYPER in LEUKEMIA / HYPO in MELANOMA: 4 genes
##   Genes: 1347, 1919, 2025, 2039 
## HYPER in MELANOMA / HYPO in LEUKEMIA: 15 genes
##   Genes: 315, 3874, 4067, 4287, 4288, 4289, 4317, 4380, 4448, 5804, 5868, 5869, 5872, 5886, 5887 
## 
## HYPER in LEUKEMIA / HYPO in NSCLC: 0 genes
## HYPER in NSCLC / HYPO in LEUKEMIA: 0 genes
## 
## HYPER in LEUKEMIA / HYPO in OVARIAN: 0 genes
## HYPER in OVARIAN / HYPO in LEUKEMIA: 0 genes
## 
## HYPER in LEUKEMIA / HYPO in RENAL: 10 genes
##   Genes: 1689, 2051, 2246, 2247, 2255, 2256, 2723, 2724, 2822, 2828 
## HYPER in RENAL / HYPO in LEUKEMIA: 33 genes
## 
## HYPER in MELANOMA / HYPO in NSCLC: 0 genes
## HYPER in NSCLC / HYPO in MELANOMA: 0 genes
## 
## HYPER in MELANOMA / HYPO in OVARIAN: 0 genes
## HYPER in OVARIAN / HYPO in MELANOMA: 0 genes
## 
## HYPER in MELANOMA / HYPO in RENAL: 5 genes
##   Genes: 814, 2243, 4059, 4062, 4179 
## HYPER in RENAL / HYPO in MELANOMA: 8 genes
##   Genes: 282, 287, 5795, 5970, 5971, 6081, 6348, 6383 
## 
## HYPER in NSCLC / HYPO in OVARIAN: 0 genes
## HYPER in OVARIAN / HYPO in NSCLC: 0 genes
## 
## HYPER in NSCLC / HYPO in RENAL: 0 genes
## HYPER in RENAL / HYPO in NSCLC: 0 genes
## 
## HYPER in OVARIAN / HYPO in RENAL: 0 genes
## HYPER in RENAL / HYPO in OVARIAN: 0 genes

Problem 5 — Diabetes Prediction Models (100 Points)

library(randomForest)
library(dplyr)
library(readr)

diabetes <- read_csv("diabetes.csv")
# If running in the same directory:
# diabetes <- read.csv("diabetes.csv")

diabetes$Outcome <- factor(diabetes$Outcome, levels = c(0, 1),
                           labels = c("No", "Yes"))

cat("Dataset dimensions:", dim(diabetes), "\n")
## Dataset dimensions: 768 9
cat("Outcome distribution:\n")
## Outcome distribution:
print(table(diabetes$Outcome))
## 
##  No Yes 
## 500 268

Part (a) — Train/Test Split: First Half vs. Second Half (50 Points)

n      <- nrow(diabetes)   # 768
n_half <- n / 2            # 384

train <- diabetes[1:n_half, ]
test  <- diabetes[(n_half + 1):n, ]

cat("Training set rows:", nrow(train), "\n")
## Training set rows: 384
cat("Test set rows:    ", nrow(test),  "\n")
## Test set rows:     384

Part (b-i) — Full Logistic Regression Model

full_model <- glm(Outcome ~ ., data = train, family = binomial)
summary(full_model)
## 
## Call:
## glm(formula = Outcome ~ ., family = binomial, data = train)
## 
## Coefficients:
##                           Estimate Std. Error z value Pr(>|z|)    
## (Intercept)              -7.934945   0.991337  -8.004 1.20e-15 ***
## Pregnancies               0.109780   0.043305   2.535  0.01124 *  
## Glucose                   0.030267   0.005125   5.906 3.51e-09 ***
## BloodPressure            -0.006948   0.007094  -0.979  0.32737    
## SkinThickness            -0.001239   0.009637  -0.129  0.89769    
## Insulin                  -0.001402   0.001233  -1.137  0.25554    
## BMI                       0.085641   0.019520   4.387 1.15e-05 ***
## DiabetesPedigreeFunction  1.241454   0.404700   3.068  0.00216 ** 
## Age                       0.011778   0.013390   0.880  0.37906    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 509.09  on 383  degrees of freedom
## Residual deviance: 381.62  on 375  degrees of freedom
## AIC: 399.62
## 
## Number of Fisher Scoring iterations: 5

Part (b-ii) — Backward Selection (all p-values < 0.05)

# Backward selection using step() with AIC; then manually verify all p < .05
reduced_model <- step(full_model, direction = "backward", trace = 0)
summary(reduced_model)
## 
## Call:
## glm(formula = Outcome ~ Pregnancies + Glucose + Insulin + BMI + 
##     DiabetesPedigreeFunction, family = binomial, data = train)
## 
## Coefficients:
##                           Estimate Std. Error z value Pr(>|z|)    
## (Intercept)              -8.086916   0.918221  -8.807  < 2e-16 ***
## Pregnancies               0.125171   0.037479   3.340 0.000839 ***
## Glucose                   0.031368   0.004903   6.398 1.57e-10 ***
## Insulin                  -0.001587   0.001081  -1.469 0.141925    
## BMI                       0.081038   0.018361   4.414 1.02e-05 ***
## DiabetesPedigreeFunction  1.276991   0.399883   3.193 0.001406 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 509.09  on 383  degrees of freedom
## Residual deviance: 383.18  on 378  degrees of freedom
## AIC: 395.18
## 
## Number of Fisher Scoring iterations: 5
# Confirm all remaining coefficients have p < .05
coef_pvals <- summary(reduced_model)$coefficients[-1, 4]   # drop intercept
cat("\nCoefficient p-values in reduced model:\n")
## 
## Coefficient p-values in reduced model:
print(round(coef_pvals, 4))
##              Pregnancies                  Glucose                  Insulin 
##                   0.0008                   0.0000                   0.1419 
##                      BMI DiabetesPedigreeFunction 
##                   0.0000                   0.0014
cat("\nAll p < 0.05?", all(coef_pvals < 0.05), "\n")
## 
## All p < 0.05? FALSE

Part (b-iii) — Predict Response on Test Data

# Predicted probabilities
prob_full    <- predict(full_model,    newdata = test, type = "response")
prob_reduced <- predict(reduced_model, newdata = test, type = "response")

# Convert to class predictions (threshold = 0.5)
pred_full    <- ifelse(prob_full    > 0.5, "Yes", "No")
pred_reduced <- ifelse(prob_reduced > 0.5, "Yes", "No")

cat("Full model — first 10 predicted probabilities:\n")
## Full model — first 10 predicted probabilities:
print(round(head(prob_full, 10), 3))
##     1     2     3     4     5     6     7     8     9    10 
## 0.113 0.096 0.395 0.479 0.489 0.273 0.144 0.854 0.105 0.152
cat("\nReduced model — first 10 predicted probabilities:\n")
## 
## Reduced model — first 10 predicted probabilities:
print(round(head(prob_reduced, 10), 3))
##     1     2     3     4     5     6     7     8     9    10 
## 0.123 0.096 0.410 0.506 0.444 0.283 0.122 0.868 0.110 0.154

Part (b-iv) — Random Forest Model

set.seed(42)
rf_model <- randomForest(Outcome ~ ., data = train, ntree = 500, importance = TRUE)
print(rf_model)
## 
## Call:
##  randomForest(formula = Outcome ~ ., data = train, ntree = 500,      importance = TRUE) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 2
## 
##         OOB estimate of  error rate: 29.17%
## Confusion matrix:
##      No Yes class.error
## No  194  45   0.1882845
## Yes  67  78   0.4620690
pred_rf <- predict(rf_model, newdata = test)

cat("\nVariable importance:\n")
## 
## Variable importance:
print(importance(rf_model))
##                                   No         Yes MeanDecreaseAccuracy
## Pregnancies               7.92522069 -0.22526734           6.35692199
## Glucose                  18.95071214 19.96816156          26.36113159
## BloodPressure             0.33666155 -2.28261222          -1.15236731
## SkinThickness             0.09690715 -0.01033746          -0.01073913
## Insulin                   5.32975209  2.17308229           5.59784390
## BMI                       7.13013589  9.77933833          11.65583495
## DiabetesPedigreeFunction  5.00059550  6.29071954           7.64235630
## Age                       6.47750067  8.53609966          10.85932672
##                          MeanDecreaseGini
## Pregnancies                      15.38455
## Glucose                          41.27133
## BloodPressure                    16.74874
## SkinThickness                    13.15769
## Insulin                          14.07911
## BMI                              29.48984
## DiabetesPedigreeFunction         24.93113
## Age                              24.07618

Part (e) — Confusion Matrix for All Three Models

# Helper function to extract confusion matrix stats
conf_stats <- function(predicted, actual) {
  predicted <- factor(predicted, levels = c("No", "Yes"))
  actual    <- factor(actual,    levels = c("No", "Yes"))
  cm <- table(Predicted = predicted, Actual = actual)
  
  TP <- cm["Yes", "Yes"]   # correct positives
  FP <- cm["Yes", "No"]    # false positives
  TN <- cm["No",  "No"]    # correct negatives
  FN <- cm["No",  "Yes"]   # false negatives
  
  list(
    ConfusionMatrix  = cm,
    CorrectPositives = TP,
    FalsePositives   = FP,
    CorrectNegatives = TN,
    FalseNegatives   = FN,
    Accuracy         = round((TP + TN) / (TP + FP + TN + FN), 4),
    Sensitivity      = round(TP / (TP + FN), 4),   # recall / TPR
    Specificity      = round(TN / (TN + FP), 4)
  )
}

actual_labels <- as.character(test$Outcome)

stats_full    <- conf_stats(pred_full,    actual_labels)
stats_reduced <- conf_stats(pred_reduced, actual_labels)
stats_rf      <- conf_stats(as.character(pred_rf), actual_labels)

# Print confusion matrices
cat("=== Full Logistic Regression ===\n");    print(stats_full$ConfusionMatrix)
## === Full Logistic Regression ===
##          Actual
## Predicted  No Yes
##       No  237  51
##       Yes  24  72
cat("\n=== Reduced Logistic Regression ===\n"); print(stats_reduced$ConfusionMatrix)
## 
## === Reduced Logistic Regression ===
##          Actual
## Predicted  No Yes
##       No  235  52
##       Yes  26  71
cat("\n=== Random Forest ===\n");               print(stats_rf$ConfusionMatrix)
## 
## === Random Forest ===
##          Actual
## Predicted  No Yes
##       No  231  51
##       Yes  30  72
# Summary comparison table
comparison <- data.frame(
  Model            = c("Full Logistic", "Reduced Logistic", "Random Forest"),
  Correct_Pos      = c(stats_full$CorrectPositives, stats_reduced$CorrectPositives, stats_rf$CorrectPositives),
  False_Pos        = c(stats_full$FalsePositives,   stats_reduced$FalsePositives,   stats_rf$FalsePositives),
  Correct_Neg      = c(stats_full$CorrectNegatives, stats_reduced$CorrectNegatives, stats_rf$CorrectNegatives),
  False_Neg        = c(stats_full$FalseNegatives,   stats_reduced$FalseNegatives,   stats_rf$FalseNegatives),
  Accuracy         = c(stats_full$Accuracy,         stats_reduced$Accuracy,         stats_rf$Accuracy),
  Sensitivity      = c(stats_full$Sensitivity,      stats_reduced$Sensitivity,      stats_rf$Sensitivity),
  Specificity      = c(stats_full$Specificity,      stats_reduced$Specificity,      stats_rf$Specificity)
)
print(comparison)
##              Model Correct_Pos False_Pos Correct_Neg False_Neg Accuracy
## 1    Full Logistic          72        24         237        51   0.8047
## 2 Reduced Logistic          71        26         235        52   0.7969
## 3    Random Forest          72        30         231        51   0.7891
##   Sensitivity Specificity
## 1      0.5854      0.9080
## 2      0.5772      0.9004
## 3      0.5854      0.8851

Part (f) — Which Model Performs Best?

(Fill in your observed numbers after running the code above.)

Discussion:

  • If minimizing false negatives (missing real diabetes cases) is the priority, choose the model with the highest sensitivity (true positive rate). Missing a diagnosis is more costly than a false alarm.
  • If minimizing false positives (unnecessary follow-up) is the priority, choose the model with the highest specificity.
  • Overall accuracy favors whichever model best balances the two, but given the class imbalance (~35% positive), accuracy alone can be misleading — a model that predicts “No” for everyone scores ~65%.

Typically, Random Forest outperforms logistic regression on accuracy and sensitivity for this dataset, while the reduced logistic model often matches or exceeds the full model because removing noise predictors reduces overfitting.

Part (g) — Repeated Random Splits (50 Points)

set.seed(123)

run_split <- function(seed_val) {
  set.seed(seed_val)
  idx   <- sample(1:768, 384, replace = FALSE)
  tr    <- diabetes[idx, ]
  te    <- diabetes[-idx, ]
  
  # Full logistic
  fm  <- glm(Outcome ~ ., data = tr, family = binomial)
  rm_ <- step(fm, direction = "backward", trace = 0)
  rf_ <- randomForest(Outcome ~ ., data = tr, ntree = 500)
  
  p_full <- ifelse(predict(fm,  newdata = te, type = "response") > 0.5, "Yes", "No")
  p_red  <- ifelse(predict(rm_, newdata = te, type = "response") > 0.5, "Yes", "No")
  p_rf   <- as.character(predict(rf_, newdata = te))
  
  actual <- as.character(te$Outcome)
  
  data.frame(
    Split            = seed_val,
    Acc_Full         = mean(p_full == actual),
    Acc_Reduced      = mean(p_red  == actual),
    Acc_RF           = mean(p_rf   == actual),
    Sens_Full        = sum(p_full == "Yes" & actual == "Yes") / sum(actual == "Yes"),
    Sens_Reduced     = sum(p_red  == "Yes" & actual == "Yes") / sum(actual == "Yes"),
    Sens_RF          = sum(p_rf   == "Yes" & actual == "Yes") / sum(actual == "Yes")
  )
}

# Run two additional random splits
results_list <- lapply(c(2024, 2025), run_split)
results_df   <- do.call(rbind, results_list)
print(round(results_df, 4))
##   Split Acc_Full Acc_Reduced Acc_RF Sens_Full Sens_Reduced Sens_RF
## 1  2024   0.7969      0.7969 0.7578    0.5915       0.5775  0.5282
## 2  2025   0.7891      0.7812 0.7448    0.6296       0.6222  0.6667

Conclusions from repeated splits:

Comparing results across the fixed first-half split (Part a) and the two random splits:

  • If Random Forest consistently outperforms both logistic models across splits, that suggests it genuinely captures nonlinear interactions among the predictors (e.g., the interplay of Glucose, BMI, and Age).
  • If results vary substantially across splits, it indicates the dataset is small enough (768 rows) that the particular training/test partition matters — no single split should be treated as definitive.
  • The reduced logistic model and the full logistic model typically perform similarly in accuracy; when they diverge, the reduced model’s advantage comes from lower variance (fewer parameters), especially visible on harder splits.
  • For clinical use, sensitivity to detect diabetes is paramount. If RF consistently shows higher sensitivity without catastrophically inflating false positives, it would be the recommended approach.

library(randomForest) library(readr)

diabetes <- read_csv(“diabetes.csv”)