Problem 1 — Sample Size for School Vote (25 Points)

A school has 1,600 students voting on whether to convert the school off fossil fuels. How many students would you need to poll to be 95% confident of the outcome within ±0.6% of the vote?

Solution (by hand / formula)

For estimating a proportion with a finite population, the professor’s formula from class (slide 4) is:

\[n = \frac{N}{\left(\frac{ME}{z_{\alpha/2} \cdot \sigma}\right)^2 (N-1) + 1}\]

where \(z = 1.96\) (95% confidence), \(\sigma = \sqrt{p(1-p)} = 0.5\) (maximum variance, most conservative with \(p=0.5\)), \(ME = 0.006\) (±0.6%), and \(N = 1600\).

# Parameters
z   <- 1.96    # 95% confidence z-score
sig <- 0.5     # sigma = sqrt(p*(1-p)), maximized at p = 0.5
ME  <- 0.006   # margin of error = 0.6%
N   <- 1600    # population size

# Direct formula from class slides:
# n = N / ( (ME / (z * sigma))^2 * (N-1) + 1 )
n_finite <- N / ((ME / (z * sig))^2 * (N - 1) + 1)
cat("Finite-population n  =", n_finite, "\n")
## Finite-population n  = 1509.523
cat("Round up to          =", ceiling(n_finite), "\n")
## Round up to          = 1510
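
As a quick sanity check (my own addition, not from the slides), the same number falls out of the textbook two-step route: compute the infinite-population size \(n_0 = (z\sigma/ME)^2\) first, then shrink it with the finite-population correction.

# Sanity check (added): infinite-population n0, then the finite-population
# correction (FPC); algebraically identical to the slide formula above.
n0    <- (z * sig / ME)^2        # about 26677.8 without the FPC
n_fpc <- n0 * N / (n0 + N - 1)   # n = n0*N / (n0 + N - 1)
cat("n0 (infinite population) =", round(n0, 1), "\n")
cat("n with FPC               =", round(n_fpc, 3), "\n")  # should print 1509.523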

Answer

The computed value of 1509.52 rounds up to a required sample size of 1510 students, matching the closest answer choice.

Hand Calculation


Problem 2 — Earthquake Independence: Same Type (20 + 15 Points)

Test whether earthquake occurrence in one time period is statistically independent of occurrence in the next period (same earthquake type). Test at α = 0.01.

Observed Table

                    Earthquake (t+1)   No Earthquake (t+1)   Row Total
Earthquake (t)                   148                   274         422
No Earthquake (t)                276                  2626        2902
Column Total                     424                  2900        3324

Chi-Square Test

# Observed frequency matrix
obs <- matrix(c(148, 274, 276, 2626), nrow = 2, byrow = TRUE,
              dimnames = list(
                c("EQ at t", "No EQ at t"),
                c("EQ at t+1", "No EQ at t+1")
              ))

cat("Observed frequencies:\n")
## Observed frequencies:
print(obs)
##            EQ at t+1 No EQ at t+1
## EQ at t          148          274
## No EQ at t       276         2626
# Chi-square test
result <- chisq.test(obs, correct = FALSE)
print(result)
## 
##  Pearson's Chi-squared test
## 
## data:  obs
## X-squared = 216.29, df = 1, p-value < 2.2e-16

Expected Frequencies

cat("Expected frequencies under independence:\n")
## Expected frequencies under independence:
print(round(result$expected, 3))
##            EQ at t+1 No EQ at t+1
## EQ at t       53.829      368.171
## No EQ at t   370.171     2531.829

Deviation Table (Observed − Expected)

deviations <- obs - result$expected
cat("Deviation table (Observed - Expected):\n")
## Deviation table (Observed - Expected):
print(round(deviations, 3))
##            EQ at t+1 No EQ at t+1
## EQ at t       94.171      -94.171
## No EQ at t   -94.171       94.171

Interpretation

cat("Chi-square statistic:", round(result$statistic, 4), "\n")
## Chi-square statistic: 216.2931
cat("Degrees of freedom:  ", result$parameter, "\n")
## Degrees of freedom:   1
cat("p-value:             ", format(result$p.value, scientific = TRUE), "\n")
## p-value:              5.820944e-49
cat("Critical value (α=0.01, df=1):", round(qchisq(0.99, df=1), 4), "\n")
## Critical value (α=0.01, df=1): 6.6349
cat("\nDecision:", ifelse(result$p.value < 0.01, 
    "REJECT H0 — earthquakes are NOT independent across periods.",
    "Fail to reject H0 — no evidence of dependence."), "\n")
## 
## Decision: REJECT H0 — earthquakes are NOT independent across periods.

Cells with higher-than-expected occurrence (positive deviations): The diagonal cells — (EQ → EQ) and (No EQ → No EQ) — show positive deviations, meaning earthquakes tend to cluster: an earthquake period is more likely to be followed by another earthquake, and a quiet period tends to follow a quiet period.
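
For completeness, a minimal by-hand check (my own sketch, reusing the obs matrix from above) reproduces the statistic directly from the marginal totals:

# Hand check (added): expected counts from the marginals, then Pearson's X^2.
E  <- outer(rowSums(obs), colSums(obs)) / sum(obs)  # E_ij = row_i * col_j / n
X2 <- sum((obs - E)^2 / E)
cat("Hand-computed X^2 =", round(X2, 4), "\n")      # should match 216.2931
cat("p-value           =", pchisq(X2, df = 1, lower.tail = FALSE), "\n")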

Hand Calculation


Problem 3 — Earthquake Independence: Different Types (20 + 10 + 10 Points)

Same test, now for earthquakes of different types following one another (α = 0.01).

Observed Table

               EQ (t+1)   No EQ (t+1)   Row Total
EQ (t)                5           314         319
No EQ (t)           314          2691        3005
Column Total        319          3005        3324

Chi-Square Test

obs2 <- matrix(c(5, 314, 314, 2691), nrow = 2, byrow = TRUE,
               dimnames = list(
                 c("EQ at t", "No EQ at t"),
                 c("EQ at t+1", "No EQ at t+1")
               ))

cat("Observed frequencies:\n")
## Observed frequencies:
print(obs2)
##            EQ at t+1 No EQ at t+1
## EQ at t            5          314
## No EQ at t       314         2691
result2 <- chisq.test(obs2, correct = FALSE)
print(result2)
## 
##  Pearson's Chi-squared test
## 
## data:  obs2
## X-squared = 26.222, df = 1, p-value = 3.043e-07

Deviation Table

deviations2 <- obs2 - result2$expected
cat("Deviation table (Observed - Expected):\n")
## Deviation table (Observed - Expected):
print(round(deviations2, 3))
##            EQ at t+1 No EQ at t+1
## EQ at t      -25.614       25.614
## No EQ at t    25.614      -25.614
cat("\nChi-square statistic:", round(result2$statistic, 4), "\n")
## 
## Chi-square statistic: 26.2221
cat("p-value:             ", format(result2$p.value, scientific = TRUE), "\n")
## p-value:              3.04313e-07
cat("\nDecision:", ifelse(result2$p.value < 0.01,
    "REJECT H0 — different-type earthquakes are NOT independent.",
    "Fail to reject H0."), "\n")
## 
## Decision: REJECT H0 — different-type earthquakes are NOT independent.

Comparison of Deviation Patterns

Cell             Same Type (Prob. 2)      Different Type (Prob. 3)
EQ → EQ          + (more than expected)   − (less than expected)
EQ → No EQ       −                        +
No EQ → EQ       −                        +
No EQ → No EQ    +                        −

Pattern reversal: In Problem 2 (same type), the diagonal cells are inflated, so earthquakes of the same type cluster together. In Problem 3 (different types), the off-diagonal cells are inflated: an earthquake of one type is less likely to be immediately followed by an earthquake of the other type than independence would predict.
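
To make the reversal concrete, the Pearson residuals \((O - E)/\sqrt{E}\) stored by chisq.test carry the sign pattern directly (a small check of my own, reusing result and result2 from above):

# Added check: signed, scale-free residuals for both tables.
cat("Same type (Problem 2):\n")
print(round(result$residuals, 2))    # positive on the diagonal
cat("Different types (Problem 3):\n")
print(round(result2$residuals, 2))   # positive off the diagonal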

Physical interpretation: The two fault-slip types appear to be mutually inhibitory. When a fault ruptures in one direction (say, strike-slip), it relieves stress in a way that makes an immediate rupture in the other direction (dip-slip) less likely. Conversely, consecutive quiet periods remain common. This suggests the two mechanisms draw on the same regional stress budget but suppress each other in the short term.

Hand Calculation


Problem 4 — NCI60 Gene Expression Analysis (100 Points)

library(ISLR)
data(NCI60)

Part (a) — Cancer Types with More Than 3 Cell Lines (10 Points)

table(NCI60$labs)
## 
##      BREAST         CNS       COLON K562A-repro K562B-repro    LEUKEMIA 
##           7           5           7           1           1           6 
## MCF7A-repro MCF7D-repro    MELANOMA       NSCLC     OVARIAN    PROSTATE 
##           1           1           8           9           6           2 
##       RENAL     UNKNOWN 
##           9           1
cancer_counts <- table(NCI60$labs)
cancers_large <- names(cancer_counts[cancer_counts > 3])
cat("Cancers with >3 cell lines:\n")
## Cancers with >3 cell lines:
print(cancers_large)
## [1] "BREAST"   "CNS"      "COLON"    "LEUKEMIA" "MELANOMA" "NSCLC"    "OVARIAN" 
## [8] "RENAL"

Cancers with more than 3 cell lines: BREAST (7), CNS (5), COLON (7), LEUKEMIA (6), MELANOMA (8), NSCLC (9), OVARIAN (6), RENAL (9).

Part (b) — FDR Analysis for Hyper/Hypo Active Genes (40 Points)

# Using the professor's exact fdr.pck code provided in class: the fdr()
# function and myt(), a one-sample t-test against 0. The not-independent
# correction (ind = FALSE) is applied as required.
fdr <- function(v1, Q, ind = F) {
  o1   <- order(v1)
  pvec <- v1[o1]
  m     <- length(v1)
  qline <- Q * c(1:m) / m
  if (!ind) {
    c1    <- sum(1 / (c(1:m)))
    qline <- Q * c(1:m) / (m * c1)
  }
  plot(c(c(1:m), c(1:m)), c(qline, pvec), type = "n",
       xlab = "ordering", ylab = "pvalue")
  lines(c(1:m), qline)
  points(c(1:m), pvec)
  dv  <- pvec - qline
  I22 <- !is.na(dv)
  I1  <- (dv[I22] < 0)
  I0  <- I1
  if (sum(I0) > .5) {
    pmax <- max(pvec[I22][I1])
    I2   <- pvec[I22] <= pmax
    points(c(1:m)[I2], pvec[I2], col = "red")
    out <- list(interesting = o1[I22][I2], ind = ind)
  } else {
    vec <- qbeta(c(.5, .95, .99, .999), 1, length(v1) + 1)
    out <- list(q.5 = vec[1], q.95 = vec[2], q.99 = vec[3], q.999 = vec[4])
  }
  out
}

# myt: one-sample t-test against 0 (professor's exact method)
myt <- function(x) {
  t.test(x, mu = 0)$p.value
}

# Test each cancer type — subset JUST that cancer's data, apply myt per gene
results <- list()
for (cancer in cancers_large) {
  cat("Testing", cancer, "...\n")
  
  cancer_data <- NCI60$data[NCI60$labs == cancer, ]
  pvals <- apply(cancer_data, 2, myt)
  fdr_result <- fdr(pvals, Q = 0.2, ind = FALSE)
  
  if (is.logical(fdr_result$interesting)) {
    interesting <- which(fdr_result$interesting)
  } else {
    interesting <- fdr_result$interesting
  }
  
  results[[cancer]] <- list(
    cancer            = cancer,
    n_samples         = nrow(cancer_data),
    interesting_genes = interesting,
    n_interesting     = length(interesting),
    pvals             = pvals
  )
  cat("  Found", length(interesting), "interesting genes\n\n")
}
## Testing BREAST ...

##   Found 0 interesting genes
## 
## Testing CNS ...

##   Found 0 interesting genes
## 
## Testing COLON ...

##   Found 48 interesting genes
## 
## Testing LEUKEMIA ...

##   Found 25 interesting genes
## 
## Testing MELANOMA ...

##   Found 41 interesting genes
## 
## Testing NSCLC ...

##   Found 0 interesting genes
## 
## Testing OVARIAN ...

##   Found 0 interesting genes
## 
## Testing RENAL ...

##   Found 10 interesting genes
# Summary of results
cat("=== RESULTS AT FDR Q=0.2 (ind=FALSE) ===\n")
## === RESULTS AT FDR Q=0.2 (ind=FALSE) ===
for (cancer in names(results)) {
  cat(sprintf("%-10s: %d genes (n=%d samples)\n",
              cancer, results[[cancer]]$n_interesting, results[[cancer]]$n_samples))
}
## BREAST    : 0 genes (n=7 samples)
## CNS       : 0 genes (n=5 samples)
## COLON     : 48 genes (n=7 samples)
## LEUKEMIA  : 25 genes (n=6 samples)
## MELANOMA  : 41 genes (n=8 samples)
## NSCLC     : 0 genes (n=9 samples)
## OVARIAN   : 0 genes (n=6 samples)
## RENAL     : 10 genes (n=9 samples)

Cancers with hyper/hypo active genes at FDR Q=0.2 (not independent): COLON (48 genes), MELANOMA (41 genes), LEUKEMIA (25 genes), RENAL (10 genes). BREAST, CNS, NSCLC, and OVARIAN had no significant genes at this threshold.
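
As a cross-check (my own addition, not part of the assignment), the ind = FALSE branch of fdr() applies the sum(1/i) penalty of the Benjamini–Yekutieli step-up procedure, which base R exposes as p.adjust(method = "BY"); counting adjusted p-values at or below Q = 0.2 should reproduce the gene counts above.

# Cross-check (added): BY-adjusted p-values at Q = 0.2.
for (cancer in names(results)) {
  n_by <- sum(p.adjust(results[[cancer]]$pvals, method = "BY") <= 0.2)
  cat(sprintf("%-10s: %d genes via BY-adjusted p-values\n", cancer, n_by))
}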

Part (c) — Common Hyper-Expressed Genes Between Each Pair (50 Points)

# Hyper = mean expression > 0 for interesting genes
hyper_genes <- list()
for (cancer in names(results)) {
  cancer_data <- NCI60$data[NCI60$labs == cancer, ]
  means <- colMeans(cancer_data[, results[[cancer]]$interesting_genes, drop = FALSE])
  hyper_genes[[cancer]] <- results[[cancer]]$interesting_genes[means > 0]
}

cat("=== COMMON HYPER-EXPRESSED GENES ===\n")
## === COMMON HYPER-EXPRESSED GENES ===
pairs <- combn(names(results), 2, simplify = FALSE)
for (pair in pairs) {
  c1 <- pair[1]; c2 <- pair[2]
  common_hyper <- intersect(hyper_genes[[c1]], hyper_genes[[c2]])
  cat(sprintf("%s vs %s: %d common hyper-expressed genes\n", c1, c2, length(common_hyper)))
  if (length(common_hyper) > 0) {
    cat("  Genes:", paste(head(common_hyper, 10), collapse = ", "),
        if (length(common_hyper) > 10) "...\n" else "\n")
  }
}
## BREAST vs CNS: 0 common hyper-expressed genes
## BREAST vs COLON: 0 common hyper-expressed genes
## BREAST vs LEUKEMIA: 0 common hyper-expressed genes
## BREAST vs MELANOMA: 0 common hyper-expressed genes
## BREAST vs NSCLC: 0 common hyper-expressed genes
## BREAST vs OVARIAN: 0 common hyper-expressed genes
## BREAST vs RENAL: 0 common hyper-expressed genes
## CNS vs COLON: 0 common hyper-expressed genes
## CNS vs LEUKEMIA: 0 common hyper-expressed genes
## CNS vs MELANOMA: 0 common hyper-expressed genes
## CNS vs NSCLC: 0 common hyper-expressed genes
## CNS vs OVARIAN: 0 common hyper-expressed genes
## CNS vs RENAL: 0 common hyper-expressed genes
## COLON vs LEUKEMIA: 0 common hyper-expressed genes
## COLON vs MELANOMA: 0 common hyper-expressed genes
## COLON vs NSCLC: 0 common hyper-expressed genes
## COLON vs OVARIAN: 0 common hyper-expressed genes
## COLON vs RENAL: 0 common hyper-expressed genes
## LEUKEMIA vs MELANOMA: 0 common hyper-expressed genes
## LEUKEMIA vs NSCLC: 0 common hyper-expressed genes
## LEUKEMIA vs OVARIAN: 0 common hyper-expressed genes
## LEUKEMIA vs RENAL: 0 common hyper-expressed genes
## MELANOMA vs NSCLC: 0 common hyper-expressed genes
## MELANOMA vs OVARIAN: 0 common hyper-expressed genes
## MELANOMA vs RENAL: 0 common hyper-expressed genes
## NSCLC vs OVARIAN: 0 common hyper-expressed genes
## NSCLC vs RENAL: 0 common hyper-expressed genes
## OVARIAN vs RENAL: 0 common hyper-expressed genes

Result: There are no common hyper-expressed genes between any pair of cancers at this threshold. Each cancer type's set of dysregulated genes appears specific to its own biology.
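
A compact alternative (a sketch of my own, not required output) computes every pairwise overlap at once; the diagonal holds each cancer's own hyper-gene count.

# Added sketch: symmetric matrix of pairwise hyper-gene overlap counts.
overlap_hyper <- sapply(hyper_genes, function(a)
  sapply(hyper_genes, function(b) length(intersect(a, b))))
print(overlap_hyper)   # off-diagonal entries should all be 0 here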

Part (d) — Common Hypo-Expressed Genes Between Each Pair

# Hypo = mean expression < 0 for interesting genes
hypo_genes <- list()
for (cancer in names(results)) {
  cancer_data <- NCI60$data[NCI60$labs == cancer, ]
  means <- colMeans(cancer_data[, results[[cancer]]$interesting_genes, drop = FALSE])
  hypo_genes[[cancer]] <- results[[cancer]]$interesting_genes[means < 0]
}

cat("=== COMMON HYPO-EXPRESSED GENES ===\n")
## === COMMON HYPO-EXPRESSED GENES ===
for (pair in pairs) {
  c1 <- pair[1]; c2 <- pair[2]
  common_hypo <- intersect(hypo_genes[[c1]], hypo_genes[[c2]])
  cat(sprintf("%s vs %s: %d common hypo-expressed genes\n", c1, c2, length(common_hypo)))
  if (length(common_hypo) > 0) {
    cat("  Genes:", paste(head(common_hypo, 10), collapse = ", "),
        if (length(common_hypo) > 10) "...\n" else "\n")
  }
}
## BREAST vs CNS: 0 common hypo-expressed genes
## BREAST vs COLON: 0 common hypo-expressed genes
## BREAST vs LEUKEMIA: 0 common hypo-expressed genes
## BREAST vs MELANOMA: 0 common hypo-expressed genes
## BREAST vs NSCLC: 0 common hypo-expressed genes
## BREAST vs OVARIAN: 0 common hypo-expressed genes
## BREAST vs RENAL: 0 common hypo-expressed genes
## CNS vs COLON: 0 common hypo-expressed genes
## CNS vs LEUKEMIA: 0 common hypo-expressed genes
## CNS vs MELANOMA: 0 common hypo-expressed genes
## CNS vs NSCLC: 0 common hypo-expressed genes
## CNS vs OVARIAN: 0 common hypo-expressed genes
## CNS vs RENAL: 0 common hypo-expressed genes
## COLON vs LEUKEMIA: 0 common hypo-expressed genes
## COLON vs MELANOMA: 0 common hypo-expressed genes
## COLON vs NSCLC: 0 common hypo-expressed genes
## COLON vs OVARIAN: 0 common hypo-expressed genes
## COLON vs RENAL: 0 common hypo-expressed genes
## LEUKEMIA vs MELANOMA: 0 common hypo-expressed genes
## LEUKEMIA vs NSCLC: 0 common hypo-expressed genes
## LEUKEMIA vs OVARIAN: 0 common hypo-expressed genes
## LEUKEMIA vs RENAL: 0 common hypo-expressed genes
## MELANOMA vs NSCLC: 0 common hypo-expressed genes
## MELANOMA vs OVARIAN: 0 common hypo-expressed genes
## MELANOMA vs RENAL: 0 common hypo-expressed genes
## NSCLC vs OVARIAN: 0 common hypo-expressed genes
## NSCLC vs RENAL: 0 common hypo-expressed genes
## OVARIAN vs RENAL: 0 common hypo-expressed genes

Result: There are likewise no common hypo-expressed genes between any pair of cancers. The conservative not-independent (Benjamini–Yekutieli) correction at Q = 0.2, combined with cancer-specific biology, leaves no shared hypo-expressed genes.

Part (e) — Genes Hyper in One Cancer, Hypo in Another

cat("=== HYPER IN ONE, HYPO IN ANOTHER ===\n")
## === HYPER IN ONE, HYPO IN ANOTHER ===
# Loop over unordered pairs only; capping the outer index at length - 1
# avoids the reversed (i + 1):length sequence when i reaches the last cancer.
cancers <- names(results)
for (i in 1:(length(cancers) - 1)) {
  for (j in (i + 1):length(cancers)) {
    c1 <- cancers[i]; c2 <- cancers[j]
    
    hyper_c1_hypo_c2 <- intersect(hyper_genes[[c1]], hypo_genes[[c2]])
    hyper_c2_hypo_c1 <- intersect(hyper_genes[[c2]], hypo_genes[[c1]])
    
    if (length(hyper_c1_hypo_c2) > 0) {
      cat(sprintf("%s HYPER & %s HYPO: %d genes\n", c1, c2, length(hyper_c1_hypo_c2)))
      cat("  Genes:", paste(head(hyper_c1_hypo_c2, 5), collapse = ", "), "\n")
    }
    if (length(hyper_c2_hypo_c1) > 0) {
      cat(sprintf("%s HYPER & %s HYPO: %d genes\n", c2, c1, length(hyper_c2_hypo_c1)))
      cat("  Genes:", paste(head(hyper_c2_hypo_c1, 5), collapse = ", "), "\n")
    }
  }
}
## COLON HYPER & MELANOMA HYPO: 1 genes
##   Genes: 247 
## MELANOMA HYPER & COLON HYPO: 1 genes
##   Genes: 4384

Result: Gene 247 is hyper-expressed in COLON and hypo-expressed in MELANOMA; gene 4384 shows the reverse, hyper-expressed in MELANOMA and hypo-expressed in COLON. These opposite expression patterns point to cancer-specific regulation of these genes in the COLON and MELANOMA cell lines.
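
To see the reversal in the raw data, one can pull the mean expression of the two genes in each cancer (my own illustration, using the gene column indices reported above):

# Added illustration: opposite-signed means expected for genes 247 and 4384
# (positive in the hyper-expressed cancer, negative in the hypo-expressed one).
for (g in c(247, 4384)) {
  cat(sprintf("Gene %d: COLON mean = %.3f, MELANOMA mean = %.3f\n", g,
              mean(NCI60$data[NCI60$labs == "COLON", g]),
              mean(NCI60$data[NCI60$labs == "MELANOMA", g])))
}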


Problem 5 — Diabetes Prediction Models (100 Points)

library(randomForest)
library(readr)

diabetes <- read_csv("diabetes.csv")
cat("Total rows:", nrow(diabetes), "\n\n")
## Total rows: 768
# Make outcome factor (0=healthy, 1=diabetes)
diabetes$Outcome <- as.factor(diabetes$Outcome)

# 2x2 Confusion matrix function (professor's exact method)
make_2x2_diabetes <- function(pred_probs, actuals, label) {
  act  <- as.numeric(actuals) - 1       # factor 1,2 -> 0,1
  pred <- as.numeric(pred_probs > 0.5)
  
  TP <- sum(pred * act)              # Correct positives (diabetes detected)
  FN <- sum((1 - pred) * act)        # False negatives (missed diabetes)
  FP <- sum(pred * (1 - act))        # False positives (healthy called diabetic)
  TN <- sum((1 - pred) * (1 - act))  # Correct negatives (healthy correctly identified)
  
  mat <- matrix(c(TP, FN, FP, TN), nrow = 2, byrow = TRUE,
                dimnames = list(Predicted = c("Diabetes", "Healthy"),
                                Actual    = c("Diabetes", "Healthy")))
  cat("\n", label, "\n")
  cat("--------------------------------\n")
  print(mat)
  cat(sprintf("  Accuracy : %.4f\n", (TP + TN) / (TP + TN + FP + FN)))
  cat(sprintf("  Precision: %.4f\n", TP / (TP + FP)))
  cat(sprintf("  Recall   : %.4f\n", TP / (TP + FN)))
  
  return(c(TP = TP, FN = FN, FP = FP, TN = TN))
}

Part (a) — Train/Test Split: First 384 vs. Last 384

cat("=== ANALYSIS 1: FIRST 384 TRAIN / SECOND 384 TEST ===\n\n")
## === ANALYSIS 1: FIRST 384 TRAIN / SECOND 384 TEST ===
train_fixed <- diabetes[1:384, ]
test_fixed  <- diabetes[385:768, ]

cat("Training set:", nrow(train_fixed), "rows\n")
## Training set: 384 rows
cat("Test set:    ", nrow(test_fixed),  "rows\n")
## Test set:     384 rows
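
Since this split follows row order rather than random assignment, it is worth confirming that the two halves have comparable outcome rates (a check of my own, not part of the assignment):

# Added check: class balance in each half of the non-random split.
cat("Train outcome proportions:\n")
print(round(prop.table(table(train_fixed$Outcome)), 3))
cat("Test outcome proportions:\n")
print(round(prop.table(table(test_fixed$Outcome)), 3))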

Part (b-i) — Full Logistic Regression Model

cat("=== b.i: FULL LOGISTIC MODEL ===\n")
## === b.i: FULL LOGISTIC MODEL ===
full_logit <- glm(Outcome ~ ., data = train_fixed, family = binomial)
summary(full_logit)
## 
## Call:
## glm(formula = Outcome ~ ., family = binomial, data = train_fixed)
## 
## Coefficients:
##                           Estimate Std. Error z value Pr(>|z|)    
## (Intercept)              -7.934945   0.991337  -8.004 1.20e-15 ***
## Pregnancies               0.109780   0.043305   2.535  0.01124 *  
## Glucose                   0.030267   0.005125   5.906 3.51e-09 ***
## BloodPressure            -0.006948   0.007094  -0.979  0.32737    
## SkinThickness            -0.001239   0.009637  -0.129  0.89769    
## Insulin                  -0.001402   0.001233  -1.137  0.25554    
## BMI                       0.085641   0.019520   4.387 1.15e-05 ***
## DiabetesPedigreeFunction  1.241454   0.404700   3.068  0.00216 ** 
## Age                       0.011778   0.013390   0.880  0.37906    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 509.09  on 383  degrees of freedom
## Residual deviance: 381.62  on 375  degrees of freedom
## AIC: 399.62
## 
## Number of Fisher Scoring iterations: 5

Part (b-ii) — Backward Selection (p < 0.05)

cat("=== b.ii: SMALLEST MODEL (backward selection) ===\n")
## === b.ii: SMALLEST MODEL (backward selection) ===
small_logit <- step(full_logit, direction = "backward", trace = FALSE)
summary(small_logit)
## 
## Call:
## glm(formula = Outcome ~ Pregnancies + Glucose + Insulin + BMI + 
##     DiabetesPedigreeFunction, family = binomial, data = train_fixed)
## 
## Coefficients:
##                           Estimate Std. Error z value Pr(>|z|)    
## (Intercept)              -8.086916   0.918221  -8.807  < 2e-16 ***
## Pregnancies               0.125171   0.037479   3.340 0.000839 ***
## Glucose                   0.031368   0.004903   6.398 1.57e-10 ***
## Insulin                  -0.001587   0.001081  -1.469 0.141925    
## BMI                       0.081038   0.018361   4.414 1.02e-05 ***
## DiabetesPedigreeFunction  1.276991   0.399883   3.193 0.001406 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 509.09  on 383  degrees of freedom
## Residual deviance: 383.18  on 378  degrees of freedom
## AIC: 395.18
## 
## Number of Fisher Scoring iterations: 5

Part (c) — Predict Response on Test Data

p_full  <- predict(full_logit,  newdata = test_fixed, type = "response")
p_small <- predict(small_logit, newdata = test_fixed, type = "response")

cat("Full model — first 10 predicted probabilities:\n")
## Full model — first 10 predicted probabilities:
print(round(head(p_full, 10), 3))
##     1     2     3     4     5     6     7     8     9    10 
## 0.113 0.096 0.395 0.479 0.489 0.273 0.144 0.854 0.105 0.152
cat("\nSmall model — first 10 predicted probabilities:\n")
## 
## Small model — first 10 predicted probabilities:
print(round(head(p_small, 10), 3))
##     1     2     3     4     5     6     7     8     9    10 
## 0.123 0.096 0.410 0.506 0.444 0.283 0.122 0.868 0.110 0.154

Part (d) — Random Forest Model

rf_model <- randomForest(Outcome ~ ., data = train_fixed, ntree = 500)
p_rf     <- predict(rf_model, newdata = test_fixed, type = "prob")[, 2]

cat("Random Forest — first 10 predicted probabilities:\n")
## Random Forest — first 10 predicted probabilities:
print(round(head(p_rf, 10), 3))
##     1     2     3     4     5     6     7     8     9    10 
## 0.174 0.094 0.446 0.378 0.534 0.426 0.440 0.816 0.130 0.102

Part (e) — 2x2 Confusion Matrices for All 3 Models

cat("=== e: 2x2 CONFUSION MATRICES ===\n")
## === e: 2x2 CONFUSION MATRICES ===
r_full  <- make_2x2_diabetes(p_full,  test_fixed$Outcome, "Full Logistic")
## 
##  Full Logistic 
## --------------------------------
##           Actual
## Predicted  Diabetes Healthy
##   Diabetes       72      51
##   Healthy        24     237
##   Accuracy : 0.8047
##   Precision: 0.7500
##   Recall   : 0.5854
r_small <- make_2x2_diabetes(p_small, test_fixed$Outcome, "Small Logistic")
## 
##  Small Logistic 
## --------------------------------
##           Actual
## Predicted  Diabetes Healthy
##   Diabetes       71      52
##   Healthy        26     235
##   Accuracy : 0.7969
##   Precision: 0.7320
##   Recall   : 0.5772
r_rf    <- make_2x2_diabetes(p_rf,    test_fixed$Outcome, "Random Forest")
## 
##  Random Forest 
## --------------------------------
##           Actual
## Predicted  Diabetes Healthy
##   Diabetes       70      53
##   Healthy        29     232
##   Accuracy : 0.7865
##   Precision: 0.7071
##   Recall   : 0.5691
results_fixed <- rbind(r_full, r_small, r_rf)
rownames(results_fixed) <- c("Full_Logistic", "Small_Logistic", "RandomForest")
cat("\n=== SUMMARY TABLE (Fixed Split) ===\n")
## 
## === SUMMARY TABLE (Fixed Split) ===
print(results_fixed)
##                TP FN FP  TN
## Full_Logistic  72 51 24 237
## Small_Logistic 71 52 26 235
## RandomForest   70 53 29 232

Part (f) — Which Model Performs Best?

Full Logistic performs best overall. On the fixed split it has the highest true-positive count (TP) and the lowest false-positive count (FP), meaning it catches the most diabetes cases while raising the fewest false alarms.

  • If identifying diabetes is the priority → Full Logistic (highest TP)
  • If avoiding false alarms is the priority → Full Logistic (lowest FP)
  • Random Forest underperforms here likely due to the small training set (384 rows); Random Forest typically needs larger datasets for its bagging advantage to emerge.
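
The three headline metrics can also be derived straight from the stored counts rather than re-read from the printouts (a small sketch of my own, reusing results_fixed):

# Added sketch: accuracy/precision/recall from the TP/FN/FP/TN summary table.
metrics <- t(apply(results_fixed, 1, function(r) {
  TP <- r["TP"]; FN <- r["FN"]; FP <- r["FP"]; TN <- r["TN"]
  c(Accuracy  = unname((TP + TN) / (TP + FN + FP + TN)),
    Precision = unname(TP / (TP + FP)),
    Recall    = unname(TP / (TP + FN)))
}))
print(round(metrics, 4))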

Part (g) — Two Random Splits

# Random Split 1
cat("=== g. ANALYSIS 2: RANDOM SPLIT 1 ===\n")
## === g. ANALYSIS 2: RANDOM SPLIT 1 ===
set.seed(123)
train1_idx  <- sample(1:768, 384)
train_rand1 <- diabetes[train1_idx, ]
test_rand1  <- diabetes[-train1_idx, ]

full1  <- glm(Outcome ~ ., data = train_rand1, family = binomial)
small1 <- step(full1, direction = "backward", trace = FALSE)
rf1    <- randomForest(Outcome ~ ., data = train_rand1, ntree = 500)

p_full1  <- predict(full1,  newdata = test_rand1, type = "response")
p_small1 <- predict(small1, newdata = test_rand1, type = "response")
p_rf1    <- predict(rf1,    newdata = test_rand1, type = "prob")[, 2]

r_full1  <- make_2x2_diabetes(p_full1,  test_rand1$Outcome, "Full Logistic")
## 
##  Full Logistic 
## --------------------------------
##           Actual
## Predicted  Diabetes Healthy
##   Diabetes       79      55
##   Healthy        24     226
##   Accuracy : 0.7943
##   Precision: 0.7670
##   Recall   : 0.5896
r_small1 <- make_2x2_diabetes(p_small1, test_rand1$Outcome, "Small Logistic")
## 
##  Small Logistic 
## --------------------------------
##           Actual
## Predicted  Diabetes Healthy
##   Diabetes       79      55
##   Healthy        23     227
##   Accuracy : 0.7969
##   Precision: 0.7745
##   Recall   : 0.5896
r_rf1    <- make_2x2_diabetes(p_rf1,    test_rand1$Outcome, "Random Forest")
## 
##  Random Forest 
## --------------------------------
##           Actual
## Predicted  Diabetes Healthy
##   Diabetes       80      54
##   Healthy        35     215
##   Accuracy : 0.7682
##   Precision: 0.6957
##   Recall   : 0.5970
results_rand1 <- rbind(r_full1, r_small1, r_rf1)
rownames(results_rand1) <- c("Full_Logistic", "Small_Logistic", "RandomForest")
cat("\n=== SUMMARY TABLE (Random Split 1) ===\n")
## 
## === SUMMARY TABLE (Random Split 1) ===
print(results_rand1)
##                TP FN FP  TN
## Full_Logistic  79 55 24 226
## Small_Logistic 79 55 23 227
## RandomForest   80 54 35 215
# Random Split 2
cat("\n=== g. ANALYSIS 3: RANDOM SPLIT 2 ===\n")
## 
## === g. ANALYSIS 3: RANDOM SPLIT 2 ===
set.seed(456)
train2_idx  <- sample(1:768, 384)
train_rand2 <- diabetes[train2_idx, ]
test_rand2  <- diabetes[-train2_idx, ]

full2  <- glm(Outcome ~ ., data = train_rand2, family = binomial)
small2 <- step(full2, direction = "backward", trace = FALSE)
rf2    <- randomForest(Outcome ~ ., data = train_rand2, ntree = 500)

p_full2  <- predict(full2,  newdata = test_rand2, type = "response")
p_small2 <- predict(small2, newdata = test_rand2, type = "response")
p_rf2    <- predict(rf2,    newdata = test_rand2, type = "prob")[, 2]

r_full2  <- make_2x2_diabetes(p_full2,  test_rand2$Outcome, "Full Logistic")
## 
##  Full Logistic 
## --------------------------------
##           Actual
## Predicted  Diabetes Healthy
##   Diabetes       78      61
##   Healthy        30     215
##   Accuracy : 0.7630
##   Precision: 0.7222
##   Recall   : 0.5612
r_small2 <- make_2x2_diabetes(p_small2, test_rand2$Outcome, "Small Logistic")
## 
##  Small Logistic 
## --------------------------------
##           Actual
## Predicted  Diabetes Healthy
##   Diabetes       76      63
##   Healthy        30     215
##   Accuracy : 0.7578
##   Precision: 0.7170
##   Recall   : 0.5468
r_rf2    <- make_2x2_diabetes(p_rf2,    test_rand2$Outcome, "Random Forest")
## 
##  Random Forest 
## --------------------------------
##           Actual
## Predicted  Diabetes Healthy
##   Diabetes       74      65
##   Healthy        40     205
##   Accuracy : 0.7266
##   Precision: 0.6491
##   Recall   : 0.5324
results_rand2 <- rbind(r_full2, r_small2, r_rf2)
rownames(results_rand2) <- c("Full_Logistic", "Small_Logistic", "RandomForest")
cat("\n=== SUMMARY TABLE (Random Split 2) ===\n")
## 
## === SUMMARY TABLE (Random Split 2) ===
print(results_rand2)
##                TP FN FP  TN
## Full_Logistic  78 61 30 215
## Small_Logistic 76 63 30 215
## RandomForest   74 65 40 205

Conclusions from repeated splits:

Across all three splits, the logistic models proved more consistent than Random Forest. While all models performed similarly in detecting true diabetes cases, Random Forest showed greater variability in false positives as the training data changed. This pattern suggests that for a dataset of this size, logistic regression — whether full or reduced — is the more dependable modeling choice for predicting new patient outcomes.
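
To quantify that stability claim (my own addition), the false-positive counts from the three summary tables can be lined up side by side:

# Added sketch: FP counts per model across the three splits.
fp_by_split <- cbind(Fixed = results_fixed[, "FP"],
                     Rand1 = results_rand1[, "FP"],
                     Rand2 = results_rand2[, "FP"])
print(fp_by_split)
cat("FP spread (max - min) per model:\n")
print(apply(fp_by_split, 1, function(x) diff(range(x))))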


Results Analysis Summary

Part (a) — Data Split

I split the 768 observations into two equal halves — rows 1 through 384 became my training set, and rows 385 through 768 were held out for testing.

Part (b-i) — Full Logistic Model Findings

Running the full logistic regression revealed that Glucose level was the single strongest predictor of diabetes (p=3.51e-09), followed by BMI (p=1.15e-05) and the Diabetes Pedigree Function (p=0.00216). Pregnancies also reached significance. Blood pressure, skin thickness, insulin, and age did not contribute meaningfully once the other variables were accounted for.

Part (b-ii) — Backward Selection Findings

The backward selection process trimmed the model down to five predictors: Glucose, BMI, DiabetesPedigreeFunction, Pregnancies, and Insulin. Interestingly, Insulin stayed in the model despite a p-value above 0.05 — this is because step() optimizes AIC rather than p-values directly, so variables with borderline significance can remain if they still improve overall fit.
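
For contrast, a strictly p-value-driven backward pass (a hypothetical drop_p helper of my own, not the professor's method; it assumes every predictor is a single numeric term so coefficient names match term names) would keep dropping the least significant predictor until everything clears 0.05, which would also eliminate Insulin:

# Added sketch: p-value-based backward elimination (hypothetical helper).
drop_p <- function(fit, alpha = 0.05) {
  repeat {
    p <- coef(summary(fit))[-1, "Pr(>|z|)"]   # skip the intercept
    if (max(p) < alpha) return(fit)
    worst <- names(which.max(p))              # least significant predictor
    fit <- update(fit, as.formula(paste(". ~ . -", worst)))
  }
}
# Usage: summary(drop_p(small_logit))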

Part (c & d) — Model Performance on Test Data

Model            Accuracy   Precision   Recall
Full Logistic       80.5%       75.0%    58.5%
Small Logistic      79.7%       73.2%    57.7%
Random Forest       78.7%       70.7%    56.9%

All three models achieved similar overall accuracy, but differed in how they balanced precision and recall.

Part (e) — 2x2 Table Summary (Fixed Split)

Model            TP   FN   FP    TN
Full Logistic    72   51   24   237
Small Logistic   71   52   26   235
Random Forest    70   53   29   232

Part (f) — Best Model Discussion

No single model dominates across every metric, and the answer depends on what matters most clinically. The Full Logistic model had both the most true positives (72) and the fewest false positives (24), making it the safest choice when avoiding unnecessary follow-up is important. Random Forest trailed slightly on both counts (70 true positives, 29 false positives). For a screening tool where missing a real case is the bigger concern, the differences between models are small enough that simplicity favors the logistic approach.

Part (g) — Insights from Random Splits

Split                 Full TP/FP   Small TP/FP   RF TP/FP
Fixed (rows 1-384)         72/24         71/26      70/29
Random Split 1             79/24         79/23      80/35
Random Split 2             78/30         76/30      74/40

The random splits revealed something important: with only 768 observations, which patients land in training versus testing matters a great deal. The logistic models held steady across all three splits, while Random Forest's false-positive count jumped considerably from split to split. This instability points to overfitting: Random Forest builds complex trees that fit whatever data they were trained on but generalize less reliably to new observations when the dataset is this small.