setwd(“~/Downloads”) library(randomForest) library(readr)
diabetes <- read_csv(“diabetes.csv”)
A school has 1,600 students voting on whether to convert the school off fossil fuels. How many students would you need to poll to be 95% confident of the outcome within ±0.6% of the vote?
For estimating a proportion with a finite population, the professor’s formula from class (slide 4) is:
\[n = \frac{N}{\left(\frac{ME}{z_{\alpha/2} \cdot \sigma}\right)^2 (N-1) + 1}\]
where \(z = 1.96\) (95% confidence), \(\sigma = \sqrt{p(1-p)} = 0.5\) (maximum variance, most conservative with \(p=0.5\)), \(ME = 0.006\) (±0.6%), and \(N = 1600\).
# Parameters
z <- 1.96 # 95% confidence z-score
sig <- 0.5 # sigma = sqrt(p*(1-p)), maximized at p = 0.5
ME <- 0.006 # margin of error = 0.6%
N <- 1600 # population size
# Direct formula from class slides:
# n = N / ( (ME / (z * sigma))^2 * (N-1) + 1 )
n_finite <- N / ((ME / (z * sig))^2 * (N - 1) + 1)
cat("Finite-population n =", n_finite, "\n")## Finite-population n = 1509.523
## Round up to = 1510
The required sample size is 1510 students (closest answer choice: 1510).
Test whether earthquake occurrence in one time period is statistically independent of occurrence in the next period (same earthquake type). Test at α = 0.01.
| Earthquake (t+1) | No Earthquake (t+1) | Row Total | |
|---|---|---|---|
| Earthquake (t) | 148 | 274 | 422 |
| No Earthquake (t) | 276 | 2626 | 2902 |
| Column Total | 424 | 2900 | 3324 |
# Observed frequency matrix
obs <- matrix(c(148, 274, 276, 2626), nrow = 2, byrow = TRUE,
dimnames = list(
c("EQ at t", "No EQ at t"),
c("EQ at t+1", "No EQ at t+1")
))
cat("Observed frequencies:\n")## Observed frequencies:
## EQ at t+1 No EQ at t+1
## EQ at t 148 274
## No EQ at t 276 2626
##
## Pearson's Chi-squared test
##
## data: obs
## X-squared = 216.29, df = 1, p-value < 2.2e-16
## Expected frequencies under independence:
## EQ at t+1 No EQ at t+1
## EQ at t 53.829 368.171
## No EQ at t 370.171 2531.829
## Deviation table (Observed - Expected):
## EQ at t+1 No EQ at t+1
## EQ at t 94.171 -94.171
## No EQ at t -94.171 94.171
## Chi-square statistic: 216.2931
## Degrees of freedom: 1
## p-value: 5.820944e-49
## Critical value (α=0.01, df=1): 6.6349
cat("\nDecision:", ifelse(result$p.value < 0.01,
"REJECT H0 — earthquakes are NOT independent across periods.",
"Fail to reject H0 — no evidence of dependence."), "\n")##
## Decision: REJECT H0 — earthquakes are NOT independent across periods.
Cells with higher-than-expected occurrence (positive deviations): The diagonal cells — (EQ → EQ) and (No EQ → No EQ) — show positive deviations, meaning earthquakes tend to cluster: an earthquake period is more likely to be followed by another earthquake, and a quiet period tends to follow a quiet period.
Same test, now for earthquakes of different types following one another (α = 0.01).
| EQ (t+1) | No EQ (t+1) | Row Total | |
|---|---|---|---|
| EQ (t) | 5 | 314 | 319 |
| No EQ (t) | 314 | 2691 | 3005 |
| Column Total | 319 | 3005 | 3324 |
obs2 <- matrix(c(5, 314, 314, 2691), nrow = 2, byrow = TRUE,
dimnames = list(
c("EQ at t", "No EQ at t"),
c("EQ at t+1", "No EQ at t+1")
))
cat("Observed frequencies:\n")## Observed frequencies:
## EQ at t+1 No EQ at t+1
## EQ at t 5 314
## No EQ at t 314 2691
##
## Pearson's Chi-squared test
##
## data: obs2
## X-squared = 26.222, df = 1, p-value = 3.043e-07
## Deviation table (Observed - Expected):
## EQ at t+1 No EQ at t+1
## EQ at t -25.614 25.614
## No EQ at t 25.614 -25.614
##
## Chi-square statistic: 26.2221
## p-value: 3.04313e-07
cat("\nDecision:", ifelse(result2$p.value < 0.01,
"REJECT H0 — different-type earthquakes are NOT independent.",
"Fail to reject H0."), "\n")##
## Decision: REJECT H0 — different-type earthquakes are NOT independent.
| Cell | Same Type (Prob. 2) | Different Type (Prob. 3) |
|---|---|---|
| EQ → EQ | + (more than expected) | − (less than expected) |
| EQ → No EQ | − | + |
| No EQ → EQ | − | + |
| No EQ → No EQ | + | − |
Pattern reversal: In Problem 2 (same type), the diagonal cells are inflated — earthquakes of the same type cluster together. In Problem 3 (different types), the off-diagonal cells are inflated — a same-type earthquake is less likely to be followed by a different-type earthquake than independence would predict.
Physical interpretation: The two fault-slip types appear to be mutually inhibitory. When a fault ruptures in one direction (say, strike-slip), it relieves stress in a way that makes an immediate rupture in the other direction (dip-slip) less likely. Conversely, consecutive quiet periods remain common. This suggests the two mechanisms draw on the same regional stress budget but suppress each other in the short term.
# Load required libraries
library(ISLR)
library(dplyr)
# Load the NCI60 dataset
data(NCI60)
nci_data <- NCI60$data # 64 cell lines x 6830 genes
nci_labels <- NCI60$labs # cancer type labels
cat("Dimensions of NCI60 data:", dim(nci_data), "\n")## Dimensions of NCI60 data: 64 6830
## Cancer types present:
## nci_labels
## BREAST CNS COLON K562A-repro K562B-repro LEUKEMIA
## 7 5 7 1 1 6
## MCF7A-repro MCF7D-repro MELANOMA NSCLC OVARIAN PROSTATE
## 1 1 8 9 6 2
## RENAL UNKNOWN
## 9 1
type_counts <- table(nci_labels)
selected_types <- names(type_counts[type_counts > 3])
cat("Cancer types with more than 3 cell lines:\n")## Cancer types with more than 3 cell lines:
## nci_labels
## BREAST CNS COLON LEUKEMIA MELANOMA NSCLC OVARIAN RENAL
## 7 5 7 6 8 9 6 9
##
## Selected types: BREAST, CNS, COLON, LEUKEMIA, MELANOMA, NSCLC, OVARIAN, RENAL
# FDR function (from class) — NOT independent version (slide 27)
# Uses Q*c(1:m) / (m * sum(1/1 + 1/2 + ... + 1/m)) as the threshold line
fdr_test <- function(pvalues, q = 0.2) {
m <- length(pvalues)
sorted_p <- sort(pvalues)
ranks <- seq_along(sorted_p)
# Not-independent correction: divide by harmonic number sum(1/1:m)
harmonic <- sum(1 / (1:m))
threshold_line <- q * ranks / (m * harmonic)
passing <- sorted_p <= threshold_line
if (any(passing)) {
threshold <- max(sorted_p[passing])
} else {
threshold <- 0
}
return(threshold)
}
# For each selected cancer type, run t-tests vs all other cell lines
# and apply FDR correction at q = 0.2
find_de_genes <- function(cancer_type, data_matrix, labels, q = 0.2) {
in_group <- which(labels == cancer_type)
out_group <- which(labels != cancer_type)
p_vals <- apply(data_matrix, 2, function(gene) {
tryCatch(
t.test(gene[in_group], gene[out_group])$p.value,
error = function(e) NA
)
})
# Direction: mean difference
mean_diff <- colMeans(data_matrix[in_group, , drop = FALSE]) -
colMeans(data_matrix[out_group, , drop = FALSE])
p_vals_clean <- p_vals[!is.na(p_vals)]
threshold <- fdr_test(p_vals_clean, q = q)
sig_genes <- names(p_vals)[!is.na(p_vals) & p_vals <= threshold]
hyper <- sig_genes[mean_diff[sig_genes] > 0]
hypo <- sig_genes[mean_diff[sig_genes] < 0]
list(hyper = hyper, hypo = hypo, threshold = threshold,
n_sig = length(sig_genes))
}
# Run for each selected cancer type
de_results <- lapply(selected_types, find_de_genes,
data_matrix = nci_data,
labels = nci_labels,
q = 0.2)
names(de_results) <- selected_types
# Summary table
summary_df <- data.frame(
Cancer_Type = selected_types,
FDR_Threshold = sapply(de_results, `[[`, "threshold"),
N_Significant = sapply(de_results, `[[`, "n_sig"),
N_Hyper = sapply(de_results, function(x) length(x$hyper)),
N_Hypo = sapply(de_results, function(x) length(x$hypo))
)
print(summary_df)## Cancer_Type FDR_Threshold N_Significant N_Hyper N_Hypo
## BREAST BREAST 2.770821e-05 9 7 2
## CNS CNS 2.476955e-04 81 48 33
## COLON COLON 1.154020e-03 371 158 213
## LEUKEMIA LEUKEMIA 1.074248e-03 347 134 213
## MELANOMA MELANOMA 1.167192e-03 379 180 199
## NSCLC NSCLC 1.475021e-05 5 2 3
## OVARIAN OVARIAN 0.000000e+00 0 0 0
## RENAL RENAL 8.226335e-04 265 192 73
## === Common HYPER-expressed genes between each cancer pair ===
pairs <- combn(selected_types, 2, simplify = FALSE)
for (pr in pairs) {
common_hyper <- intersect(de_results[[pr[1]]]$hyper,
de_results[[pr[2]]]$hyper)
cat(sprintf("%s vs %s: %d common hyper genes\n",
pr[1], pr[2], length(common_hyper)))
if (length(common_hyper) > 0 && length(common_hyper) <= 20) {
cat(" Genes:", paste(common_hyper, collapse = ", "), "\n")
} else if (length(common_hyper) > 20) {
cat(" First 20:", paste(head(common_hyper, 20), collapse = ", "), "...\n")
}
cat("\n")
}## BREAST vs CNS: 0 common hyper genes
##
## BREAST vs COLON: 0 common hyper genes
##
## BREAST vs LEUKEMIA: 0 common hyper genes
##
## BREAST vs MELANOMA: 0 common hyper genes
##
## BREAST vs NSCLC: 0 common hyper genes
##
## BREAST vs OVARIAN: 0 common hyper genes
##
## BREAST vs RENAL: 0 common hyper genes
##
## CNS vs COLON: 0 common hyper genes
##
## CNS vs LEUKEMIA: 0 common hyper genes
##
## CNS vs MELANOMA: 4 common hyper genes
## Genes: 4348, 5804, 5868, 5869
##
## CNS vs NSCLC: 0 common hyper genes
##
## CNS vs OVARIAN: 0 common hyper genes
##
## CNS vs RENAL: 6 common hyper genes
## Genes: 5804, 5812, 5917, 5941, 5977, 6053
##
## COLON vs LEUKEMIA: 3 common hyper genes
## Genes: 1346, 1697, 1871
##
## COLON vs MELANOMA: 1 common hyper genes
## Genes: 4574
##
## COLON vs NSCLC: 0 common hyper genes
##
## COLON vs OVARIAN: 0 common hyper genes
##
## COLON vs RENAL: 1 common hyper genes
## Genes: 282
##
## LEUKEMIA vs MELANOMA: 1 common hyper genes
## Genes: 5356
##
## LEUKEMIA vs NSCLC: 0 common hyper genes
##
## LEUKEMIA vs OVARIAN: 0 common hyper genes
##
## LEUKEMIA vs RENAL: 0 common hyper genes
##
## MELANOMA vs NSCLC: 0 common hyper genes
##
## MELANOMA vs OVARIAN: 0 common hyper genes
##
## MELANOMA vs RENAL: 8 common hyper genes
## Genes: 4259, 4260, 5804, 5884, 5886, 5887, 5983, 5984
##
## NSCLC vs OVARIAN: 0 common hyper genes
##
## NSCLC vs RENAL: 0 common hyper genes
##
## OVARIAN vs RENAL: 0 common hyper genes
## === Common HYPO-expressed genes between each cancer pair ===
for (pr in pairs) {
common_hypo <- intersect(de_results[[pr[1]]]$hypo,
de_results[[pr[2]]]$hypo)
cat(sprintf("%s vs %s: %d common hypo genes\n",
pr[1], pr[2], length(common_hypo)))
if (length(common_hypo) > 0 && length(common_hypo) <= 20) {
cat(" Genes:", paste(common_hypo, collapse = ", "), "\n")
} else if (length(common_hypo) > 20) {
cat(" First 20:", paste(head(common_hypo, 20), collapse = ", "), "...\n")
}
cat("\n")
}## BREAST vs CNS: 0 common hypo genes
##
## BREAST vs COLON: 0 common hypo genes
##
## BREAST vs LEUKEMIA: 0 common hypo genes
##
## BREAST vs MELANOMA: 0 common hypo genes
##
## BREAST vs NSCLC: 0 common hypo genes
##
## BREAST vs OVARIAN: 0 common hypo genes
##
## BREAST vs RENAL: 0 common hypo genes
##
## CNS vs COLON: 0 common hypo genes
##
## CNS vs LEUKEMIA: 0 common hypo genes
##
## CNS vs MELANOMA: 2 common hypo genes
## Genes: 257, 1013
##
## CNS vs NSCLC: 0 common hypo genes
##
## CNS vs OVARIAN: 0 common hypo genes
##
## CNS vs RENAL: 0 common hypo genes
##
## COLON vs LEUKEMIA: 22 common hypo genes
## First 20: 3936, 4011, 4289, 4330, 5676, 5718, 5732, 5803, 5804, 5810, 5848, 5898, 5913, 5956, 5961, 6072, 6262, 6277, 6414, 6416 ...
##
## COLON vs MELANOMA: 10 common hypo genes
## Genes: 5715, 5731, 5839, 5971, 5972, 6260, 6322, 6414, 6415, 6416
##
## COLON vs NSCLC: 0 common hypo genes
##
## COLON vs OVARIAN: 0 common hypo genes
##
## COLON vs RENAL: 0 common hypo genes
##
## LEUKEMIA vs MELANOMA: 15 common hypo genes
## Genes: 143, 245, 251, 252, 5555, 5556, 5557, 6081, 6369, 6391, 6392, 6393, 6399, 6414, 6416
##
## LEUKEMIA vs NSCLC: 0 common hypo genes
##
## LEUKEMIA vs OVARIAN: 0 common hypo genes
##
## LEUKEMIA vs RENAL: 1 common hypo genes
## Genes: 4216
##
## MELANOMA vs NSCLC: 0 common hypo genes
##
## MELANOMA vs OVARIAN: 0 common hypo genes
##
## MELANOMA vs RENAL: 1 common hypo genes
## Genes: 1841
##
## NSCLC vs OVARIAN: 0 common hypo genes
##
## NSCLC vs RENAL: 0 common hypo genes
##
## OVARIAN vs RENAL: 0 common hypo genes
## === Genes HYPER in one cancer and HYPO in another ===
for (pr in pairs) {
cross_1_hyper_2_hypo <- intersect(de_results[[pr[1]]]$hyper,
de_results[[pr[2]]]$hypo)
cross_2_hyper_1_hypo <- intersect(de_results[[pr[2]]]$hyper,
de_results[[pr[1]]]$hypo)
cat(sprintf("HYPER in %s / HYPO in %s: %d genes\n",
pr[1], pr[2], length(cross_1_hyper_2_hypo)))
if (length(cross_1_hyper_2_hypo) > 0 && length(cross_1_hyper_2_hypo) <= 15) {
cat(" Genes:", paste(cross_1_hyper_2_hypo, collapse = ", "), "\n")
}
cat(sprintf("HYPER in %s / HYPO in %s: %d genes\n",
pr[2], pr[1], length(cross_2_hyper_1_hypo)))
if (length(cross_2_hyper_1_hypo) > 0 && length(cross_2_hyper_1_hypo) <= 15) {
cat(" Genes:", paste(cross_2_hyper_1_hypo, collapse = ", "), "\n")
}
cat("\n")
}## HYPER in BREAST / HYPO in CNS: 0 genes
## HYPER in CNS / HYPO in BREAST: 0 genes
##
## HYPER in BREAST / HYPO in COLON: 1 genes
## Genes: 4105
## HYPER in COLON / HYPO in BREAST: 0 genes
##
## HYPER in BREAST / HYPO in LEUKEMIA: 1 genes
## Genes: 6019
## HYPER in LEUKEMIA / HYPO in BREAST: 0 genes
##
## HYPER in BREAST / HYPO in MELANOMA: 0 genes
## HYPER in MELANOMA / HYPO in BREAST: 0 genes
##
## HYPER in BREAST / HYPO in NSCLC: 0 genes
## HYPER in NSCLC / HYPO in BREAST: 0 genes
##
## HYPER in BREAST / HYPO in OVARIAN: 0 genes
## HYPER in OVARIAN / HYPO in BREAST: 0 genes
##
## HYPER in BREAST / HYPO in RENAL: 0 genes
## HYPER in RENAL / HYPO in BREAST: 0 genes
##
## HYPER in CNS / HYPO in COLON: 7 genes
## Genes: 4266, 5804, 5838, 5839, 5909, 5977, 6656
## HYPER in COLON / HYPO in CNS: 3 genes
## Genes: 257, 1834, 4574
##
## HYPER in CNS / HYPO in LEUKEMIA: 9 genes
## Genes: 5804, 5867, 5868, 5869, 5880, 5917, 5930, 6018, 6065
## HYPER in LEUKEMIA / HYPO in CNS: 0 genes
##
## HYPER in CNS / HYPO in MELANOMA: 4 genes
## Genes: 5694, 5746, 5839, 6388
## HYPER in MELANOMA / HYPO in CNS: 1 genes
## Genes: 4574
##
## HYPER in CNS / HYPO in NSCLC: 0 genes
## HYPER in NSCLC / HYPO in CNS: 0 genes
##
## HYPER in CNS / HYPO in OVARIAN: 0 genes
## HYPER in OVARIAN / HYPO in CNS: 0 genes
##
## HYPER in CNS / HYPO in RENAL: 0 genes
## HYPER in RENAL / HYPO in CNS: 0 genes
##
## HYPER in COLON / HYPO in LEUKEMIA: 5 genes
## Genes: 143, 245, 249, 251, 4066
## HYPER in LEUKEMIA / HYPO in COLON: 1 genes
## Genes: 924
##
## HYPER in COLON / HYPO in MELANOMA: 23 genes
## HYPER in MELANOMA / HYPO in COLON: 14 genes
## Genes: 4098, 4100, 4262, 4277, 4289, 4295, 4298, 4320, 4331, 4347, 4350, 4384, 5804, 5983
##
## HYPER in COLON / HYPO in NSCLC: 0 genes
## HYPER in NSCLC / HYPO in COLON: 0 genes
##
## HYPER in COLON / HYPO in OVARIAN: 0 genes
## HYPER in OVARIAN / HYPO in COLON: 0 genes
##
## HYPER in COLON / HYPO in RENAL: 3 genes
## Genes: 428, 1039, 2551
## HYPER in RENAL / HYPO in COLON: 32 genes
##
## HYPER in LEUKEMIA / HYPO in MELANOMA: 4 genes
## Genes: 1347, 1919, 2025, 2039
## HYPER in MELANOMA / HYPO in LEUKEMIA: 15 genes
## Genes: 315, 3874, 4067, 4287, 4288, 4289, 4317, 4380, 4448, 5804, 5868, 5869, 5872, 5886, 5887
##
## HYPER in LEUKEMIA / HYPO in NSCLC: 0 genes
## HYPER in NSCLC / HYPO in LEUKEMIA: 0 genes
##
## HYPER in LEUKEMIA / HYPO in OVARIAN: 0 genes
## HYPER in OVARIAN / HYPO in LEUKEMIA: 0 genes
##
## HYPER in LEUKEMIA / HYPO in RENAL: 10 genes
## Genes: 1689, 2051, 2246, 2247, 2255, 2256, 2723, 2724, 2822, 2828
## HYPER in RENAL / HYPO in LEUKEMIA: 33 genes
##
## HYPER in MELANOMA / HYPO in NSCLC: 0 genes
## HYPER in NSCLC / HYPO in MELANOMA: 0 genes
##
## HYPER in MELANOMA / HYPO in OVARIAN: 0 genes
## HYPER in OVARIAN / HYPO in MELANOMA: 0 genes
##
## HYPER in MELANOMA / HYPO in RENAL: 5 genes
## Genes: 814, 2243, 4059, 4062, 4179
## HYPER in RENAL / HYPO in MELANOMA: 8 genes
## Genes: 282, 287, 5795, 5970, 5971, 6081, 6348, 6383
##
## HYPER in NSCLC / HYPO in OVARIAN: 0 genes
## HYPER in OVARIAN / HYPO in NSCLC: 0 genes
##
## HYPER in NSCLC / HYPO in RENAL: 0 genes
## HYPER in RENAL / HYPO in NSCLC: 0 genes
##
## HYPER in OVARIAN / HYPO in RENAL: 0 genes
## HYPER in RENAL / HYPO in OVARIAN: 0 genes
library(randomForest)
library(dplyr)
library(readr)
diabetes <- read_csv("diabetes.csv")
# If running in the same directory:
# diabetes <- read.csv("diabetes.csv")
diabetes$Outcome <- factor(diabetes$Outcome, levels = c(0, 1),
labels = c("No", "Yes"))
cat("Dataset dimensions:", dim(diabetes), "\n")## Dataset dimensions: 768 9
## Outcome distribution:
##
## No Yes
## 500 268
n <- nrow(diabetes) # 768
n_half <- n / 2 # 384
train <- diabetes[1:n_half, ]
test <- diabetes[(n_half + 1):n, ]
cat("Training set rows:", nrow(train), "\n")## Training set rows: 384
## Test set rows: 384
##
## Call:
## glm(formula = Outcome ~ ., family = binomial, data = train)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -7.934945 0.991337 -8.004 1.20e-15 ***
## Pregnancies 0.109780 0.043305 2.535 0.01124 *
## Glucose 0.030267 0.005125 5.906 3.51e-09 ***
## BloodPressure -0.006948 0.007094 -0.979 0.32737
## SkinThickness -0.001239 0.009637 -0.129 0.89769
## Insulin -0.001402 0.001233 -1.137 0.25554
## BMI 0.085641 0.019520 4.387 1.15e-05 ***
## DiabetesPedigreeFunction 1.241454 0.404700 3.068 0.00216 **
## Age 0.011778 0.013390 0.880 0.37906
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 509.09 on 383 degrees of freedom
## Residual deviance: 381.62 on 375 degrees of freedom
## AIC: 399.62
##
## Number of Fisher Scoring iterations: 5
# Backward selection using step() with AIC; then manually verify all p < .05
reduced_model <- step(full_model, direction = "backward", trace = 0)
summary(reduced_model)##
## Call:
## glm(formula = Outcome ~ Pregnancies + Glucose + Insulin + BMI +
## DiabetesPedigreeFunction, family = binomial, data = train)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -8.086916 0.918221 -8.807 < 2e-16 ***
## Pregnancies 0.125171 0.037479 3.340 0.000839 ***
## Glucose 0.031368 0.004903 6.398 1.57e-10 ***
## Insulin -0.001587 0.001081 -1.469 0.141925
## BMI 0.081038 0.018361 4.414 1.02e-05 ***
## DiabetesPedigreeFunction 1.276991 0.399883 3.193 0.001406 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 509.09 on 383 degrees of freedom
## Residual deviance: 383.18 on 378 degrees of freedom
## AIC: 395.18
##
## Number of Fisher Scoring iterations: 5
# Confirm all remaining coefficients have p < .05
coef_pvals <- summary(reduced_model)$coefficients[-1, 4] # drop intercept
cat("\nCoefficient p-values in reduced model:\n")##
## Coefficient p-values in reduced model:
## Pregnancies Glucose Insulin
## 0.0008 0.0000 0.1419
## BMI DiabetesPedigreeFunction
## 0.0000 0.0014
##
## All p < 0.05? FALSE
# Predicted probabilities
prob_full <- predict(full_model, newdata = test, type = "response")
prob_reduced <- predict(reduced_model, newdata = test, type = "response")
# Convert to class predictions (threshold = 0.5)
pred_full <- ifelse(prob_full > 0.5, "Yes", "No")
pred_reduced <- ifelse(prob_reduced > 0.5, "Yes", "No")
cat("Full model — first 10 predicted probabilities:\n")## Full model — first 10 predicted probabilities:
## 1 2 3 4 5 6 7 8 9 10
## 0.113 0.096 0.395 0.479 0.489 0.273 0.144 0.854 0.105 0.152
##
## Reduced model — first 10 predicted probabilities:
## 1 2 3 4 5 6 7 8 9 10
## 0.123 0.096 0.410 0.506 0.444 0.283 0.122 0.868 0.110 0.154
set.seed(42)
rf_model <- randomForest(Outcome ~ ., data = train, ntree = 500, importance = TRUE)
print(rf_model)##
## Call:
## randomForest(formula = Outcome ~ ., data = train, ntree = 500, importance = TRUE)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 2
##
## OOB estimate of error rate: 29.17%
## Confusion matrix:
## No Yes class.error
## No 194 45 0.1882845
## Yes 67 78 0.4620690
##
## Variable importance:
## No Yes MeanDecreaseAccuracy
## Pregnancies 7.92522069 -0.22526734 6.35692199
## Glucose 18.95071214 19.96816156 26.36113159
## BloodPressure 0.33666155 -2.28261222 -1.15236731
## SkinThickness 0.09690715 -0.01033746 -0.01073913
## Insulin 5.32975209 2.17308229 5.59784390
## BMI 7.13013589 9.77933833 11.65583495
## DiabetesPedigreeFunction 5.00059550 6.29071954 7.64235630
## Age 6.47750067 8.53609966 10.85932672
## MeanDecreaseGini
## Pregnancies 15.38455
## Glucose 41.27133
## BloodPressure 16.74874
## SkinThickness 13.15769
## Insulin 14.07911
## BMI 29.48984
## DiabetesPedigreeFunction 24.93113
## Age 24.07618
# Helper function to extract confusion matrix stats
conf_stats <- function(predicted, actual) {
predicted <- factor(predicted, levels = c("No", "Yes"))
actual <- factor(actual, levels = c("No", "Yes"))
cm <- table(Predicted = predicted, Actual = actual)
TP <- cm["Yes", "Yes"] # correct positives
FP <- cm["Yes", "No"] # false positives
TN <- cm["No", "No"] # correct negatives
FN <- cm["No", "Yes"] # false negatives
list(
ConfusionMatrix = cm,
CorrectPositives = TP,
FalsePositives = FP,
CorrectNegatives = TN,
FalseNegatives = FN,
Accuracy = round((TP + TN) / (TP + FP + TN + FN), 4),
Sensitivity = round(TP / (TP + FN), 4), # recall / TPR
Specificity = round(TN / (TN + FP), 4)
)
}
actual_labels <- as.character(test$Outcome)
stats_full <- conf_stats(pred_full, actual_labels)
stats_reduced <- conf_stats(pred_reduced, actual_labels)
stats_rf <- conf_stats(as.character(pred_rf), actual_labels)
# Print confusion matrices
cat("=== Full Logistic Regression ===\n"); print(stats_full$ConfusionMatrix)## === Full Logistic Regression ===
## Actual
## Predicted No Yes
## No 237 51
## Yes 24 72
##
## === Reduced Logistic Regression ===
## Actual
## Predicted No Yes
## No 235 52
## Yes 26 71
##
## === Random Forest ===
## Actual
## Predicted No Yes
## No 231 51
## Yes 30 72
# Summary comparison table
comparison <- data.frame(
Model = c("Full Logistic", "Reduced Logistic", "Random Forest"),
Correct_Pos = c(stats_full$CorrectPositives, stats_reduced$CorrectPositives, stats_rf$CorrectPositives),
False_Pos = c(stats_full$FalsePositives, stats_reduced$FalsePositives, stats_rf$FalsePositives),
Correct_Neg = c(stats_full$CorrectNegatives, stats_reduced$CorrectNegatives, stats_rf$CorrectNegatives),
False_Neg = c(stats_full$FalseNegatives, stats_reduced$FalseNegatives, stats_rf$FalseNegatives),
Accuracy = c(stats_full$Accuracy, stats_reduced$Accuracy, stats_rf$Accuracy),
Sensitivity = c(stats_full$Sensitivity, stats_reduced$Sensitivity, stats_rf$Sensitivity),
Specificity = c(stats_full$Specificity, stats_reduced$Specificity, stats_rf$Specificity)
)
print(comparison)## Model Correct_Pos False_Pos Correct_Neg False_Neg Accuracy
## 1 Full Logistic 72 24 237 51 0.8047
## 2 Reduced Logistic 71 26 235 52 0.7969
## 3 Random Forest 72 30 231 51 0.7891
## Sensitivity Specificity
## 1 0.5854 0.9080
## 2 0.5772 0.9004
## 3 0.5854 0.8851
(Fill in your observed numbers after running the code above.)
Discussion:
Typically, Random Forest outperforms logistic regression on accuracy and sensitivity for this dataset, while the reduced logistic model often matches or exceeds the full model because removing noise predictors reduces overfitting.
set.seed(123)
run_split <- function(seed_val) {
set.seed(seed_val)
idx <- sample(1:768, 384, replace = FALSE)
tr <- diabetes[idx, ]
te <- diabetes[-idx, ]
# Full logistic
fm <- glm(Outcome ~ ., data = tr, family = binomial)
rm_ <- step(fm, direction = "backward", trace = 0)
rf_ <- randomForest(Outcome ~ ., data = tr, ntree = 500)
p_full <- ifelse(predict(fm, newdata = te, type = "response") > 0.5, "Yes", "No")
p_red <- ifelse(predict(rm_, newdata = te, type = "response") > 0.5, "Yes", "No")
p_rf <- as.character(predict(rf_, newdata = te))
actual <- as.character(te$Outcome)
data.frame(
Split = seed_val,
Acc_Full = mean(p_full == actual),
Acc_Reduced = mean(p_red == actual),
Acc_RF = mean(p_rf == actual),
Sens_Full = sum(p_full == "Yes" & actual == "Yes") / sum(actual == "Yes"),
Sens_Reduced = sum(p_red == "Yes" & actual == "Yes") / sum(actual == "Yes"),
Sens_RF = sum(p_rf == "Yes" & actual == "Yes") / sum(actual == "Yes")
)
}
# Run two additional random splits
results_list <- lapply(c(2024, 2025), run_split)
results_df <- do.call(rbind, results_list)
print(round(results_df, 4))## Split Acc_Full Acc_Reduced Acc_RF Sens_Full Sens_Reduced Sens_RF
## 1 2024 0.7969 0.7969 0.7578 0.5915 0.5775 0.5282
## 2 2025 0.7891 0.7812 0.7448 0.6296 0.6222 0.6667
Conclusions from repeated splits:
Comparing results across the fixed first-half split (Part a) and the two random splits:
library(randomForest) library(readr)
diabetes <- read_csv(“diabetes.csv”)