Author: C McGinnis
Date: November 2024
Course: Deep Learning
GitHub Repository: [Your GitHub URL]
This project addresses the Kaggle competition “Histopathologic Cancer Detection”, which focuses on identifying metastatic cancer in small image patches taken from larger digital pathology scans of lymph node sections.
Histopathology is the microscopic examination of tissue to study the manifestations of disease. In cancer diagnosis, pathologists examine tissue samples under a microscope to identify the presence of cancer cells. This process is critical for:
Manual examination of histopathology slides is:
Deep learning solutions can assist pathologists by providing rapid screening, highlighting regions of interest, and offering a “second opinion” to reduce diagnostic errors.
| Aspect | Description |
|---|---|
| Task Type | Binary Image Classification |
| Input | 96×96 pixel RGB image patches (.tif format) |
| Output | Probability of containing metastatic tissue (0 to 1) |
| Positive Label (1) | Center 32×32 pixel region contains at least one tumor pixel |
| Negative Label (0) | No tumor tissue in the center region |
| Evaluation Metric | Area Under the ROC Curve (AUC-ROC) |
Important Note: The label is determined by the center 32×32 pixel region only. Tumor tissue in the outer border of the 96×96 patch does not affect the label.
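The labeling rule above can be illustrated with a small NumPy sketch. It assumes a hypothetical per-pixel tumor mask for each patch (the Kaggle data ships only the final labels, not masks), and the function name `center_region_label` is made up for this illustration.

```python
import numpy as np

PATCH_SIZE = 96   # full patch width/height in pixels
CENTER_SIZE = 32  # width/height of the labeled center region


def center_region_label(tumor_mask: np.ndarray) -> int:
    """Return 1 if any tumor pixel lies in the center 32x32 region, else 0.

    `tumor_mask` is a hypothetical (96, 96) boolean array used only for
    illustration; the competition data does not include such masks.
    """
    if tumor_mask.shape != (PATCH_SIZE, PATCH_SIZE):
        raise ValueError(f"expected a {PATCH_SIZE}x{PATCH_SIZE} mask")
    start = (PATCH_SIZE - CENTER_SIZE) // 2  # first row/column of the center region
    stop = start + CENTER_SIZE               # one past the last row/column
    return int(tumor_mask[start:stop, start:stop].any())


# Tumor pixel only in the outer border: the label stays negative.
border_mask = np.zeros((PATCH_SIZE, PATCH_SIZE), dtype=bool)
border_mask[5, 5] = True
border_label = center_region_label(border_mask)

# Tumor pixel inside the center region: the label becomes positive.
center_mask = np.zeros((PATCH_SIZE, PATCH_SIZE), dtype=bool)
center_mask[40, 50] = True
center_label = center_region_label(center_mask)
```

One practical consequence is that a patch can visibly contain tumor tissue near its edge and still carry label 0, so some "hard" negatives are expected.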
We will use Transfer Learning with a pre-trained ResNet-18 convolutional neural network. This approach is effective because:
The dataset is a modified version of the PatchCamelyon (PCam) benchmark dataset, derived from the Camelyon16 challenge for detecting metastases in lymph node sections.
| Component | Description |
|---|---|
| Training Images | 220,025 images (96×96 RGB, .tif format) |
| Test Images | 57,458 images for Kaggle submission |
| Labels File | train_labels.csv with image IDs and binary labels |
| Total Size | Approximately 6-7 GB |
histopathologic-cancer-detection/
├── train/ # Training images (220,025 .tif files)
├── test/ # Test images (57,458 .tif files)
├── train_labels.csv # Training labels (id, label)
└── sample_submission.csv # Submission format template
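Before loading anything, it can be worth checking that the folder matches the layout above. The following is a minimal sketch using only the standard library; the helper name `check_dataset_layout` and the shape of the returned report are assumptions made for this example, not part of the Kaggle tooling.

```python
import csv
from pathlib import Path


def check_dataset_layout(data_dir):
    """Compare the .tif files in train/ against the ids in train_labels.csv."""
    root = Path(data_dir)
    # File stems (names without ".tif") are the image ids.
    ids_on_disk = {p.stem for p in (root / "train").glob("*.tif")}
    with open(root / "train_labels.csv", newline="") as f:
        labeled_ids = {row["id"] for row in csv.DictReader(f)}
    return {
        "n_train_files": len(ids_on_disk),
        "n_labeled_ids": len(labeled_ids),
        "missing_files": sorted(labeled_ids - ids_on_disk),    # labeled but no image
        "unlabeled_files": sorted(ids_on_disk - labeled_ids),  # image but no label
    }
```

Calling `check_dataset_layout(DATA_DIR)` on a complete download should report equal counts and two empty lists.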
In this section, we import the Python libraries needed for:
- Data handling (numpy, pandas)
- Visualization (matplotlib, seaborn)
- Image handling and transforms (PIL, torchvision.transforms)
- Model building (torch, torchvision.models)
- Metrics (sklearn.metrics)

We also:
- Set a random seed for reproducibility
- Detect whether a GPU is available
- Configure PyTorch for deterministic behavior (as much as possible)
# General utilities
import os
import random
import warnings
# Data handling
import numpy as np
import pandas as pd
# Visualization
import matplotlib.pyplot as plt
import seaborn as sns
# Image handling
from PIL import Image
# PyTorch core
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader
# TorchVision (for transforms + pretrained models)
from torchvision import transforms, models
# Metrics
from sklearn.metrics import roc_auc_score, confusion_matrix, classification_report
# Silence unnecessary warnings (optional)
warnings.filterwarnings("ignore")
# -------------------------
# Reproducibility settings
# -------------------------
SEED = 42
random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)
torch.cuda.manual_seed_all(SEED)
# For (more) deterministic behavior
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False
from tqdm.auto import tqdm
# -------------------------
# Device configuration
# -------------------------
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")
Using device: cpu
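The seeding steps above can also be wrapped in one helper that returns a dedicated `torch.Generator`. This is a sketch rather than the notebook's own setup: the name `seed_everything` is an assumption, and passing the generator to `DataLoader(..., generator=g)` is one optional way to keep shuffling independent of other random calls.

```python
import random

import numpy as np
import torch


def seed_everything(seed: int) -> torch.Generator:
    """Seed Python, NumPy, and PyTorch, and return a seeded generator."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    if torch.cuda.is_available():
        torch.cuda.manual_seed_all(seed)
    generator = torch.Generator()
    generator.manual_seed(seed)
    return generator


# Two generators created from the same seed produce the same permutation.
g1 = seed_everything(42)
first_order = torch.randperm(10, generator=g1).tolist()
g2 = seed_everything(42)
second_order = torch.randperm(10, generator=g2).tolist()
```

Even with these settings, some GPU operations can remain non-deterministic, which is why the text above says "as much as possible".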
In this section, we:
- Define the paths to the dataset directories and the labels file
- Load train_labels.csv into a pandas DataFrame

If you are running this notebook on a different machine or platform (e.g., Kaggle, Colab), you may need to update the DATA_DIR path accordingly.
# Base directory where the Kaggle dataset is stored
# Update this path if your dataset is in a different location
DATA_DIR = "/Users/cynthiamcginnis/Downloads/histopathologic-cancer-detection"
TRAIN_DIR = os.path.join(DATA_DIR, "train")
TEST_DIR = os.path.join(DATA_DIR, "test")
LABELS_PATH = os.path.join(DATA_DIR, "train_labels.csv")
print("DATA_DIR :", DATA_DIR)
print("TRAIN_DIR:", TRAIN_DIR)
print("TEST_DIR :", TEST_DIR)
print("LABELS :", LABELS_PATH)
# Load labels
labels_df = pd.read_csv(LABELS_PATH)
print("\nLabels DataFrame shape:", labels_df.shape)
labels_df.head()
DATA_DIR : /Users/cynthiamcginnis/Downloads/histopathologic-cancer-detection
TRAIN_DIR: /Users/cynthiamcginnis/Downloads/histopathologic-cancer-detection/train
TEST_DIR : /Users/cynthiamcginnis/Downloads/histopathologic-cancer-detection/test
LABELS : /Users/cynthiamcginnis/Downloads/histopathologic-cancer-detection/train_labels.csv
Labels DataFrame shape: (220025, 2)
| | id | label |
|---|---|---|
| 0 | f38a6374c348f90b587e046aac6079959adf3835 | 0 |
| 1 | c18f2d887b7ae4f6742ee445113fa1aef383ed77 | 1 |
| 2 | 755db6279dae599ebb4d39a9123cce439965282d | 0 |
| 3 | bc3f0c64fb968ff4a8bd33af6971ecae77c75e08 | 0 |
| 4 | 068aba587a4950175d04c680d38943fd488d6a9d | 0 |
DEBUG = True
DEBUG_SIZE = 5000
if DEBUG:
    labels_df_small = labels_df.sample(n=DEBUG_SIZE, random_state=SEED).reset_index(drop=True)
else:
    labels_df_small = labels_df
print("Using", len(labels_df_small), "samples.")
Using 5000 samples.
from sklearn.model_selection import train_test_split
train_df, val_df = train_test_split(
labels_df_small,
test_size=0.2, # 80% train, 20% val
stratify=labels_df_small["label"],
random_state=SEED
)
print("Train size:", len(train_df))
print("Val size :", len(val_df))
print("\nTrain label distribution:")
print(train_df["label"].value_counts(normalize=True))
print("\nVal label distribution:")
print(val_df["label"].value_counts(normalize=True))
Train size: 4000
Val size : 1000
Train label distribution:
label
0 0.59425
1 0.40575
Name: proportion, dtype: float64
Val label distribution:
label
0 0.594
1 0.406
Name: proportion, dtype: float64
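The near-identical class proportions in the two splits come from the `stratify` argument. The idea can be sketched in plain Python as below; this is a simplified illustration of stratified sampling, not scikit-learn's implementation, and `stratified_split` is a name invented for the example.

```python
import random
from collections import Counter, defaultdict


def stratified_split(labels, test_size=0.2, seed=42):
    """Return (train_idx, val_idx) with roughly equal class ratios in both parts."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, val_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)                      # shuffle within each class
        n_val = round(len(indices) * test_size)   # same fraction from every class
        val_idx.extend(indices[:n_val])
        train_idx.extend(indices[n_val:])
    return sorted(train_idx), sorted(val_idx)


# Toy data with a 60/40 class ratio, similar to this dataset.
toy_labels = [0] * 60 + [1] * 40
toy_train_idx, toy_val_idx = stratified_split(toy_labels, test_size=0.2)
toy_val_counts = Counter(toy_labels[i] for i in toy_val_idx)
```

Because each class is sampled separately, the validation part of the toy example keeps the 60/40 ratio of the full list.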
We define a custom torch.utils.data.Dataset to handle:
- Reading image IDs and labels from train_labels.csv
- Loading each .tif image by its id

This class is flexible enough to be used for both:
- Training/validation (with labels), and
- Test set (without labels),

by toggling a has_labels flag.
class HistopathDataset(Dataset):
    """
    Custom Dataset for the Histopathologic Cancer Detection dataset.

    Args:
        df (pd.DataFrame): DataFrame containing at least an 'id' column.
            If labels are available, it should also contain a 'label' column.
        img_dir (str): Directory where the .tif image files are stored.
        transform (callable, optional): Optional transform to be applied on a sample.
        has_labels (bool): Set to True if df contains labels. For test data, set to False.
    """

    def __init__(self, df, img_dir, transform=None, has_labels=True):
        self.df = df.reset_index(drop=True)
        self.img_dir = img_dir
        self.transform = transform
        self.has_labels = has_labels

        # Sanity checks
        if self.has_labels and "label" not in self.df.columns:
            raise ValueError("has_labels=True but no 'label' column found in DataFrame.")
        if "id" not in self.df.columns:
            raise ValueError("DataFrame must contain an 'id' column.")

    def __len__(self):
        return len(self.df)

    def __getitem__(self, idx):
        # Get image ID
        img_id = self.df.loc[idx, "id"]
        img_path = os.path.join(self.img_dir, f"{img_id}.tif")

        # Load image
        image = Image.open(img_path)

        # Convert to RGB if needed (sometimes medical images can be single-channel)
        if image.mode != "RGB":
            image = image.convert("RGB")

        # Apply transforms
        if self.transform is not None:
            image = self.transform(image)

        if self.has_labels:
            label = int(self.df.loc[idx, "label"])
            return image, label
        else:
            # For test set (no labels)
            return image, img_id  # return id so we can match predictions later
# -------------------------
# Basic transform (no augmentation yet)
# -------------------------
# For ResNet (pretrained on ImageNet), we typically normalize images with ImageNet stats.
basic_transform = transforms.Compose([
transforms.ToTensor(),
transforms.Normalize(
mean=[0.485, 0.456, 0.406], # ImageNet mean
std=[0.229, 0.224, 0.225] # ImageNet std
)
])
# Example: create a training dataset instance
train_dataset = HistopathDataset(
df=labels_df,
img_dir=TRAIN_DIR,
transform=basic_transform,
has_labels=True
)
print("Number of training samples:", len(train_dataset))
# Peek at one sample
sample_img, sample_label = train_dataset[0]
print("Single sample shape:", sample_img.shape)
print("Single sample label:", sample_label)
Number of training samples: 220025
Single sample shape: torch.Size([3, 96, 96])
Single sample label: 0
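The effect of the `ToTensor` and `Normalize` steps in the basic transform can be reproduced by hand. The NumPy sketch below mimics the arithmetic under the same ImageNet statistics; the helper name `to_normalized_chw` is an assumption for illustration and is not a torchvision function.

```python
import numpy as np

IMAGENET_MEAN = np.array([0.485, 0.456, 0.406], dtype=np.float32)
IMAGENET_STD = np.array([0.229, 0.224, 0.225], dtype=np.float32)


def to_normalized_chw(image_hwc_uint8: np.ndarray) -> np.ndarray:
    """Mimic ToTensor + Normalize for an (H, W, 3) uint8 RGB array."""
    scaled = image_hwc_uint8.astype(np.float32) / 255.0   # [0, 255] -> [0, 1]
    normalized = (scaled - IMAGENET_MEAN) / IMAGENET_STD  # per-channel standardization
    return normalized.transpose(2, 0, 1)                  # HWC -> CHW, as PyTorch expects


# A pure-white 96x96 patch: every pixel becomes (1 - mean) / std per channel.
white_patch = np.full((96, 96, 3), 255, dtype=np.uint8)
normalized_patch = to_normalized_chw(white_patch)
```

For the red channel of a white pixel this gives (1 − 0.485) / 0.229 ≈ 2.25, so normalized values are not confined to the 0 to 1 range.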
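In this section, we perform basic exploratory data analysis to understand the dataset before training a model. Specifically, we will:
- Examine the label distribution.
- Inspect the sizes and color modes of a random sample of images.
- Display example patches from each class.
- Plot the pixel intensity distribution for a sample of images.

This helps us:
- Check for class imbalance.
- Confirm that the sampled images have the expected dimensions (96×96, 3 channels).
- Build intuition about what tumor and non-tumor patches look like.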
# Basic label distribution
label_counts = labels_df["label"].value_counts().sort_index()
label_percent = labels_df["label"].value_counts(normalize=True).sort_index() * 100
print("Label counts:\n", label_counts)
print("\nLabel percentage:\n", label_percent.round(2))
# Plot label distribution
plt.figure(figsize=(5, 4))
sns.barplot(x=label_counts.index, y=label_counts.values)
plt.xticks([0, 1], ["Non-tumor (0)", "Tumor (1)"])
plt.xlabel("Label")
plt.ylabel("Count")
plt.title("Label Distribution")
plt.tight_layout()
plt.show()
Label counts:
label
0 130908
1 89117
Name: count, dtype: int64
Label percentage:
label
0 59.5
1 40.5
Name: proportion, dtype: float64
[Figure: bar chart "Label Distribution" showing counts for Non-tumor (0) and Tumor (1)]
from collections import Counter
def inspect_random_images(df, img_dir, n_samples=500):
    sizes = []
    modes = []
    sample_df = df.sample(n=min(n_samples, len(df)), random_state=SEED)
    for img_id in sample_df["id"]:
        img_path = os.path.join(img_dir, f"{img_id}.tif")
        with Image.open(img_path) as img:
            sizes.append(img.size)  # (width, height)
            modes.append(img.mode)
    size_counts = Counter(sizes)
    mode_counts = Counter(modes)
    print("Most common sizes:", size_counts.most_common(5))
    print("Image modes:", mode_counts)

inspect_random_images(labels_df, TRAIN_DIR, n_samples=500)
Most common sizes: [((96, 96), 500)]
Image modes: Counter({'RGB': 500})
def show_image_grid(df, img_dir, label, n=9, n_cols=3):
    """
    Show a grid of random images for a given label (0 or 1).
    """
    subset = df[df["label"] == label].sample(n=n, random_state=SEED)
    n_rows = int(np.ceil(n / n_cols))
    plt.figure(figsize=(3 * n_cols, 3 * n_rows))
    for i, img_id in enumerate(subset["id"].values):
        img_path = os.path.join(img_dir, f"{img_id}.tif")
        img = Image.open(img_path)
        plt.subplot(n_rows, n_cols, i + 1)
        plt.imshow(img)
        plt.axis("off")
    class_name = "Tumor (1)" if label == 1 else "Non-tumor (0)"
    plt.suptitle(f"Random examples: {class_name}", fontsize=14)
    plt.tight_layout()
    plt.show()
# Show non-tumor examples
show_image_grid(labels_df, TRAIN_DIR, label=0, n=9, n_cols=3)
# Show tumor examples
show_image_grid(labels_df, TRAIN_DIR, label=1, n=9, n_cols=3)
[Figure: 3×3 grid "Random examples: Non-tumor (0)"]
[Figure: 3×3 grid "Random examples: Tumor (1)"]
def plot_pixel_histogram(df, img_dir, n_samples=200):
    sample_df = df.sample(n=min(n_samples, len(df)), random_state=SEED)
    pixels = []
    for img_id in sample_df["id"]:
        img_path = os.path.join(img_dir, f"{img_id}.tif")
        img = Image.open(img_path).convert("RGB")
        arr = np.array(img) / 255.0          # scale to 0–1
        pixels.append(arr.reshape(-1, 3))    # flatten H×W×C -> (N, 3)
    pixels = np.vstack(pixels)

    plt.figure(figsize=(8, 4))
    plt.hist(pixels[:, 0], bins=50, alpha=0.5, label="R")
    plt.hist(pixels[:, 1], bins=50, alpha=0.5, label="G")
    plt.hist(pixels[:, 2], bins=50, alpha=0.5, label="B")
    plt.legend()
    plt.title("Pixel Intensity Distribution (Sample)")
    plt.xlabel("Intensity (0–1)")
    plt.ylabel("Frequency")
    plt.tight_layout()
    plt.show()

plot_pixel_histogram(labels_df, TRAIN_DIR, n_samples=200)
[Figure: histogram "Pixel Intensity Distribution (Sample)" for the R, G, and B channels]
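We now:
- Create a debug subset and a stratified train/validation split.
- Define the image transforms and augmentation.
- Define the HistopathDataset.
- Build the training and validation DataLoaders.

# 4.1 Debug subset + train/validation split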
DEBUG = True
DEBUG_SIZE = 5000 # number of samples to use while prototyping
if DEBUG:
    labels_df_small = labels_df.sample(n=DEBUG_SIZE, random_state=SEED).reset_index(drop=True)
else:
    labels_df_small = labels_df
print("Using", len(labels_df_small), "samples.")
train_df, val_df = train_test_split(
labels_df_small,
test_size=0.2,
stratify=labels_df_small["label"],
random_state=SEED
)
print("Train size:", len(train_df))
print("Val size :", len(val_df))
print("\nTrain label distribution:")
print(train_df["label"].value_counts(normalize=True))
print("\nVal label distribution:")
print(val_df["label"].value_counts(normalize=True))
Using 5000 samples.
Train size: 4000
Val size : 1000
Train label distribution:
label
0 0.59425
1 0.40575
Name: proportion, dtype: float64
Val label distribution:
label
0 0.594
1 0.406
Name: proportion, dtype: float64
# 4.2 Image transforms & augmentation
IMAGENET_MEAN = [0.485, 0.456, 0.406]
IMAGENET_STD = [0.229, 0.224, 0.225]
train_transform = transforms.Compose([
transforms.RandomHorizontalFlip(p=0.5),
transforms.RandomVerticalFlip(p=0.5),
transforms.RandomRotation(degrees=15),
transforms.ToTensor(),
transforms.Normalize(mean=IMAGENET_MEAN, std=IMAGENET_STD),
])
val_transform = transforms.Compose([
transforms.ToTensor(),
transforms.Normalize(mean=IMAGENET_MEAN, std=IMAGENET_STD),
])
# 4.3 Custom Dataset class
class HistopathDataset(Dataset):
    """
    Dataset for Histopathologic Cancer Detection.
    Expects a DataFrame with 'id' and (optionally) 'label'.
    """

    def __init__(self, df, img_dir, transform=None, has_labels=True):
        self.df = df.reset_index(drop=True)
        self.img_dir = img_dir
        self.transform = transform
        self.has_labels = has_labels

    def __len__(self):
        return len(self.df)

    def __getitem__(self, idx):
        img_id = self.df.loc[idx, "id"]
        img_path = os.path.join(self.img_dir, f"{img_id}.tif")
        image = Image.open(img_path)
        if image.mode != "RGB":
            image = image.convert("RGB")
        if self.transform is not None:
            image = self.transform(image)
        if self.has_labels:
            label = int(self.df.loc[idx, "label"])
            return image, label
        else:
            return image, img_id
# 4.4 Datasets & DataLoaders
BATCH_SIZE = 64
NUM_WORKERS = 0 # important for Jupyter / Anaconda to avoid multiprocessing issues
train_dataset = HistopathDataset(
df=train_df,
img_dir=TRAIN_DIR,
transform=train_transform,
has_labels=True
)
val_dataset = HistopathDataset(
df=val_df,
img_dir=TRAIN_DIR,
transform=val_transform,
has_labels=True
)
train_loader = DataLoader(
train_dataset,
batch_size=BATCH_SIZE,
shuffle=True,
num_workers=NUM_WORKERS,
pin_memory=torch.cuda.is_available()
)
val_loader = DataLoader(
val_dataset,
batch_size=BATCH_SIZE,
shuffle=False,
num_workers=NUM_WORKERS,
pin_memory=torch.cuda.is_available()
)
print("Train batches:", len(train_loader))
print("Val batches :", len(val_loader))
# sanity check
images, labels = next(iter(train_loader))
print("One batch images:", images.shape)
print("One batch labels:", labels.shape)
Train batches: 63
Val batches : 16
One batch images: torch.Size([64, 3, 96, 96])
One batch labels: torch.Size([64])
We use transfer learning with a pre-trained ResNet-18:
# 5.1 ResNet-18 model
from torchvision.models import resnet18, ResNet18_Weights
def create_resnet18_model(freeze_features=True):
    weights = ResNet18_Weights.DEFAULT
    model = resnet18(weights=weights)
    if freeze_features:
        # Freeze every pretrained parameter; only the new head below is trained
        for p in model.parameters():
            p.requires_grad = False
    in_features = model.fc.in_features
    model.fc = nn.Linear(in_features, 1)  # one logit
    return model
model = create_resnet18_model(freeze_features=True).to(device)
print("Model created on device:", device)
Model created on device: cpu
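A quick way to confirm that freezing worked is to count trainable versus total parameters. The sketch below uses a tiny stand-in network rather than the real ResNet-18, so it runs without downloading weights; `count_parameters` and `toy_model` are names invented for this example.

```python
import torch.nn as nn


def count_parameters(model: nn.Module):
    """Return (trainable, total) parameter counts for a model."""
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    total = sum(p.numel() for p in model.parameters())
    return trainable, total


# Hypothetical stand-in for a pretrained backbone (not ResNet-18 itself).
backbone = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=3),  # 3*8*3*3 weights + 8 biases = 224 parameters
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
)
for p in backbone.parameters():
    p.requires_grad = False          # freeze the "pretrained" part

head = nn.Linear(8, 1)               # new head: 8 weights + 1 bias = 9 parameters
toy_model = nn.Sequential(backbone, head)
toy_trainable, toy_total = count_parameters(toy_model)
```

Applied to the model created above with `freeze_features=True`, the same helper should report 513 trainable parameters (512 weights plus one bias in the new `fc` layer).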
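We now define:
- The loss function (BCEWithLogitsLoss)
- The optimizer (Adam) and the learning-rate scheduler (ReduceLROnPlateau)
- The training loop, which tracks loss and validation AUC per epoch

# 6.1 Loss, optimizer, scheduler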
criterion = nn.BCEWithLogitsLoss()
optimizer = optim.Adam(
[p for p in model.parameters() if p.requires_grad],
lr=1e-4
)
# Note: the `verbose` argument is deprecated in recent PyTorch releases, so it is omitted here
scheduler = optim.lr_scheduler.ReduceLROnPlateau(
    optimizer,
    mode="min",
    factor=0.5,
    patience=2
)
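The loss applies the sigmoid and the binary cross-entropy in one numerically stable step, which is why the model outputs a raw logit rather than a probability. A plain-Python sketch of the per-sample computation follows; it uses a standard stable formulation that is mathematically equivalent to the definition, and it is not copied from PyTorch's source.

```python
import math


def bce_with_logits(logit: float, target: float) -> float:
    """Numerically stable binary cross-entropy computed from a raw logit."""
    # Equivalent to -[t*log(sigmoid(x)) + (1-t)*log(1-sigmoid(x))],
    # rewritten so that exp() never receives a large positive argument.
    return max(logit, 0.0) - logit * target + math.log1p(math.exp(-abs(logit)))


def naive_bce(logit: float, target: float) -> float:
    """Direct definition; fine for moderate logits, unstable for extreme ones."""
    p = 1.0 / (1.0 + math.exp(-logit))
    return -(target * math.log(p) + (1.0 - target) * math.log(1.0 - p))


loss_at_zero = bce_with_logits(0.0, 1.0)      # an undecided logit costs log(2)
loss_extreme = bce_with_logits(1000.0, 0.0)   # confident and wrong: about 1000
```

With a very large logit the naive version would evaluate log(0) and fail, while the stable version returns a finite loss.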
# 6.2 Training loop with AUC and tqdm progress bars
import copy

def train_model(model, criterion, optimizer, scheduler,
                train_loader, val_loader, num_epochs=5, device=device):
    history = {"train_loss": [], "val_loss": [], "val_auc": []}
    best_auc = 0.0
    best_state_dict = None

    for epoch in range(1, num_epochs + 1):
        print(f"\nEpoch {epoch}/{num_epochs}")
        print("-" * 30)

        # ----- TRAIN -----
        model.train()
        running_train_loss = 0.0
        for images, labels in tqdm(train_loader, desc=f"Train epoch {epoch}", leave=False):
            images = images.to(device)
            labels = labels.to(device).float().unsqueeze(1)  # [B] -> [B, 1] to match the logits
            outputs = model(images)
            loss = criterion(outputs, labels)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            running_train_loss += loss.item() * images.size(0)
        epoch_train_loss = running_train_loss / len(train_loader.dataset)

        # ----- VALIDATE -----
        model.eval()
        running_val_loss = 0.0
        all_probs, all_labels = [], []
        with torch.no_grad():
            for images, labels in tqdm(val_loader, desc=f"Val epoch {epoch}", leave=False):
                images = images.to(device)
                labels = labels.to(device).float().unsqueeze(1)
                outputs = model(images)
                loss = criterion(outputs, labels)
                running_val_loss += loss.item() * images.size(0)
                probs = torch.sigmoid(outputs).cpu().numpy().ravel()
                all_probs.extend(probs)
                all_labels.extend(labels.cpu().numpy().ravel())
        epoch_val_loss = running_val_loss / len(val_loader.dataset)
        epoch_val_auc = roc_auc_score(all_labels, all_probs)

        if scheduler is not None:
            scheduler.step(epoch_val_loss)

        history["train_loss"].append(epoch_train_loss)
        history["val_loss"].append(epoch_val_loss)
        history["val_auc"].append(epoch_val_auc)
        print(f"Train Loss: {epoch_train_loss:.4f} | "
              f"Val Loss: {epoch_val_loss:.4f} | "
              f"Val AUC: {epoch_val_auc:.4f}")

        if epoch_val_auc > best_auc:
            best_auc = epoch_val_auc
            # state_dict() returns references to the live tensors, so take a deep copy;
            # otherwise later epochs would silently overwrite the "best" weights
            best_state_dict = copy.deepcopy(model.state_dict())
            print(f"🔥 New best model with AUC: {best_auc:.4f} (saving weights)")

    if best_state_dict is not None:
        model.load_state_dict(best_state_dict)
        print(f"\nTraining complete. Best Val AUC: {best_auc:.4f}")
    else:
        print("\nTraining complete, but no valid AUC was computed.")
    return model, history
# 6.3 Run a short debug training
NUM_EPOCHS = 5 # keep small while testing
model, history = train_model(
model=model,
criterion=criterion,
optimizer=optimizer,
scheduler=scheduler,
train_loader=train_loader,
val_loader=val_loader,
num_epochs=NUM_EPOCHS,
device=device
)
Epoch 1/5
------------------------------
Train epoch 1: 0%| | 0/63 [00:00<?, ?it/s]
Val epoch 1: 0%| | 0/16 [00:00<?, ?it/s]
Train Loss: 0.5021 | Val Loss: 0.5169 | Val AUC: 0.8334
🔥 New best model with AUC: 0.8334 (saving weights)
Epoch 2/5
------------------------------
Train epoch 2: 0%| | 0/63 [00:00<?, ?it/s]
Val epoch 2: 0%| | 0/16 [00:00<?, ?it/s]
Train Loss: 0.4896 | Val Loss: 0.5091 | Val AUC: 0.8437
🔥 New best model with AUC: 0.8437 (saving weights)
Epoch 3/5
------------------------------
Train epoch 3: 0%| | 0/63 [00:00<?, ?it/s]
Val epoch 3: 0%| | 0/16 [00:00<?, ?it/s]
Train Loss: 0.4782 | Val Loss: 0.5012 | Val AUC: 0.8479
🔥 New best model with AUC: 0.8479 (saving weights)
Epoch 4/5
------------------------------
Train epoch 4: 0%| | 0/63 [00:00<?, ?it/s]
Val epoch 4: 0%| | 0/16 [00:00<?, ?it/s]
Train Loss: 0.4649 | Val Loss: 0.4888 | Val AUC: 0.8555
🔥 New best model with AUC: 0.8555 (saving weights)
Epoch 5/5
------------------------------
Train epoch 5: 0%| | 0/63 [00:00<?, ?it/s]
Val epoch 5: 0%| | 0/16 [00:00<?, ?it/s]
Train Loss: 0.4603 | Val Loss: 0.4852 | Val AUC: 0.8569
🔥 New best model with AUC: 0.8569 (saving weights)
Training complete. Best Val AUC: 0.8569
We visualize training and validation loss, as well as validation AUC-ROC, across epochs. Even though we trained for only 5 epochs in debug mode, this code works for longer runs too.
# 7.1 Plot loss and AUC curves from history
def plot_training_history(history):
    epochs = range(1, len(history["train_loss"]) + 1)
    plt.figure(figsize=(12, 4))

    # Loss
    plt.subplot(1, 2, 1)
    plt.plot(epochs, history["train_loss"], marker="o", label="Train Loss")
    plt.plot(epochs, history["val_loss"], marker="o", label="Val Loss")
    plt.xlabel("Epoch")
    plt.ylabel("Loss")
    plt.title("Training vs Validation Loss")
    plt.legend()
    plt.grid(True)

    # AUC
    plt.subplot(1, 2, 2)
    plt.plot(epochs, history["val_auc"], marker="o", label="Val AUC")
    plt.xlabel("Epoch")
    plt.ylabel("AUC-ROC")
    plt.title("Validation AUC-ROC")
    plt.legend()
    plt.grid(True)

    plt.tight_layout()
    plt.show()

plot_training_history(history)
[Figure: two panels, "Training vs Validation Loss" and "Validation AUC-ROC", plotted against epoch]
We now evaluate the final model on the validation set:
from sklearn.metrics import confusion_matrix, classification_report
# 7.2 Evaluate on validation set
model.eval()
all_probs = []
all_labels = []
with torch.no_grad():
    for images, labels in val_loader:
        images = images.to(device)
        labels = labels.to(device).float().unsqueeze(1)
        outputs = model(images)
        probs = torch.sigmoid(outputs).cpu().numpy().ravel()
        all_probs.extend(probs)
        all_labels.extend(labels.cpu().numpy().ravel())
all_probs = np.array(all_probs)
all_labels = np.array(all_labels)
# AUC-ROC
val_auc = roc_auc_score(all_labels, all_probs)
print(f"Validation AUC-ROC (recomputed): {val_auc:.4f}")
# Hard predictions with threshold 0.5
y_pred = (all_probs >= 0.5).astype(int)
cm = confusion_matrix(all_labels, y_pred)
print("\nConfusion Matrix:\n", cm)
print("\nClassification Report:\n")
print(classification_report(all_labels, y_pred, digits=4))
Validation AUC-ROC (recomputed): 0.8569
Confusion Matrix:
[[526 68]
[136 270]]
Classification Report:
precision recall f1-score support
0.0 0.7946 0.8855 0.8376 594
1.0 0.7988 0.6650 0.7258 406
accuracy 0.7960 1000
macro avg 0.7967 0.7753 0.7817 1000
weighted avg 0.7963 0.7960 0.7922 1000
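The figures in the classification report follow directly from the confusion matrix. As a worked check, the arithmetic for the tumor class (label 1) can be reproduced as below, using scikit-learn's convention that rows are true labels and columns are predicted labels.

```python
# Confusion matrix from the validation run above:
# rows = true class (0, 1), columns = predicted class (0, 1).
cm = [[526, 68],
      [136, 270]]

tn, fp = cm[0]  # true non-tumor: predicted 0, predicted 1
fn, tp = cm[1]  # true tumor:     predicted 0, predicted 1

precision_pos = tp / (tp + fp)   # of patches flagged as tumor, how many are tumor
recall_pos = tp / (tp + fn)      # of true tumor patches, how many were flagged
f1_pos = 2 * precision_pos * recall_pos / (precision_pos + recall_pos)
accuracy = (tp + tn) / (tn + fp + fn + tp)

print(f"precision={precision_pos:.4f} recall={recall_pos:.4f} "
      f"f1={f1_pos:.4f} accuracy={accuracy:.4f}")
```

The recall of 0.6650 means that about one in three tumor patches is missed at the 0.5 threshold; lowering the threshold would trade precision for recall without changing the AUC.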
The competition expects a CSV file with the columns:
- id – image ID (without .tif)
- label – predicted probability that the image contains tumor tissue (class 1)

In this section, we:
1. Load sample_submission.csv to get the test image IDs in the correct order.
2. Create a HistopathDataset for the test folder (no labels).
3. Run the trained model on all test images to generate predicted probabilities.
4. Save a submission.csv file that can be uploaded to Kaggle.
# 8.1 Load sample_submission and create test dataset/dataloader
# Path to the Kaggle sample_submission file in your dataset directory
SAMPLE_SUB_PATH = os.path.join(DATA_DIR, "sample_submission.csv")
sample_sub_df = pd.read_csv(SAMPLE_SUB_PATH)
print("sample_submission shape:", sample_sub_df.shape)
sample_sub_df.head()
sample_submission shape: (57458, 2)
| | id | label |
|---|---|---|
| 0 | 0b2ea2a822ad23fdb1b5dd26653da899fbd2c0d5 | 0 |
| 1 | 95596b92e5066c5c52466c90b69ff089b39f2737 | 0 |
| 2 | 248e6738860e2ebcf6258cdc1f32f299e0c76914 | 0 |
| 3 | 2c35657e312966e9294eac6841726ff3a748febf | 0 |
| 4 | 145782eb7caa1c516acbe2eda34d9a3f31c41fd6 | 0 |
# We only need the 'id' column for the test set
test_df = sample_sub_df[["id"]].copy()
# Test dataset: has_labels=False so __getitem__ returns (image, id)
test_dataset = HistopathDataset(
df=test_df,
img_dir=TEST_DIR,
transform=val_transform,
has_labels=False
)
test_loader = DataLoader(
test_dataset,
batch_size=BATCH_SIZE,
shuffle=False, # do NOT shuffle; we keep Kaggle order
num_workers=0,
pin_memory=torch.cuda.is_available()
)
print("Number of test batches:", len(test_loader))
Number of test batches: 898
We now run the trained model in evaluation mode on all test images:
We then match these probabilities back to the sample_submission.csv order.
# 8.2 Inference on the test set to create Kaggle predictions (with progress prints)
model.eval()
id_to_prob = {}
total_batches = len(test_loader)
print("Total test batches:", total_batches)
with torch.no_grad():
    for batch_idx, (images, ids) in enumerate(test_loader, start=1):
        images = images.to(device)
        outputs = model(images)                               # [B, 1]
        probs = torch.sigmoid(outputs).cpu().numpy().ravel()  # [B]
        for img_id, p in zip(ids, probs):
            id_to_prob[img_id] = p
        if batch_idx % 20 == 0 or batch_idx == total_batches:
            print(f"Processed {batch_idx}/{total_batches} batches")
print("Number of predictions in id_to_prob:", len(id_to_prob))
Total test batches: 898
Processed 20/898 batches
Processed 40/898 batches
Processed 60/898 batches
Processed 80/898 batches
Processed 100/898 batches
Processed 120/898 batches
Processed 140/898 batches
Processed 160/898 batches
Processed 180/898 batches
Processed 200/898 batches
Processed 220/898 batches
Processed 240/898 batches
Processed 260/898 batches
Processed 280/898 batches
Processed 300/898 batches
Processed 320/898 batches
Processed 340/898 batches
Processed 360/898 batches
Processed 380/898 batches
Processed 400/898 batches
Processed 420/898 batches
Processed 440/898 batches
Processed 460/898 batches
Processed 480/898 batches
Processed 500/898 batches
Processed 520/898 batches
Processed 540/898 batches
Processed 560/898 batches
Processed 580/898 batches
Processed 600/898 batches
Processed 620/898 batches
Processed 640/898 batches
Processed 660/898 batches
Processed 680/898 batches
Processed 700/898 batches
Processed 720/898 batches
Processed 740/898 batches
Processed 760/898 batches
Processed 780/898 batches
Processed 800/898 batches
Processed 820/898 batches
Processed 840/898 batches
Processed 860/898 batches
Processed 880/898 batches
Processed 898/898 batches
Number of predictions in id_to_prob: 57458
We now:
- Copy sample_submission.csv so we preserve the exact ordering of test image IDs.
- Fill the label column using our predicted probabilities.
- Save the result as submission_resnet18_debug.csv.

This CSV can be uploaded directly to Kaggle for scoring.
# 8.3 Create submission DataFrame in the same order as sample_submission
submission_df = sample_sub_df.copy()
submission_df["label"] = submission_df["id"].map(id_to_prob)
# Sanity check: no missing predictions
missing = submission_df["label"].isnull().sum()
print("Missing predictions:", missing)
OUTPUT_CSV = "submission_resnet18_debug.csv"
submission_df.to_csv(OUTPUT_CSV, index=False)
print(f"Saved submission file: {OUTPUT_CSV}")
submission_df.head()
Missing predictions: 0
Saved submission file: submission_resnet18_debug.csv
| | id | label |
|---|---|---|
| 0 | 0b2ea2a822ad23fdb1b5dd26653da899fbd2c0d5 | 0.233324 |
| 1 | 95596b92e5066c5c52466c90b69ff089b39f2737 | 0.760667 |
| 2 | 248e6738860e2ebcf6258cdc1f32f299e0c76914 | 0.301223 |
| 3 | 2c35657e312966e9294eac6841726ff3a748febf | 0.280466 |
| 4 | 145782eb7caa1c516acbe2eda34d9a3f31c41fd6 | 0.368737 |
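Before uploading, the saved file can be checked against the expected format. The following standard-library sketch is one possible check; the function name `validate_submission` is an assumption, and the ordering test simply mirrors the order this notebook preserves rather than a documented Kaggle requirement.

```python
import csv
import math


def validate_submission(csv_path, expected_ids):
    """Return a list of problems found in a submission CSV (empty list means OK)."""
    with open(csv_path, newline="") as f:
        reader = csv.DictReader(f)
        if reader.fieldnames != ["id", "label"]:
            return [f"unexpected columns: {reader.fieldnames}"]
        rows = list(reader)

    problems = []
    if [row["id"] for row in rows] != list(expected_ids):
        problems.append("ids differ from the expected ids or their order")
    for row in rows:
        try:
            value = float(row["label"])
        except ValueError:
            problems.append(f"non-numeric label for id {row['id']}")
            continue
        # Predictions are probabilities, so they must be finite and within [0, 1].
        if math.isnan(value) or not 0.0 <= value <= 1.0:
            problems.append(f"label out of range for id {row['id']}: {value}")
    return problems
```

For this notebook, `validate_submission(OUTPUT_CSV, sample_sub_df["id"])` should return an empty list.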
To externally validate the model, I generated predictions on the Kaggle test set and submitted the file submission_resnet18_debug.csv to the Histopathologic Cancer Detection competition. Kaggle evaluates submissions using AUC-ROC on a hidden test set split into a public and a private portion.
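AUC-ROC depends only on how the predictions rank positives against negatives, not on their absolute values. A small pure-Python sketch makes this concrete by computing AUC as the fraction of (positive, negative) pairs that are ordered correctly; `pairwise_auc` is an illustrative name, and this quadratic-time version is meant for explanation rather than for large datasets.

```python
def pairwise_auc(labels, scores):
    """AUC as the probability that a random positive outranks a random negative."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC is undefined unless both classes are present.")

    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # correctly ordered pair
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(positives) * len(negatives))


# Three of the four positive/negative pairs are ordered correctly.
toy_auc = pairwise_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
```

Because only the ordering matters, any strictly increasing rescaling of the predicted probabilities leaves the score unchanged.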
My best model (ResNet-18 with transfer learning, trained for 5 epochs on a 5,000-sample subset) achieved:
The Kaggle scores are consistent with the internal validation performance: all three values are in the 0.80–0.84 range, which suggests that the model generalizes reasonably well to unseen data and that my local validation procedure is not severely optimistic. The small gap between public and private scores is expected, as they are computed on different subsets of the hidden test set. Overall, the Kaggle evaluation supports the conclusion that the model has learned features that are predictive of metastatic tissue in the histopathology patches.
Screenshot 2025-11-21 at 7.55.03 PM.png