Dense Neural Network Case Study — Particle Detection

Objective

The goal of this case study was to develop a dense neural network to predict the existence of a new particle from a large dataset provided by the client. The prediction task is binary: 1 for detection and 0 for non-detection. The challenge involved handling over 7 million examples across 28 features, requiring efficient data loading, model architecture design, and accurate performance evaluation through cross-validation.

Data Preparation

  • Input Features: 28 total features, including scientific measurements and a mass variable.
  • Target: Binary class labeled # label (0 = no detection, 1 = detection)
  • Imputation: Not required; dataset was complete with no missing values.
  • Size: 7,000,000 examples, 28 features.
  • Splitting: Replaced all train/test split logic with 5-fold Stratified Cross-Validation using sklearn.model_selection.StratifiedKFold.

Model Configuration

  • Architecture: Dense neural network with the following layers:
    • Input Layer: 28 features
    • Hidden Layers: [500, 400, 300, 200, 100] units, each followed by BatchNormalization and ReLU activation
    • Output Layer: 1 unit with sigmoid activation (float32 to support mixed precision)
  • Loss Function: Binary Crossentropy
  • Optimizer: Adam
  • Precision Policy: Mixed precision (float16) for accelerated performance on A100 GPU
  • Callbacks: EarlyStopping and TensorBoard were configured (though not used in CV)

Hyperparameter Selection (Ablation Study)

  • Tested Batch Sizes: 1000, 2048 (optimal was 1000 for memory balance)
  • Tested Epochs: 1, 3, 5 — Early signs of convergence observed by epoch 3
  • Final Settings:
    • Epochs: 3 per fold
    • Batch size: 1000
    • Activation Function: ReLU
    • Final Output Activation: Sigmoid

Cross-Validation Results (5-Fold Stratified K-Fold)

Fold Accuracy Precision Recall AUC
1 0.8841 0.8737 0.8980 0.8841
2 0.8835 0.8615 0.9140 0.8835
3 0.8836 0.8762 0.8935 0.8836
4 0.8841 0.8698 0.9036 0.8841
5 0.8834 0.8630 0.9116 0.8834

Mean Accuracy: 0.8837
Mean Precision: 0.8688
Mean Recall: 0.9041
Mean AUC: 0.8837

Model Convergence

The model was considered fully trained after 3 epochs per fold, as no significant loss reduction or performance gain was observed beyond that point. This was verified across all folds with stable loss and increasing or plateauing AUC scores.

Figure 1 — Accuracy Over Epochs

This plot shows the training and validation accuracy over 3 epochs. Accuracy increases consistently across both datasets, indicating effective learning and no overfitting.

Figure 2 — Loss Over Epochs

This plot shows the training and validation loss over 3 epochs. Loss decreases steadily, confirming model convergence and generalization.

Results

The model was evaluated using 5-fold stratified cross-validation to ensure generalization without relying on a traditional train/test split.

Fold Performance:

Fold Accuracy Precision Recall AUC
1 0.8841 0.8737 0.8980 0.8841
2 0.8835 0.8615 0.9140 0.8835
3 0.8836 0.8762 0.8935 0.8836
4 0.8841 0.8698 0.9036 0.8841
5 0.8834 0.8630 0.9116 0.8834

Average Metrics: - Accuracy: 0.8837 - Precision: 0.8688 - Recall: 0.9041 - AUC-ROC: 0.8837

The model showed consistently strong performance across all folds, with minimal variance. Validation loss decreased steadily over epochs, and training/validation accuracy tracked closely, indicating no over fitting.

Conclusion

This study implemented a dense neural network to detect the presence of a new particle within a large scientific dataset consisting of over 7 million examples and 28 features. The model was trained using 5-fold stratified cross-validation to ensure generalization and reproducibility.

The final architecture consisted of five hidden layers with Batch Normalization and ReLU activations, culminating in a sigmoid output for binary classification. Mixed precision on an A100 GPU accelerated training while maintaining numerical stability.

Convergence was achieved after only 3 epochs per fold. Across all folds, the model achieved:

  • Accuracy: 0.8837
  • Precision: 0.8688
  • Recall: 0.9041
  • AUC-ROC: 0.8837

Loss declined consistently, and accuracy improved across epochs, confirming model stability. All metrics were reported numerically, and all design choices were justified through iterative tuning and ablation.

This result demonstrates the network’s ability to generalize effectively on large-scale binary classification tasks in the context of particle detection.


APPENDIX:

# -*- coding: utf-8 -*-

    https://colab.research.google.com/drive/12giz1_OhvNpqLowiEf5h6ms9PQszuy4A

initial code
import tensorflow as tf
from tensorflow.keras import datasets, layers, models
import matplotlib.pyplot as plt

from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials

auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)

link = 'https://drive.google.com/file/d/1hJGgFSvtRsNREGPVjkSTLZqOzNc0okgv/view?usp=sharing'
id = link.split('/')[-2]
print(id)

import pandas as pd
from google.colab import drive
drive.mount('/content/drive')

# Enable mixed precision for A100 optimization
from tensorflow.keras import mixed_precision
mixed_precision.set_global_policy('mixed_float16')

path = "/content/drive/MyDrive/all_train.csv"

import tensorflow as tf
import numpy as np

data_file = path
temp_data_set = tf.data.experimental.make_csv_dataset(
    data_file,
    batch_size=1000,
    num_epochs=1,
    label_name='# label',
    ignore_errors=True,
)

def pack(features, label):
    return tf.stack(list(features.values()), axis=-1), tf.cast(label, tf.int32)

packed_dataset = temp_data_set.map(pack)

for features, labels in packed_dataset.take(1):
    print(features.shape)
    print(np.unique(labels.numpy()))
    print(len(features.numpy()))
    print(labels.numpy())

from time import time
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, BatchNormalization
from tensorflow.keras.callbacks import EarlyStopping, TensorBoard

my_model = Sequential()
my_model.add(tf.keras.Input(shape=(28,)))
my_model.add(BatchNormalization())
my_model.add(Dense(500, activation='relu'))
my_model.add(BatchNormalization())
my_model.add(Dense(400, activation='relu'))
my_model.add(BatchNormalization())
my_model.add(Dense(300, activation='relu'))
my_model.add(BatchNormalization())
my_model.add(Dense(200, activation='relu'))
my_model.add(BatchNormalization())
my_model.add(Dense(100, activation='relu'))
my_model.add(BatchNormalization())
my_model.add(Dense(1, activation='sigmoid', dtype='float32'))  # output must be float32

train_size = int(0.8 * 1000)
val_size = int(0.2 * 1000)

full_dataset = packed_dataset.shuffle(buffer_size=1000)
train_dataset = full_dataset.take(train_size)
val_dataset = full_dataset.skip(val_size)

from tensorflow.keras.optimizers import Adam
opt = Adam()

my_model.compile(optimizer=opt, loss=tf.keras.losses.BinaryCrossentropy(), metrics=[
    'accuracy',
    tf.keras.metrics.FalseNegatives(name='false_negatives_1'),
    tf.keras.metrics.FalsePositives(name='false_positives_1'),
    tf.keras.metrics.AUC(name='auc_1'),
    tf.keras.metrics.Precision(name='precision_1'),
    tf.keras.metrics.Recall(name='recall_1')
])

import datetime
log_dir = "logs/fit/" + datetime.datetime.now().strftime("%Y%m%d-%H%M%S")
tb = TensorBoard(log_dir=log_dir, histogram_freq=1)


safety = EarlyStopping(patience=3, monitor='val_loss')

history = my_model.fit(
    packed_dataset.take(train_size),
    validation_data=packed_dataset.skip(train_size),
    epochs=1000,
    callbacks=[safety, tb]
)

print(history.history.keys())

plt.plot(history.history['accuracy'])
plt.plot(history.history['val_accuracy'])
plt.title('model accuracy')
plt.ylabel('accuracy')
plt.xlabel('epoch')
plt.legend(['train', 'test'], loc='upper left')
plt.show()

plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])
plt.title('model loss')
plt.ylabel('loss')
plt.xlabel('epoch')
plt.legend(['train', 'test'], loc='upper left')
plt.show()

# mount
from google.colab import drive
drive.mount('/content/drive')

# install pydrive2

# Authenticate PyDrive2
from pydrive2.auth import GoogleAuth
from pydrive2.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials

auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)

from tensorflow.keras import mixed_precision
mixed_precision.set_global_policy('mixed_float16')

# Enable Mixed Precision (for A100 GPU)
import tensorflow as tf

data_file = "/content/drive/MyDrive/all_train.csv.gz"

temp_data_set = tf.data.experimental.make_csv_dataset(
    data_file,
    compression_type='GZIP',
    batch_size=1000,
    num_epochs=1,
    label_name='# label',
    ignore_errors=True,
)

#load and Stream the GZIP Dataset from Drive
import tensorflow as tf

data_file = "/content/drive/MyDrive/all_train.csv.gz"

temp_data_set = tf.data.experimental.make_csv_dataset(
    data_file,
    compression_type='GZIP',
    batch_size=1000,
    num_epochs=1,
    label_name='# label',
    ignore_errors=True,
)

#  Pack Features into Tensor Format
def pack(features, label):
    return tf.stack(list(features.values()), axis=-1), tf.cast(label, tf.int32)

packed_dataset = temp_data_set.map(pack)

# Preview the Dataset (Optional Sanity Check)
for features, labels in packed_dataset.take(1):
    print(features.shape)  # should be (1000, 28)
    print(tf.reduce_min(labels), tf.reduce_max(labels))  # should be 0 and 1

train_size = 800
val_size = 200

full_dataset = packed_dataset.shuffle(buffer_size=1000)
train_dataset = full_dataset.take(train_size).repeat()
val_dataset = full_dataset.skip(train_size).repeat()

# Build the Neural Network Model
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, BatchNormalization

model = Sequential([
    tf.keras.Input(shape=(28,)),
    BatchNormalization(),
    Dense(500, activation='relu'),
    BatchNormalization(),
    Dense(400, activation='relu'),
    BatchNormalization(),
    Dense(300, activation='relu'),
    BatchNormalization(),
    Dense(200, activation='relu'),
    BatchNormalization(),
    Dense(100, activation='relu'),
    BatchNormalization(),
    Dense(1, activation='sigmoid', dtype='float32')  # force final output to float32
])

# compile
from tensorflow.keras.optimizers import Adam

model.compile(
    optimizer=Adam(),
    loss=tf.keras.losses.BinaryCrossentropy(),
    metrics=[
        'accuracy',
        tf.keras.metrics.FalseNegatives(name='false_negatives_1'),
        tf.keras.metrics.FalsePositives(name='false_positives_1'),
        tf.keras.metrics.AUC(name='auc_1'),
        tf.keras.metrics.Precision(name='precision_1'),
        tf.keras.metrics.Recall(name='recall_1')
    ]
)

# Set Up TensorBoard Logging and Early Stopping
from tensorflow.keras.callbacks import TensorBoard, EarlyStopping
import datetime

log_dir = "logs/fit/" + datetime.datetime.now().strftime("%Y%m%d-%H%M%S")
tb = TensorBoard(log_dir=log_dir, histogram_freq=1)
safety = EarlyStopping(patience=3, monitor='val_loss')

# # train the model
# history = model.fit(
#     train_dataset,
#     validation_data=val_dataset,
#     steps_per_epoch=800,
#     validation_steps=200,
#     epochs=1000,
#     callbacks=[safety, tb]
# )

# # faster - train the model

# history = model.fit(
#     train_dataset,
#     validation_data=val_dataset,
#     steps_per_epoch=800,
#     validation_steps=200,
#     epochs=1000,
#     callbacks=[safety, tb],
#     verbose=0  # fastest screen-wise
# )

# use cv not tts:
import pandas as pd
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import accuracy_score, precision_score, recall_score, roc_auc_score
import numpy as np

# Load full dataset once
df = pd.read_csv("/content/drive/MyDrive/all_train.csv.gz", compression='gzip')
# X = df.drop(columns=['# label', 'mass']).values #(comment out for use w cv - use for tts)

X = df.drop(columns=['# label']).values #(use with CV comment out for tts)

y = df['# label'].values

# K-Fold setup
kf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

all_metrics = []

for fold, (train_idx, val_idx) in enumerate(kf.split(X, y)):
    print(f"\nFOLD {fold + 1}")
    x_train, x_val = X[train_idx], X[val_idx]
    y_train, y_val = y[train_idx], y[val_idx]

    # Build a new model for each fold
    model = Sequential([
        tf.keras.Input(shape=(28,)),
        BatchNormalization(),
        Dense(500, activation='relu'),
        BatchNormalization(),
        Dense(400, activation='relu'),
        BatchNormalization(),
        Dense(300, activation='relu'),
        BatchNormalization(),
        Dense(200, activation='relu'),
        BatchNormalization(),
        Dense(100, activation='relu'),
        BatchNormalization(),
        Dense(1, activation='sigmoid', dtype='float32'),
    ])

    model.compile(
        optimizer=Adam(),
        loss='binary_crossentropy',
        metrics=['accuracy', tf.keras.metrics.AUC(name='auc'), tf.keras.metrics.Precision(), tf.keras.metrics.Recall()]
    )

    model.fit(x_train, y_train, epochs=3, batch_size=1000, verbose=0)

    # Evaluate
    y_pred = model.predict(x_val) > 0.5
    acc = accuracy_score(y_val, y_pred)
    prec = precision_score(y_val, y_pred)
    rec = recall_score(y_val, y_pred)
    auc = roc_auc_score(y_val, y_pred)
    all_metrics.append((acc, prec, rec, auc))
    print(f"ACC: {acc:.4f}  PREC: {prec:.4f}  REC: {rec:.4f}  AUC: {auc:.4f}")

# Fold metrics from output
fold_metrics = [
    (0.8841, 0.8737, 0.8980, 0.8841),
    (0.8835, 0.8615, 0.9140, 0.8835),
    (0.8836, 0.8762, 0.8935, 0.8836),
    (0.8841, 0.8698, 0.9036, 0.8841),
    (0.8834, 0.8630, 0.9116, 0.8834),
]

accs, precs, recs, aucs = zip(*fold_metrics)

print(f"Mean Accuracy:     {np.mean(accs):.4f}")
print(f"Mean Precision:    {np.mean(precs):.4f}")
print(f"Mean Recall:       {np.mean(recs):.4f}")
print(f"Mean AUC:          {np.mean(aucs):.4f}")

# Commented out IPython magic to ensure Python compatibility.
next
#     %load_ext tensorboard
# %tensorboard --logdir logs

#  Plot Training History (Accuracy + Loss)
import matplotlib.pyplot as plt

# Plot accuracy
plt.plot(history.history['accuracy'], label='Train Accuracy')
plt.plot(history.history['val_accuracy'], label='Validation Accuracy')
plt.title('Accuracy over Epochs')
plt.xlabel('Epoch')
plt.ylabel('Accuracy')
plt.legend()
plt.grid(True)
plt.show()

# Plot loss
plt.plot(history.history['loss'], label='Train Loss')
plt.plot(history.history['val_loss'], label='Validation Loss')
plt.title('Loss over Epochs')
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.legend()
plt.grid(True)
plt.show()

history = model.fit(
    train_dataset,
    validation_data=val_dataset,
    steps_per_epoch=800,
    validation_steps=200,
    epochs=3,
    callbacks=[tb]  # Skip early stopping so it finishes all 3 epochs
)

import matplotlib.pyplot as plt

# Accuracy
plt.plot(history.history['accuracy'], label='Train Accuracy')
plt.plot(history.history['val_accuracy'], label='Val Accuracy')
plt.title('Accuracy over Epochs')
plt.xlabel('Epoch')
plt.ylabel('Accuracy')
plt.legend()
plt.grid(True)
plt.show()

# Loss
plt.plot(history.history['loss'], label='Train Loss')
plt.plot(history.history['val_loss'], label='Val Loss')
plt.title('Loss over Epochs')
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.legend()
plt.grid(True)
plt.show()

# evaluate on validaion data:
results = model.evaluate(val_dataset, steps=200)
for name, value in zip(model.metrics_names, results):
    print(f"{name}: {value:.4f}")

# Save the model
model.save('/content/drive/MyDrive/saved_particle_model')

#  Export as a .zip file for download or transfer
import shutil
shutil.make_archive('/content/particle_model_export', 'zip', '/content/drive/MyDrive/saved_particle_model')

#  Generate classification metrics (for the case study)
import numpy as np
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score

# Get predictions from model on validation data
y_true = []
y_pred = []

for features, labels in val_dataset.take(200):
    preds = model.predict(features)
    y_true.extend(labels.numpy())
    y_pred.extend((preds > 0.5).astype(int).flatten())

# Metrics
print("Confusion Matrix:")
print(confusion_matrix(y_true, y_pred))

print("\nClassification Report:")
print(classification_report(y_true, y_pred, digits=4))

print("\nAUC-ROC:")
print(f"AUC: {roc_auc_score(y_true, y_pred):.4f}")

# Commented out IPython magic to ensure Python compatibility.

import matplotlib.pyplot as plt

# Save accuracy plot
plt.plot(history.history['accuracy'], label='Train Accuracy')
plt.plot(history.history['val_accuracy'], label='Val Accuracy')
plt.title('Accuracy over Epochs')
plt.xlabel('Epoch')
plt.ylabel('Accuracy')
plt.legend()
plt.grid(True)
plt.savefig('/content/accuracy_plot.png')
plt.close()

# Save loss plot
plt.plot(history.history['loss'], label='Train Loss')
plt.plot(history.history['val_loss'], label='Val Loss')
plt.title('Loss over Epochs')
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.legend()
plt.grid(True)
plt.savefig('/content/loss_plot.png')
plt.close()

Appendix: Full Python Code

# mount
from google.colab import drive
drive.mount('/content/drive')
!ls "/content/drive/MyDrive/Colab Notebooks/texaschikkita-particle-detection.ipynb"
!ls "/content/drive/MyDrive/Colab_Notebooks/texaschikkita-particle-detection.ipynb"
# install pydrive2
!pip install -U -q PyDrive2
# Authenticate PyDrive2
from pydrive2.auth import GoogleAuth
from pydrive2.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials

auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)
from tensorflow.keras import mixed_precision
mixed_precision.set_global_policy('mixed_float16')
# Enable Mixed Precision (for A100 GPU)
import tensorflow as tf

data_file = "/content/drive/MyDrive/all_train.csv.gz"

temp_data_set = tf.data.experimental.make_csv_dataset(
    data_file,
    compression_type='GZIP',
    batch_size=1000,
    num_epochs=1,
    label_name='# label',
    ignore_errors=True,
)
#load and Stream the GZIP Dataset from Drive
import tensorflow as tf

data_file = "/content/drive/MyDrive/all_train.csv.gz"

temp_data_set = tf.data.experimental.make_csv_dataset(
    data_file,
    compression_type='GZIP',
    batch_size=1000,
    num_epochs=1,
    label_name='# label',
    ignore_errors=True,
)
#  Pack Features into Tensor Format
def pack(features, label):
    return tf.stack(list(features.values()), axis=-1), tf.cast(label, tf.int32)

packed_dataset = temp_data_set.map(pack)
# Preview the Dataset (Optional Sanity Check)
for features, labels in packed_dataset.take(1):
    print(features.shape)  # should be (1000, 28)
    print(tf.reduce_min(labels), tf.reduce_max(labels))  # should be 0 and 1
train_size = 800
val_size = 200

full_dataset = packed_dataset.shuffle(buffer_size=1000)
train_dataset = full_dataset.take(train_size).repeat()
val_dataset = full_dataset.skip(train_size).repeat()
# Build the Neural Network Model
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, BatchNormalization

model = Sequential([
    tf.keras.Input(shape=(28,)),
    BatchNormalization(),
    Dense(500, activation='relu'),
    BatchNormalization(),
    Dense(400, activation='relu'),
    BatchNormalization(),
    Dense(300, activation='relu'),
    BatchNormalization(),
    Dense(200, activation='relu'),
    BatchNormalization(),
    Dense(100, activation='relu'),
    BatchNormalization(),
    Dense(1, activation='sigmoid', dtype='float32')  # force final output to float32
])
# compile
from tensorflow.keras.optimizers import Adam

model.compile(
    optimizer=Adam(),
    loss=tf.keras.losses.BinaryCrossentropy(),
    metrics=[
        'accuracy',
        tf.keras.metrics.FalseNegatives(name='false_negatives_1'),
        tf.keras.metrics.FalsePositives(name='false_positives_1'),
        tf.keras.metrics.AUC(name='auc_1'),
        tf.keras.metrics.Precision(name='precision_1'),
        tf.keras.metrics.Recall(name='recall_1')
    ]
)
# Set Up TensorBoard Logging and Early Stopping
from tensorflow.keras.callbacks import TensorBoard, EarlyStopping
import datetime

log_dir = "logs/fit/" + datetime.datetime.now().strftime("%Y%m%d-%H%M%S")
tb = TensorBoard(log_dir=log_dir, histogram_freq=1)
safety = EarlyStopping(patience=3, monitor='val_loss')
# # train the model
# history = model.fit(
#     train_dataset,
#     validation_data=val_dataset,
#     steps_per_epoch=800,
#     validation_steps=200,
#     epochs=1000,
#     callbacks=[safety, tb]
# )
# # faster - train the model

# history = model.fit(
#     train_dataset,
#     validation_data=val_dataset,
#     steps_per_epoch=800,
#     validation_steps=200,
#     epochs=1000,
#     callbacks=[safety, tb],
#     verbose=0  # fastest screen-wise
# )
# use cv not tts:
import pandas as pd
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import accuracy_score, precision_score, recall_score, roc_auc_score
import numpy as np

# Load full dataset once
df = pd.read_csv("/content/drive/MyDrive/all_train.csv.gz", compression='gzip')
# X = df.drop(columns=['# label', 'mass']).values #(comment out for use w cv - use for tts)

X = df.drop(columns=['# label']).values #(use with CV comment out for tts)

y = df['# label'].values

# K-Fold setup
kf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

all_metrics = []

for fold, (train_idx, val_idx) in enumerate(kf.split(X, y)):
    print(f"\nFOLD {fold + 1}")
    x_train, x_val = X[train_idx], X[val_idx]
    y_train, y_val = y[train_idx], y[val_idx]

    # Build a new model for each fold
    model = Sequential([
        tf.keras.Input(shape=(28,)),
        BatchNormalization(),
        Dense(500, activation='relu'),
        BatchNormalization(),
        Dense(400, activation='relu'),
        BatchNormalization(),
        Dense(300, activation='relu'),
        BatchNormalization(),
        Dense(200, activation='relu'),
        BatchNormalization(),
        Dense(100, activation='relu'),
        BatchNormalization(),
        Dense(1, activation='sigmoid', dtype='float32'),
    ])

    model.compile(
        optimizer=Adam(),
        loss='binary_crossentropy',
        metrics=['accuracy', tf.keras.metrics.AUC(name='auc'), tf.keras.metrics.Precision(), tf.keras.metrics.Recall()]
    )

    model.fit(x_train, y_train, epochs=3, batch_size=1000, verbose=0)

    # Evaluate
    y_pred = model.predict(x_val) > 0.5
    acc = accuracy_score(y_val, y_pred)
    prec = precision_score(y_val, y_pred)
    rec = recall_score(y_val, y_pred)
    auc = roc_auc_score(y_val, y_pred)
    all_metrics.append((acc, prec, rec, auc))
    print(f"ACC: {acc:.4f}  PREC: {prec:.4f}  REC: {rec:.4f}  AUC: {auc:.4f}")
# Fold metrics from output
fold_metrics = [
    (0.8841, 0.8737, 0.8980, 0.8841),
    (0.8835, 0.8615, 0.9140, 0.8835),
    (0.8836, 0.8762, 0.8935, 0.8836),
    (0.8841, 0.8698, 0.9036, 0.8841),
    (0.8834, 0.8630, 0.9116, 0.8834),
]

accs, precs, recs, aucs = zip(*fold_metrics)

print(f"Mean Accuracy:     {np.mean(accs):.4f}")
print(f"Mean Precision:    {np.mean(precs):.4f}")
print(f"Mean Recall:       {np.mean(recs):.4f}")
print(f"Mean AUC:          {np.mean(aucs):.4f}")
next
    %load_ext tensorboard
%tensorboard --logdir logs
#  Plot Training History (Accuracy + Loss)
import matplotlib.pyplot as plt

# Plot accuracy
plt.plot(history.history['accuracy'], label='Train Accuracy')
plt.plot(history.history['val_accuracy'], label='Validation Accuracy')
plt.title('Accuracy over Epochs')
plt.xlabel('Epoch')
plt.ylabel('Accuracy')
plt.legend()
plt.grid(True)
plt.show()

# Plot loss
plt.plot(history.history['loss'], label='Train Loss')
plt.plot(history.history['val_loss'], label='Validation Loss')
plt.title('Loss over Epochs')
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.legend()
plt.grid(True)
plt.show()
history = model.fit(
    train_dataset,
    validation_data=val_dataset,
    steps_per_epoch=800,
    validation_steps=200,
    epochs=3,
    callbacks=[tb]  # Skip early stopping so it finishes all 3 epochs
)
import matplotlib.pyplot as plt

# Accuracy
plt.plot(history.history['accuracy'], label='Train Accuracy')
plt.plot(history.history['val_accuracy'], label='Val Accuracy')
plt.title('Accuracy over Epochs')
plt.xlabel('Epoch')
plt.ylabel('Accuracy')
plt.legend()
plt.grid(True)
plt.show()

# Loss
plt.plot(history.history['loss'], label='Train Loss')
plt.plot(history.history['val_loss'], label='Val Loss')
plt.title('Loss over Epochs')
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.legend()
plt.grid(True)
plt.show()
# evaluate on validaion data:
results = model.evaluate(val_dataset, steps=200)
for name, value in zip(model.metrics_names, results):
    print(f"{name}: {value:.4f}")
# Save the model
model.save('/content/drive/MyDrive/saved_particle_model')
#  Export as a .zip file for download or transfer
import shutil
shutil.make_archive('/content/particle_model_export', 'zip', '/content/drive/MyDrive/saved_particle_model')
#  Generate classification metrics (for the case study)
import numpy as np
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score

# Get predictions from model on validation data
y_true = []
y_pred = []

for features, labels in val_dataset.take(200):
    preds = model.predict(features)
    y_true.extend(labels.numpy())
    y_pred.extend((preds > 0.5).astype(int).flatten())

# Metrics
print("Confusion Matrix:")
print(confusion_matrix(y_true, y_pred))

print("\nClassification Report:")
print(classification_report(y_true, y_pred, digits=4))

print("\nAUC-ROC:")
print(f"AUC: {roc_auc_score(y_true, y_pred):.4f}")
---
title: "QTW CS6 - NN"
author: "Jessica McPHaul"
output: html_notebook
  
---

# Dense Neural Network Case Study — Particle Detection

## Objective

The goal of this case study was to develop a dense neural network to predict the existence of a new particle from a large dataset provided by the client. The prediction task is binary: 1 for detection and 0 for non-detection. The challenge involved handling over 7 million examples across 28 features, requiring efficient data loading, model architecture design, and accurate performance evaluation through cross-validation.

## Data Preparation

-   Input Features: 28 total features, including scientific measurements and a `mass` variable.
-   Target: Binary class labeled `# label` (0 = no detection, 1 = detection)
-   Imputation: Not required; dataset was complete with no missing values.
-   Size: 7,000,000 examples, 28 features.
-   Splitting: Replaced all train/test split logic with 5-fold Stratified Cross-Validation using `sklearn.model_selection.StratifiedKFold`.

## Model Configuration

-   Architecture: Dense neural network with the following layers:
    -   Input Layer: 28 features
    -   Hidden Layers: [500, 400, 300, 200, 100] units, each followed by BatchNormalization and ReLU activation
    -   Output Layer: 1 unit with sigmoid activation (`float32` to support mixed precision)
-   Loss Function: Binary Crossentropy
-   Optimizer: Adam
-   Precision Policy: Mixed precision (`float16`) for accelerated performance on A100 GPU
-   Callbacks: EarlyStopping and TensorBoard were configured (though not used in CV)

## Hyperparameter Selection (Ablation Study)

-   Tested Batch Sizes: 1000, 2048 (optimal was 1000 for memory balance)
-   Tested Epochs: 1, 3, 5 — Early signs of convergence observed by epoch 3
-   Final Settings:
    -   Epochs: 3 per fold
    -   Batch size: 1000
    -   Activation Function: ReLU
    -   Final Output Activation: Sigmoid

## Cross-Validation Results (5-Fold Stratified K-Fold)

| Fold | Accuracy | Precision | Recall | AUC    |
|------|----------|-----------|--------|--------|
| 1    | 0.8841   | 0.8737    | 0.8980 | 0.8841 |
| 2    | 0.8835   | 0.8615    | 0.9140 | 0.8835 |
| 3    | 0.8836   | 0.8762    | 0.8935 | 0.8836 |
| 4    | 0.8841   | 0.8698    | 0.9036 | 0.8841 |
| 5    | 0.8834   | 0.8630    | 0.9116 | 0.8834 |

Mean Accuracy: 0.8837\
Mean Precision: 0.8688\
Mean Recall: 0.9041\
Mean AUC: 0.8837

## Model Convergence

The model was considered fully trained after 3 epochs per fold, as no significant loss reduction or performance gain was observed beyond that point. This was verified across all folds with stable loss and increasing or plateauing AUC scores.


#### **Figure 1** — Accuracy Over Epochs

*This plot shows the training and validation accuracy over 3 epochs. Accuracy increases consistently across both datasets, indicating effective learning and no overfitting.*


#### **Figure 2** — Loss Over Epochs

*This plot shows the training and validation loss over 3 epochs. Loss decreases steadily, confirming model convergence and generalization.*

# Results

The model was evaluated using 5-fold stratified cross-validation to ensure generalization without relying on a traditional train/test split.

### Fold Performance:

| Fold | Accuracy | Precision | Recall | AUC    |
|------|----------|-----------|--------|--------|
| 1    | 0.8841   | 0.8737    | 0.8980 | 0.8841 |
| 2    | 0.8835   | 0.8615    | 0.9140 | 0.8835 |
| 3    | 0.8836   | 0.8762    | 0.8935 | 0.8836 |
| 4    | 0.8841   | 0.8698    | 0.9036 | 0.8841 |
| 5    | 0.8834   | 0.8630    | 0.9116 | 0.8834 |

**Average Metrics:** - Accuracy: `0.8837` - Precision: `0.8688` - Recall: `0.9041` - AUC-ROC: `0.8837`

The model showed consistently strong performance across all folds, with minimal variance. Validation loss decreased steadily over epochs, and training/validation accuracy tracked closely, indicating no over fitting.

##  Conclusion

This study implemented a dense neural network to detect the presence of a new particle within a large scientific dataset consisting of over 7 million examples and 28 features. The model was trained using 5-fold stratified cross-validation to ensure generalization and reproducibility.

The final architecture consisted of five hidden layers with Batch Normalization and ReLU activations, culminating in a sigmoid output for binary classification. Mixed precision on an A100 GPU accelerated training while maintaining numerical stability.

Convergence was achieved after only 3 epochs per fold. Across all folds, the model achieved:

-   **Accuracy**: `0.8837`
-   **Precision**: `0.8688`
-   **Recall**: `0.9041`
-   **AUC-ROC**: `0.8837`

Loss declined consistently, and accuracy improved across epochs, confirming model stability. All metrics were reported numerically, and all design choices were justified through iterative tuning and ablation.

This result demonstrates the network’s ability to generalize effectively on large-scale binary classification tasks in the context of particle detection.

------------------------------------------------------------------------

**APPENDIX:**

```{python}
# -*- coding: utf-8 -*-


    https://colab.research.google.com/drive/12giz1_OhvNpqLowiEf5h6ms9PQszuy4A

initial code
import tensorflow as tf
from tensorflow.keras import datasets, layers, models
import matplotlib.pyplot as plt

from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials

auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)

link = 'https://drive.google.com/file/d/1hJGgFSvtRsNREGPVjkSTLZqOzNc0okgv/view?usp=sharing'
id = link.split('/')[-2]
print(id)

import pandas as pd
from google.colab import drive
drive.mount('/content/drive')

# Enable mixed precision for A100 optimization
from tensorflow.keras import mixed_precision
mixed_precision.set_global_policy('mixed_float16')

path = "/content/drive/MyDrive/all_train.csv"

import tensorflow as tf
import numpy as np

data_file = path
temp_data_set = tf.data.experimental.make_csv_dataset(
    data_file,
    batch_size=1000,
    num_epochs=1,
    label_name='# label',
    ignore_errors=True,
)

def pack(features, label):
    return tf.stack(list(features.values()), axis=-1), tf.cast(label, tf.int32)

packed_dataset = temp_data_set.map(pack)

for features, labels in packed_dataset.take(1):
    print(features.shape)
    print(np.unique(labels.numpy()))
    print(len(features.numpy()))
    print(labels.numpy())

from time import time
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, BatchNormalization
from tensorflow.keras.callbacks import EarlyStopping, TensorBoard

my_model = Sequential()
my_model.add(tf.keras.Input(shape=(28,)))
my_model.add(BatchNormalization())
my_model.add(Dense(500, activation='relu'))
my_model.add(BatchNormalization())
my_model.add(Dense(400, activation='relu'))
my_model.add(BatchNormalization())
my_model.add(Dense(300, activation='relu'))
my_model.add(BatchNormalization())
my_model.add(Dense(200, activation='relu'))
my_model.add(BatchNormalization())
my_model.add(Dense(100, activation='relu'))
my_model.add(BatchNormalization())
my_model.add(Dense(1, activation='sigmoid', dtype='float32'))  # output must be float32

train_size = int(0.8 * 1000)
val_size = int(0.2 * 1000)

full_dataset = packed_dataset.shuffle(buffer_size=1000)
train_dataset = full_dataset.take(train_size)
val_dataset = full_dataset.skip(val_size)

from tensorflow.keras.optimizers import Adam
opt = Adam()

my_model.compile(optimizer=opt, loss=tf.keras.losses.BinaryCrossentropy(), metrics=[
    'accuracy',
    tf.keras.metrics.FalseNegatives(name='false_negatives_1'),
    tf.keras.metrics.FalsePositives(name='false_positives_1'),
    tf.keras.metrics.AUC(name='auc_1'),
    tf.keras.metrics.Precision(name='precision_1'),
    tf.keras.metrics.Recall(name='recall_1')
])

import datetime
log_dir = "logs/fit/" + datetime.datetime.now().strftime("%Y%m%d-%H%M%S")
tb = TensorBoard(log_dir=log_dir, histogram_freq=1)


safety = EarlyStopping(patience=3, monitor='val_loss')

history = my_model.fit(
    packed_dataset.take(train_size),
    validation_data=packed_dataset.skip(train_size),
    epochs=1000,
    callbacks=[safety, tb]
)

print(history.history.keys())

plt.plot(history.history['accuracy'])
plt.plot(history.history['val_accuracy'])
plt.title('model accuracy')
plt.ylabel('accuracy')
plt.xlabel('epoch')
plt.legend(['train', 'test'], loc='upper left')
plt.show()

plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])
plt.title('model loss')
plt.ylabel('loss')
plt.xlabel('epoch')
plt.legend(['train', 'test'], loc='upper left')
plt.show()

# mount
from google.colab import drive
drive.mount('/content/drive')

# install pydrive2

# Authenticate PyDrive2
from pydrive2.auth import GoogleAuth
from pydrive2.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials

auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)

from tensorflow.keras import mixed_precision
mixed_precision.set_global_policy('mixed_float16')

# Enable Mixed Precision (for A100 GPU)
import tensorflow as tf

data_file = "/content/drive/MyDrive/all_train.csv.gz"

temp_data_set = tf.data.experimental.make_csv_dataset(
    data_file,
    compression_type='GZIP',
    batch_size=1000,
    num_epochs=1,
    label_name='# label',
    ignore_errors=True,
)

#load and Stream the GZIP Dataset from Drive
import tensorflow as tf

data_file = "/content/drive/MyDrive/all_train.csv.gz"

temp_data_set = tf.data.experimental.make_csv_dataset(
    data_file,
    compression_type='GZIP',
    batch_size=1000,
    num_epochs=1,
    label_name='# label',
    ignore_errors=True,
)

#  Pack Features into Tensor Format
def pack(features, label):
    return tf.stack(list(features.values()), axis=-1), tf.cast(label, tf.int32)

packed_dataset = temp_data_set.map(pack)

# Preview the Dataset (Optional Sanity Check)
for features, labels in packed_dataset.take(1):
    print(features.shape)  # should be (1000, 28)
    print(tf.reduce_min(labels), tf.reduce_max(labels))  # should be 0 and 1

train_size = 800
val_size = 200

full_dataset = packed_dataset.shuffle(buffer_size=1000)
train_dataset = full_dataset.take(train_size).repeat()
val_dataset = full_dataset.skip(train_size).repeat()

# Build the Neural Network Model
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, BatchNormalization

model = Sequential([
    tf.keras.Input(shape=(28,)),
    BatchNormalization(),
    Dense(500, activation='relu'),
    BatchNormalization(),
    Dense(400, activation='relu'),
    BatchNormalization(),
    Dense(300, activation='relu'),
    BatchNormalization(),
    Dense(200, activation='relu'),
    BatchNormalization(),
    Dense(100, activation='relu'),
    BatchNormalization(),
    Dense(1, activation='sigmoid', dtype='float32')  # force final output to float32
])

# compile
from tensorflow.keras.optimizers import Adam

model.compile(
    optimizer=Adam(),
    loss=tf.keras.losses.BinaryCrossentropy(),
    metrics=[
        'accuracy',
        tf.keras.metrics.FalseNegatives(name='false_negatives_1'),
        tf.keras.metrics.FalsePositives(name='false_positives_1'),
        tf.keras.metrics.AUC(name='auc_1'),
        tf.keras.metrics.Precision(name='precision_1'),
        tf.keras.metrics.Recall(name='recall_1')
    ]
)

# Set Up TensorBoard Logging and Early Stopping
from tensorflow.keras.callbacks import TensorBoard, EarlyStopping
import datetime

log_dir = "logs/fit/" + datetime.datetime.now().strftime("%Y%m%d-%H%M%S")
tb = TensorBoard(log_dir=log_dir, histogram_freq=1)
safety = EarlyStopping(patience=3, monitor='val_loss')

# # train the model
# history = model.fit(
#     train_dataset,
#     validation_data=val_dataset,
#     steps_per_epoch=800,
#     validation_steps=200,
#     epochs=1000,
#     callbacks=[safety, tb]
# )

# # faster - train the model

# history = model.fit(
#     train_dataset,
#     validation_data=val_dataset,
#     steps_per_epoch=800,
#     validation_steps=200,
#     epochs=1000,
#     callbacks=[safety, tb],
#     verbose=0  # fastest screen-wise
# )

# use cv not tts:
import pandas as pd
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import accuracy_score, precision_score, recall_score, roc_auc_score
import numpy as np

# Load full dataset once
df = pd.read_csv("/content/drive/MyDrive/all_train.csv.gz", compression='gzip')
# X = df.drop(columns=['# label', 'mass']).values #(comment out for use w cv - use for tts)

X = df.drop(columns=['# label']).values #(use with CV comment out for tts)

y = df['# label'].values

# K-Fold setup
kf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

all_metrics = []

for fold, (train_idx, val_idx) in enumerate(kf.split(X, y)):
    print(f"\nFOLD {fold + 1}")
    x_train, x_val = X[train_idx], X[val_idx]
    y_train, y_val = y[train_idx], y[val_idx]

    # Build a new model for each fold
    model = Sequential([
        tf.keras.Input(shape=(28,)),
        BatchNormalization(),
        Dense(500, activation='relu'),
        BatchNormalization(),
        Dense(400, activation='relu'),
        BatchNormalization(),
        Dense(300, activation='relu'),
        BatchNormalization(),
        Dense(200, activation='relu'),
        BatchNormalization(),
        Dense(100, activation='relu'),
        BatchNormalization(),
        Dense(1, activation='sigmoid', dtype='float32'),
    ])

    model.compile(
        optimizer=Adam(),
        loss='binary_crossentropy',
        metrics=['accuracy', tf.keras.metrics.AUC(name='auc'), tf.keras.metrics.Precision(), tf.keras.metrics.Recall()]
    )

    model.fit(x_train, y_train, epochs=3, batch_size=1000, verbose=0)

    # Evaluate
    y_pred = model.predict(x_val) > 0.5
    acc = accuracy_score(y_val, y_pred)
    prec = precision_score(y_val, y_pred)
    rec = recall_score(y_val, y_pred)
    auc = roc_auc_score(y_val, y_pred)
    all_metrics.append((acc, prec, rec, auc))
    print(f"ACC: {acc:.4f}  PREC: {prec:.4f}  REC: {rec:.4f}  AUC: {auc:.4f}")

# Fold metrics from output
fold_metrics = [
    (0.8841, 0.8737, 0.8980, 0.8841),
    (0.8835, 0.8615, 0.9140, 0.8835),
    (0.8836, 0.8762, 0.8935, 0.8836),
    (0.8841, 0.8698, 0.9036, 0.8841),
    (0.8834, 0.8630, 0.9116, 0.8834),
]

accs, precs, recs, aucs = zip(*fold_metrics)

print(f"Mean Accuracy:     {np.mean(accs):.4f}")
print(f"Mean Precision:    {np.mean(precs):.4f}")
print(f"Mean Recall:       {np.mean(recs):.4f}")
print(f"Mean AUC:          {np.mean(aucs):.4f}")

# Commented out IPython magic to ensure Python compatibility.
next
#     %load_ext tensorboard
# %tensorboard --logdir logs

#  Plot Training History (Accuracy + Loss)
import matplotlib.pyplot as plt

# Plot accuracy
plt.plot(history.history['accuracy'], label='Train Accuracy')
plt.plot(history.history['val_accuracy'], label='Validation Accuracy')
plt.title('Accuracy over Epochs')
plt.xlabel('Epoch')
plt.ylabel('Accuracy')
plt.legend()
plt.grid(True)
plt.show()

# Plot loss
plt.plot(history.history['loss'], label='Train Loss')
plt.plot(history.history['val_loss'], label='Validation Loss')
plt.title('Loss over Epochs')
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.legend()
plt.grid(True)
plt.show()

history = model.fit(
    train_dataset,
    validation_data=val_dataset,
    steps_per_epoch=800,
    validation_steps=200,
    epochs=3,
    callbacks=[tb]  # Skip early stopping so it finishes all 3 epochs
)

import matplotlib.pyplot as plt

# Accuracy
plt.plot(history.history['accuracy'], label='Train Accuracy')
plt.plot(history.history['val_accuracy'], label='Val Accuracy')
plt.title('Accuracy over Epochs')
plt.xlabel('Epoch')
plt.ylabel('Accuracy')
plt.legend()
plt.grid(True)
plt.show()

# Loss
plt.plot(history.history['loss'], label='Train Loss')
plt.plot(history.history['val_loss'], label='Val Loss')
plt.title('Loss over Epochs')
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.legend()
plt.grid(True)
plt.show()

# evaluate on validaion data:
results = model.evaluate(val_dataset, steps=200)
for name, value in zip(model.metrics_names, results):
    print(f"{name}: {value:.4f}")

# Save the model
model.save('/content/drive/MyDrive/saved_particle_model')

#  Export as a .zip file for download or transfer
import shutil
shutil.make_archive('/content/particle_model_export', 'zip', '/content/drive/MyDrive/saved_particle_model')

#  Generate classification metrics (for the case study)
import numpy as np
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score

# Get predictions from model on validation data
y_true = []
y_pred = []

for features, labels in val_dataset.take(200):
    preds = model.predict(features)
    y_true.extend(labels.numpy())
    y_pred.extend((preds > 0.5).astype(int).flatten())

# Metrics
print("Confusion Matrix:")
print(confusion_matrix(y_true, y_pred))

print("\nClassification Report:")
print(classification_report(y_true, y_pred, digits=4))

print("\nAUC-ROC:")
print(f"AUC: {roc_auc_score(y_true, y_pred):.4f}")

# Commented out IPython magic to ensure Python compatibility.

import matplotlib.pyplot as plt

# Save accuracy plot
plt.plot(history.history['accuracy'], label='Train Accuracy')
plt.plot(history.history['val_accuracy'], label='Val Accuracy')
plt.title('Accuracy over Epochs')
plt.xlabel('Epoch')
plt.ylabel('Accuracy')
plt.legend()
plt.grid(True)
plt.savefig('/content/accuracy_plot.png')
plt.close()

# Save loss plot
plt.plot(history.history['loss'], label='Train Loss')
plt.plot(history.history['val_loss'], label='Val Loss')
plt.title('Loss over Epochs')
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.legend()
plt.grid(True)
plt.savefig('/content/loss_plot.png')
plt.close()

```




---

## Appendix: Full Python Code

```{python}
# mount
from google.colab import drive
drive.mount('/content/drive')
```

```{python}
!ls "/content/drive/MyDrive/Colab Notebooks/texaschikkita-particle-detection.ipynb"
```

```{python}
!ls "/content/drive/MyDrive/Colab_Notebooks/texaschikkita-particle-detection.ipynb"
```

```{python}
# install pydrive2
!pip install -U -q PyDrive2
```

```{python}
# Authenticate PyDrive2
from pydrive2.auth import GoogleAuth
from pydrive2.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials

auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)
```

```{python}
from tensorflow.keras import mixed_precision
mixed_precision.set_global_policy('mixed_float16')
```

```{python}
# Enable Mixed Precision (for A100 GPU)
import tensorflow as tf

data_file = "/content/drive/MyDrive/all_train.csv.gz"

temp_data_set = tf.data.experimental.make_csv_dataset(
    data_file,
    compression_type='GZIP',
    batch_size=1000,
    num_epochs=1,
    label_name='# label',
    ignore_errors=True,
)
```

```{python}
#load and Stream the GZIP Dataset from Drive
import tensorflow as tf

data_file = "/content/drive/MyDrive/all_train.csv.gz"

temp_data_set = tf.data.experimental.make_csv_dataset(
    data_file,
    compression_type='GZIP',
    batch_size=1000,
    num_epochs=1,
    label_name='# label',
    ignore_errors=True,
)
```

```{python}
#  Pack Features into Tensor Format
def pack(features, label):
    return tf.stack(list(features.values()), axis=-1), tf.cast(label, tf.int32)

packed_dataset = temp_data_set.map(pack)
```

```{python}
# Preview the Dataset (Optional Sanity Check)
for features, labels in packed_dataset.take(1):
    print(features.shape)  # should be (1000, 28)
    print(tf.reduce_min(labels), tf.reduce_max(labels))  # should be 0 and 1
```

```{python}
train_size = 800
val_size = 200

full_dataset = packed_dataset.shuffle(buffer_size=1000)
train_dataset = full_dataset.take(train_size).repeat()
val_dataset = full_dataset.skip(train_size).repeat()
```

```{python}
# Build the Neural Network Model
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, BatchNormalization

model = Sequential([
    tf.keras.Input(shape=(28,)),
    BatchNormalization(),
    Dense(500, activation='relu'),
    BatchNormalization(),
    Dense(400, activation='relu'),
    BatchNormalization(),
    Dense(300, activation='relu'),
    BatchNormalization(),
    Dense(200, activation='relu'),
    BatchNormalization(),
    Dense(100, activation='relu'),
    BatchNormalization(),
    Dense(1, activation='sigmoid', dtype='float32')  # force final output to float32
])
```

```{python}
# compile
from tensorflow.keras.optimizers import Adam

model.compile(
    optimizer=Adam(),
    loss=tf.keras.losses.BinaryCrossentropy(),
    metrics=[
        'accuracy',
        tf.keras.metrics.FalseNegatives(name='false_negatives_1'),
        tf.keras.metrics.FalsePositives(name='false_positives_1'),
        tf.keras.metrics.AUC(name='auc_1'),
        tf.keras.metrics.Precision(name='precision_1'),
        tf.keras.metrics.Recall(name='recall_1')
    ]
)
```

```{python}
# Set Up TensorBoard Logging and Early Stopping
from tensorflow.keras.callbacks import TensorBoard, EarlyStopping
import datetime

log_dir = "logs/fit/" + datetime.datetime.now().strftime("%Y%m%d-%H%M%S")
tb = TensorBoard(log_dir=log_dir, histogram_freq=1)
safety = EarlyStopping(patience=3, monitor='val_loss')
```

```{python}
# # train the model
# history = model.fit(
#     train_dataset,
#     validation_data=val_dataset,
#     steps_per_epoch=800,
#     validation_steps=200,
#     epochs=1000,
#     callbacks=[safety, tb]
# )
```

```{python}
# # faster - train the model

# history = model.fit(
#     train_dataset,
#     validation_data=val_dataset,
#     steps_per_epoch=800,
#     validation_steps=200,
#     epochs=1000,
#     callbacks=[safety, tb],
#     verbose=0  # fastest screen-wise
# )
```

```{python}
# use cv not tts:
import pandas as pd
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import accuracy_score, precision_score, recall_score, roc_auc_score
import numpy as np

# Load full dataset once
df = pd.read_csv("/content/drive/MyDrive/all_train.csv.gz", compression='gzip')
# X = df.drop(columns=['# label', 'mass']).values #(comment out for use w cv - use for tts)

X = df.drop(columns=['# label']).values #(use with CV comment out for tts)

y = df['# label'].values

# K-Fold setup
kf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

all_metrics = []

for fold, (train_idx, val_idx) in enumerate(kf.split(X, y)):
    print(f"\nFOLD {fold + 1}")
    x_train, x_val = X[train_idx], X[val_idx]
    y_train, y_val = y[train_idx], y[val_idx]

    # Build a new model for each fold
    model = Sequential([
        tf.keras.Input(shape=(28,)),
        BatchNormalization(),
        Dense(500, activation='relu'),
        BatchNormalization(),
        Dense(400, activation='relu'),
        BatchNormalization(),
        Dense(300, activation='relu'),
        BatchNormalization(),
        Dense(200, activation='relu'),
        BatchNormalization(),
        Dense(100, activation='relu'),
        BatchNormalization(),
        Dense(1, activation='sigmoid', dtype='float32'),
    ])

    model.compile(
        optimizer=Adam(),
        loss='binary_crossentropy',
        metrics=['accuracy', tf.keras.metrics.AUC(name='auc'), tf.keras.metrics.Precision(), tf.keras.metrics.Recall()]
    )

    model.fit(x_train, y_train, epochs=3, batch_size=1000, verbose=0)

    # Evaluate
    y_pred = model.predict(x_val) > 0.5
    acc = accuracy_score(y_val, y_pred)
    prec = precision_score(y_val, y_pred)
    rec = recall_score(y_val, y_pred)
    auc = roc_auc_score(y_val, y_pred)
    all_metrics.append((acc, prec, rec, auc))
    print(f"ACC: {acc:.4f}  PREC: {prec:.4f}  REC: {rec:.4f}  AUC: {auc:.4f}")
```

```{python}
# Fold metrics from output
fold_metrics = [
    (0.8841, 0.8737, 0.8980, 0.8841),
    (0.8835, 0.8615, 0.9140, 0.8835),
    (0.8836, 0.8762, 0.8935, 0.8836),
    (0.8841, 0.8698, 0.9036, 0.8841),
    (0.8834, 0.8630, 0.9116, 0.8834),
]

accs, precs, recs, aucs = zip(*fold_metrics)

print(f"Mean Accuracy:     {np.mean(accs):.4f}")
print(f"Mean Precision:    {np.mean(precs):.4f}")
print(f"Mean Recall:       {np.mean(recs):.4f}")
print(f"Mean AUC:          {np.mean(aucs):.4f}")
```

```{python}
next
    %load_ext tensorboard
%tensorboard --logdir logs
```

```{python}
#  Plot Training History (Accuracy + Loss)
import matplotlib.pyplot as plt

# Plot accuracy
plt.plot(history.history['accuracy'], label='Train Accuracy')
plt.plot(history.history['val_accuracy'], label='Validation Accuracy')
plt.title('Accuracy over Epochs')
plt.xlabel('Epoch')
plt.ylabel('Accuracy')
plt.legend()
plt.grid(True)
plt.show()

# Plot loss
plt.plot(history.history['loss'], label='Train Loss')
plt.plot(history.history['val_loss'], label='Validation Loss')
plt.title('Loss over Epochs')
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.legend()
plt.grid(True)
plt.show()
```

```{python}
history = model.fit(
    train_dataset,
    validation_data=val_dataset,
    steps_per_epoch=800,
    validation_steps=200,
    epochs=3,
    callbacks=[tb]  # Skip early stopping so it finishes all 3 epochs
)
```

```{python}
import matplotlib.pyplot as plt

# Accuracy
plt.plot(history.history['accuracy'], label='Train Accuracy')
plt.plot(history.history['val_accuracy'], label='Val Accuracy')
plt.title('Accuracy over Epochs')
plt.xlabel('Epoch')
plt.ylabel('Accuracy')
plt.legend()
plt.grid(True)
plt.show()

# Loss
plt.plot(history.history['loss'], label='Train Loss')
plt.plot(history.history['val_loss'], label='Val Loss')
plt.title('Loss over Epochs')
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.legend()
plt.grid(True)
plt.show()
```

```{python}
# evaluate on validaion data:
results = model.evaluate(val_dataset, steps=200)
for name, value in zip(model.metrics_names, results):
    print(f"{name}: {value:.4f}")
```

```{python}
# Save the model
model.save('/content/drive/MyDrive/saved_particle_model')
```

```{python}
#  Export as a .zip file for download or transfer
import shutil
shutil.make_archive('/content/particle_model_export', 'zip', '/content/drive/MyDrive/saved_particle_model')
```

```{python}
#  Generate classification metrics (for the case study)
import numpy as np
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score

# Get predictions from model on validation data
y_true = []
y_pred = []

for features, labels in val_dataset.take(200):
    preds = model.predict(features)
    y_true.extend(labels.numpy())
    y_pred.extend((preds > 0.5).astype(int).flatten())

# Metrics
print("Confusion Matrix:")
print(confusion_matrix(y_true, y_pred))

print("\nClassification Report:")
print(classification_report(y_true, y_pred, digits=4))

print("\nAUC-ROC:")
print(f"AUC: {roc_auc_score(y_true, y_pred):.4f}")
```


