Spring 2025

Building Classifiers Using SciKit-Learn

The SciKit-Learn Package

  • One multi-purpose, general machine learning package for Python is SciKit-Learn (sklearn)
  • It contains a lot of general-purpose ML functions (e.g., splitting data)
  • It also has a lot of machine learning modeling tools built in
  • It relies on numpy and is relatively easy to install: pip install scikit-learn

Breast Cancer Dataset

  • There is a well-known (small) data set about tumor biopsy for diagnosing breast cancer
  • Class: malignant, benign
  • Thirty real-valued attributes, including measures for the average radius of a tumor, average area, etc.
  • SciKit-Learn gives it to us in Python:
import sklearn.datasets as skds
cancer = skds.load_breast_cancer()
print("Target Values:       ", cancer.target_names)
print("Shape of data:       ", cancer.data.shape)
print("Attribute Variables: ", cancer.feature_names)

Breast Cancer Data Summaries

## Target Values:        ['malignant' 'benign']
## Shape of data:        (569, 30)
## Attribute Variables:  ['mean radius' 'mean texture' 'mean perimeter' 'mean area'
##  'mean smoothness' 'mean compactness' 'mean concavity'
##  'mean concave points' 'mean symmetry' 'mean fractal dimension'
##  'radius error' 'texture error' 'perimeter error' 'area error'
##  'smoothness error' 'compactness error' 'concavity error'
##  'concave points error' 'symmetry error' 'fractal dimension error'
##  'worst radius' 'worst texture' 'worst perimeter' 'worst area'
##  'worst smoothness' 'worst compactness' 'worst concavity'
##  'worst concave points' 'worst symmetry' 'worst fractal dimension']

Splitting Up the Data

  • Recall that we’d like to divide our data into training and testing sets:
  • Each set will have both the input data (X) and the target data (y)
  • So there will be four data sets
  • SciKit-Learn gives us a way to randomly assign these:
from sklearn.model_selection import train_test_split
trainX, testX, trainY, testY = train_test_split(cancer.data, cancer.target, test_size=0.4, random_state=1)
print("Training data shape:    ", trainX.shape)
print("Training target shape:  ", trainY.shape)
print("Testing data shape:     ", testX.shape)
print("Testing target shape:   ", testY.shape)
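To make the random assignment concrete, here is a minimal pure-Python sketch of the idea behind a train/test split: shuffle the row indices, then cut them at the requested fraction. The helper name simple_split and the toy data are made up for illustration; this is not SciKit-Learn's implementation, which also handles stratification and other input types.

```python
import math
import random

def simple_split(X, y, test_size=0.4, seed=1):
    """Hypothetical helper: shuffle row indices, then cut them into train/test."""
    indices = list(range(len(X)))
    random.Random(seed).shuffle(indices)          # reproducible shuffle
    n_test = math.ceil(test_size * len(X))        # round the test share up
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    trainX = [X[i] for i in train_idx]
    testX  = [X[i] for i in test_idx]
    trainY = [y[i] for i in train_idx]
    testY  = [y[i] for i in test_idx]
    return trainX, testX, trainY, testY

# Toy data: 10 rows with 2 attributes each, and a 0/1 target per row
X = [[i, 10 * i] for i in range(10)]
y = [i % 2 for i in range(10)]

trainX, testX, trainY, testY = simple_split(X, y)
print(len(trainX), len(testX), len(trainY), len(testY))   # 6 4 6 4
```

Each row stays paired with its own target because the same shuffled indices select from both X and y. With the 569-row cancer data and test_size=0.4, rounding the test share up in the same way should give 228 testing rows and 341 training rows.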

Classifying The Data With Support Vector Machines

Support vector machines are machine learning models that try to find the optimal decision surface between positive and negative points. This one is a linear surface:

import sklearn.svm as svm
import sklearn.metrics as metrics

# Build then fit the model
model = svm.SVC(kernel='linear', C=1000)
fit = model.fit(trainX, trainY)

# Predict with the model
model.predict(testX)

# Evaluate the model:
metrics.accuracy_score(testY, model.predict(testX))   # Arguments are (true labels, predictions)
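The accuracy score used to evaluate these models is just the fraction of test cases where the prediction matches the true label. A minimal sketch of that calculation, using made-up labels rather than real model output:

```python
def accuracy(true_labels, predicted_labels):
    """Fraction of predictions that match the true labels."""
    matches = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return matches / len(true_labels)

# Made-up labels for illustration: 0 = malignant, 1 = benign
true_labels      = [0, 1, 1, 0, 1, 1, 0, 1]
predicted_labels = [0, 1, 0, 0, 1, 1, 1, 1]

print(accuracy(true_labels, predicted_labels))   # 0.75 (6 of 8 correct)
```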

Classifying The Data With Naive Bayes

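Naive Bayes classifiers apply Bayes' rule under the simplifying assumption that the attributes are independent of one another given the class. The Gaussian variant models each real-valued attribute with a normal distribution per class: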

import sklearn.naive_bayes as nb

# Build then fit the model
model = nb.GaussianNB()
fit = model.fit(trainX, trainY)

# Predict with the model
model.predict(testX)

# Evaluate the model:
metrics.accuracy_score(testY, model.predict(testX))   # Arguments are (true labels, predictions)
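For intuition about what a Gaussian Naive Bayes model does, here is a small pure-Python sketch: estimate a prior, a mean, and a variance per attribute for each class, then pick the class with the highest log-prior plus summed log-likelihoods. The helper names, the toy data, and the 1e-9 variance floor are assumptions for illustration; this is not how SciKit-Learn implements GaussianNB internally.

```python
import math

def fit_gaussian_nb(X, y):
    """Hypothetical helper: per-class prior, attribute means, and attribute variances."""
    model = {}
    for label in set(y):
        rows = [row for row, lab in zip(X, y) if lab == label]
        means = [sum(col) / len(rows) for col in zip(*rows)]
        variances = [sum((v - m) ** 2 for v in col) / len(rows) + 1e-9
                     for col, m in zip(zip(*rows), means)]
        model[label] = (len(rows) / len(X), means, variances)
    return model

def predict_gaussian_nb(model, row):
    """Pick the class with the highest log-prior plus summed log-likelihoods."""
    def score(label):
        prior, means, variances = model[label]
        log_likelihood = sum(
            -0.5 * math.log(2 * math.pi * var) - (v - m) ** 2 / (2 * var)
            for v, m, var in zip(row, means, variances))
        return math.log(prior) + log_likelihood
    return max(model, key=score)

# Made-up toy data: [radius, texture] per tumor; 0 = malignant, 1 = benign
X = [[20.0, 25.0], [18.5, 22.0], [19.0, 27.0],
     [12.0, 15.0], [11.0, 17.0], [13.5, 14.0]]
y = [0, 0, 0, 1, 1, 1]

model = fit_gaussian_nb(X, y)
print(predict_gaussian_nb(model, [19.5, 24.0]))   # 0 (close to the malignant examples)
print(predict_gaussian_nb(model, [12.5, 16.0]))   # 1 (close to the benign examples)
```

Summing log-likelihoods across attributes is where the "naive" independence assumption shows up.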

Classifying The Data With Decision Trees

Decision trees use information theory to build a tree that classifies by testing the most informative variable values first.

import sklearn.tree as dt

# Build then fit the model
model = dt.DecisionTreeClassifier()
fit = model.fit(trainX, trainY)

# Predict with the model
model.predict(testX)

# Evaluate the model:
metrics.accuracy_score(testY, model.predict(testX))   # Arguments are (true labels, predictions)
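To illustrate what "most informative" means for a decision tree, the sketch below computes entropy and information gain for two candidate splits on a made-up attribute. The data and thresholds are invented for illustration. Note that DecisionTreeClassifier uses Gini impurity by default; passing criterion="entropy" selects the information-theoretic measure shown here.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

def information_gain(values, labels, threshold):
    """Entropy reduction from splitting the rows on values <= threshold."""
    left  = [lab for v, lab in zip(values, labels) if v <= threshold]
    right = [lab for v, lab in zip(values, labels) if v > threshold]
    weighted = (len(left) * entropy(left) + len(right) * entropy(right)) / len(labels)
    return entropy(labels) - weighted

# Made-up toy data: one attribute value and one class label per tumor
radius = [8.0, 9.5, 11.0, 12.5, 15.0, 17.5, 19.0, 21.0]
labels = ["benign"] * 4 + ["malignant"] * 4

print(information_gain(radius, labels, threshold=13.0))  # 1.0 (perfect split)
print(information_gain(radius, labels, threshold=10.0))  # about 0.31
```

A tree builder would prefer the split at 13.0 because it removes all uncertainty about the class, while the split at 10.0 leaves one side still mixed.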

Using TensorFlow for Neural Networks

Set Up TensorFlow on Hopper

  • SciKit-Learn doesn’t have good neural network modeling tools
  • For neural network models, we’ll use TensorFlow2
  • TensorFlow2 can be difficult to set up, but Hopper has most of what you need
  • I have a pre-built venv you can use; activate it before you start Python:
source /data/shared-venvs/tensorflow-standard/bin/activate

Revisiting / Reshaping the Breast Cancer Data

  • Let’s send our 30 variables into a dense, feed-forward neural network
  • We’ll want the target values as “one-hot”: A distribution over possible target values
  • So the target value “malignant” becomes [1,0], and “benign” becomes [0,1]
import tensorflow as tf   # Needed for tf.keras below
import sklearn.datasets as skds
from sklearn.model_selection import train_test_split
cancer = skds.load_breast_cancer()
trainX, testX, trainY, testY = train_test_split(cancer.data, cancer.target, test_size=0.4, random_state=1)
trainY = tf.keras.utils.to_categorical(trainY)  # Turn the targets into one-hot representation
testY  = tf.keras.utils.to_categorical(testY)   # Turn the targets into one-hot representation
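As a sketch of what the one-hot conversion produces, the pure-Python helper below turns integer class labels into one-hot rows. The name to_one_hot is hypothetical and stands in for the Keras utility only for illustration; it is not the Keras implementation.

```python
def to_one_hot(labels, num_classes):
    """Hypothetical helper: turn integer class labels into one-hot rows."""
    rows = []
    for label in labels:
        row = [0.0] * num_classes
        row[label] = 1.0            # put all the probability on the true class
        rows.append(row)
    return rows

# In the cancer data, 0 means "malignant" and 1 means "benign"
print(to_one_hot([0, 1, 1, 0], num_classes=2))
# [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0], [1.0, 0.0]]
```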

Building a Neural Network for the Cancer Data

  • Neural network architectures are constructed in layers, starting with the input layer
  • We’ll use three dense layers, each of size 50 nodes
  • The output layer will have 2 different outputs (one for “malignant”, one for “benign”)
  • We interpret that result as a probability distribution over those two possibilities using softmax:
model = tf.keras.models.Sequential()
model.add( tf.keras.layers.Input(shape=(30,)) )
model.add( tf.keras.layers.Dense(50, activation="relu") )
model.add( tf.keras.layers.Dense(50, activation="relu") )
model.add( tf.keras.layers.Dense(50, activation="relu") )
model.add( tf.keras.layers.Dense(2, activation="softmax") )
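To show how softmax turns the two raw output values into a probability distribution, here is a minimal pure-Python sketch. The input scores are made up; in the network they would come from the final Dense layer.

```python
import math

def softmax(scores):
    """Turn a list of raw scores into probabilities that sum to 1."""
    biggest = max(scores)                           # subtract the max for numerical stability
    exps = [math.exp(s - biggest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Made-up raw outputs for the two output nodes (malignant, benign)
probs = softmax([2.0, 0.0])
print(probs)   # roughly [0.88, 0.12]
```

Larger scores get larger probabilities, and equal scores split the probability evenly.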

Fitting the Model

  • Neural networks are learned using something called gradient descent (here: Adam)
  • That is, they climb down a surface called a loss function
  • We’re learning probability distributions, so we’ll use cross-entropy loss, which punishes confidently wrong answers the most
  • But humans don’t understand cross-entropy values well, so we’ll report performance in terms of accuracy
  • TF/Keras models have to be compiled (for potential GPU deployment), then fit:
model.compile(optimizer="adam", loss="BinaryCrossentropy", metrics=["accuracy"])
trainingHistory = model.fit(trainX, trainY, epochs=50)    # Learn the model
model.evaluate(testX, testY)                              # Evaluate test-set performance
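To see why cross-entropy punishes confidently wrong answers the most, the sketch below computes the loss for one made-up case whose true class is "malignant", under four different predicted distributions. It is a hand calculation for illustration, not the Keras loss implementation.

```python
import math

def cross_entropy(target, predicted):
    """Cross-entropy between a one-hot target and a predicted distribution."""
    return -sum(t * math.log(p) for t, p in zip(target, predicted) if t > 0)

target = [1.0, 0.0]     # the true class is "malignant"

for predicted in ([0.99, 0.01], [0.6, 0.4], [0.4, 0.6], [0.01, 0.99]):
    print(predicted, round(cross_entropy(target, predicted), 3))
# [0.99, 0.01] 0.01
# [0.6, 0.4] 0.511
# [0.4, 0.6] 0.916
# [0.01, 0.99] 4.605
```

A mildly wrong answer costs less than 1, while a confidently wrong answer costs more than 4.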

The MNIST Dataset

  • TensorFlow/Keras have the MNIST dataset
  • Set of 28x28 pixel images of hand-written numeric digits
  • Class Values: 0, …, 9
  • Attributes: 28x28 greyscale images
import tensorflow as tf # Ignore the warnings it will spew

# Get the MNIST data, convert them to float and scale the attribute data to be between 0 and 1
(trainX, trainY), (testX, testY) = tf.keras.datasets.mnist.load_data()
trainX = trainX.astype('float32') / 255.0
testX = testX.astype('float32') / 255.0

Reshaping the Data

  • Let’s use convolutional neural networks, a common image-analysis technique
  • We’ll need to convert the images to tensors – cubes of data
  • Also, we’ll want the target values as “one-hot”: A distribution over possible target values
  • So the target value 2 becomes [0, 0, 1, 0, 0, 0, 0, 0, 0, 0], etc.
trainX = trainX.reshape( (-1, 28, 28, 1))  # Make each image a 28x28x1 cube
testX  = testX.reshape( (-1, 28, 28, 1))   # Make each image a 28x28x1 cube
trainY = tf.keras.utils.to_categorical(trainY)  # Turn the targets into one-hot representation
testY  = tf.keras.utils.to_categorical(testY)   # Turn the targets into one-hot representation
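The scaling and reshaping steps can be tried without downloading MNIST. The sketch below uses a small made-up array as a stand-in for the images (an assumption for illustration) and shows how the shape changes.

```python
import numpy as np

# Stand-in for MNIST: 5 made-up "images", each 28x28 pixels with values 0..255
images = np.random.default_rng(0).integers(0, 256, size=(5, 28, 28))

scaled = images.astype("float32") / 255.0      # scale pixel values into [0, 1]
cubes = scaled.reshape((-1, 28, 28, 1))        # -1 lets numpy infer the image count

print(images.shape)   # (5, 28, 28)
print(cubes.shape)    # (5, 28, 28, 1)
```

The extra trailing dimension is the channel axis: greyscale images have one channel, while color images would typically have three.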

Building the Model

  • Start with the input layer taking 28x28x1 tensors
  • Use a layer to extract image features with 70 different convolutional filters of size 3x3
  • Then down-sample the resulting images by half in width and height using something called Max Pooling
  • After that, we flatten it and turn it into a traditional neural network
  • The output layer will have 10 different outputs (one for each digit), and we’ll turn the result into a probability distribution using softmax:
model = tf.keras.models.Sequential()
model.add( tf.keras.layers.Input(shape=(28,28,1)) )
model.add( tf.keras.layers.Conv2D(70, (3,3), activation="relu") )
model.add( tf.keras.layers.MaxPooling2D((2, 2)) )
model.add( tf.keras.layers.Flatten() )
model.add( tf.keras.layers.Dense(70, activation="relu") )
model.add( tf.keras.layers.Dense(10, activation="softmax") )
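It can help to trace how the data shape changes through these layers. The arithmetic sketch below assumes the Keras defaults for the layers above (no padding and stride 1 for Conv2D, and a pooling stride equal to the pool size) and counts the learnable parameters by hand.

```python
# Input: 28x28 image with 1 channel
height, width, channels = 28, 28, 1

# Conv2D(70, (3,3)) without padding trims (kernel - 1) pixels from each dimension
filters, kernel = 70, 3
height, width = height - (kernel - 1), width - (kernel - 1)
conv_params = (kernel * kernel * channels + 1) * filters    # +1 is each filter's bias
channels = filters
print("After Conv2D:      ", (height, width, channels))     # (26, 26, 70)

# MaxPooling2D((2,2)) halves the width and height; it has no weights to learn
height, width = height // 2, width // 2
print("After MaxPooling2D:", (height, width, channels))     # (13, 13, 70)

# Flatten turns the cube into one long vector
flat = height * width * channels
print("After Flatten:     ", flat)                          # 11830

# Dense layers: one weight per (input, output) pair plus one bias per output
dense_params = (flat + 1) * 70
output_params = (70 + 1) * 10
total_params = conv_params + dense_params + output_params
print("Total parameters:  ", total_params)                  # 829580
```

Calling model.summary() on the network should report matching shapes and parameter counts; nearly all of the parameters sit in the first Dense layer after the Flatten.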

Fitting the Model

  • Neural networks are learned using something called gradient descent (here: Adam)
  • That is, they climb down a surface called a loss function
  • We’re learning probability distributions, so we’ll use cross-entropy loss, which punishes confidently wrong answers the most
  • But humans don’t understand cross-entropy values well, so we’ll report performance in terms of accuracy
  • TF/Keras models have to be compiled (for potential GPU deployment), then fit:
model.compile(optimizer="adam", loss="CategoricalCrossentropy", metrics=["accuracy"])
trainingHistory = model.fit(trainX, trainY, epochs=10)    # Learn the model
model.evaluate(testX, testY)                              # Evaluate test-set performance
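The idea of climbing down a loss surface can be sketched with plain gradient descent on a made-up one-parameter loss. This is a simplification for illustration: Adam additionally adapts the step size for each parameter using running averages of past gradients, and a real network has many parameters rather than one.

```python
def loss(w):
    """A made-up one-parameter loss surface with its minimum at w = 3."""
    return (w - 3.0) ** 2

def gradient(w):
    """Slope of the loss at w (derivative of (w - 3)^2)."""
    return 2.0 * (w - 3.0)

w = 0.0                 # starting guess
learning_rate = 0.1     # step size (an arbitrary choice for this sketch)

for step in range(50):
    w = w - learning_rate * gradient(w)    # step downhill along the slope

print(round(w, 4), round(loss(w), 6))      # 3.0 0.0
```

Each pass moves the parameter a little in the direction that lowers the loss, which is what each training step does for every weight in the network.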

Machine Learning in Julia

Machine Learning Packages for Julia

  • MLJ – A general framework for machine learning
  • ScikitLearn – A wrapper around the Python sklearn package
  • Flux – A native Julia Neural Network library
  • TensorFlow – A wrapper around the Python TensorFlow package

More Info: https://juliapackages.com/c/machine-learning

MLJ

  • The MLJ package is a framework of many libraries for machine learning
  • It also comes with some built-in data
  • And convenient ways to split the data
using MLJ
using DataFrames

iris = DataFrames.DataFrame(load_iris());
y, X = unpack(iris, ==(:target); rng=123);   # Strip the target column off
(trainX, testX), (trainY, testY) = MLJ.partition( (X,y), 0.75, multi=true);  # Test/train split
models(matching(X,y))  # Show all models that might be applied to this data

More Info: https://juliaai.github.io/MLJ.jl/dev/getting_started/#Getting-Started

Loading MLJ Libraries

  • To load a specific library from MLJ:
    • Add the specific MLJLIBInterface package
    • Use the @load macro to load a specific component from that library
import Pkg; Pkg.add("MLJLIBSVMInterface")  # Just need to do this once for anything in LIBSVM
SVC = @load SVC pkg=LIBSVM     # To get access to the SVM classifier modeling tool
import LIBSVM # To have access to LIBSVM mechanisms, such as different kernels, etc

Building & Evaluating an SVM Classifier

model = SVC(kernel=LIBSVM.Kernel.Polynomial)        # Create a specific SVM model
fit_model = machine(model, trainX, trainY) |> fit!  # Using training data to learn
predict(fit_model, testX)                           # Predict classes for test set
evaluate(model, testX, testY; resampling=CV(nfolds=2, rng=888), measure=[accuracy])

Building & Evaluating a Decision Tree Classifier

DecisionTreeClassifier = @load DecisionTreeClassifier pkg=DecisionTree
model = DecisionTreeClassifier()                    # Create a decision tree model
fit_model = machine(model, trainX, trainY) |> fit!  # Using training data to learn
predict(fit_model, testX)                           # Predict class probabilities (predict_mode gives labels)
evaluate(model, testX, testY; resampling=CV(nfolds=2, rng=888), measure=[accuracy])

Using Flux for Deep Learning

  • You’ll need to install a few packages: Flux, Statistics
  • You might want these too: ProgressMeter, CUDA
using Flux, Statistics
using ProgressMeter, CUDA  # optional
device = gpu_device()      # If using the GPU

# Make some fake data:
trainX = rand(Float32, 2, 500);                                     # 2×500 Matrix
trainY = [xor(col[1]>0.5, col[2]>0.5) for col in eachcol(trainX)];  # XOR t/f for each of those
testX = rand(Float32, 2, 250);                                      # 2×250 Matrix
testY = [xor(col[1]>0.5, col[2]>0.5) for col in eachcol(testX)];    # XOR t/f for each of those

More Info: https://fluxml.ai/Flux.jl/stable/guide/models/quickstart/

Set Up a Neural Network Model Using Flux

# Set up model: Two-layer MLP:  2 inputs, 3 hidden nodes, 2 outputs
model = Chain( Dense(2 => 3, tanh),   # 2 inputs, 3 hidden nodes, hyperbolic tangent activation
               BatchNorm(3),          # Normalize hidden activations to keep training stable
               Dense(3 => 2)) |> device  # Output layer, 2D signal ... send model to the GPU

# Set up data for Flux's learning system
trainYoh = Flux.onehotbatch(trainY, [true, false]);  # Turn true/false into one-hot rep

# Create the batch loader & the optimizer for learning
loader = Flux.DataLoader( (trainX, trainYoh), batchsize=64, shuffle=true);
optimizer = Flux.setup(Flux.Adam(0.01), model);

ANN Learning

losses = []  # Store loss values as you learn
@showprogress for epoch in 1:1_000      # Omit @showprogress if not using ProgressMeter
  for batch_sample in loader
    # Grab samples from the batch, move them to the GPU
    x, y = batch_sample |> device
    loss, gradients = Flux.withgradient(model) do midstep_model
      y_hat = midstep_model(x)          # Apply the current model to x
      Flux.logitcrossentropy(y_hat, y)  # Collect crossentropy loss over the batch
    end
    Flux.update!( optimizer,  model, gradients[1]) # Use optimizer to adjust wts
    push!(losses, loss)  # Record this batch's loss
  end
end

Evaluation of the Model

# How accurate is the model on training data?
trainOut = model(trainX |> device);   # Training data to GPU, apply model, get output
trainProb = softmax(trainOut) |> cpu; # Convert output to probabilities, put back on CPU
mean( (trainProb[1,:] .> 0.5) .== trainY )  # Compute accuracy of prediction

# How accurate is the model on testing data?
testOut = model(testX |> device);    # Testing data to GPU, apply model, get output
testProb = softmax(testOut) |> cpu;  # Convert output to probabilities, put back on CPU
mean( (testProb[1,:] .> 0.5) .== testY )  # Compute accuracy of prediction