Analysis, Part 2

Spring 2025

Building Classifiers Using SciKit-Learn

The SciKit-Learn Package

One multi-purpose, general machine learning package for Python is SciKit-Learn (sklearn)
It contains a lot of general-purpose ML functions (e.g., splitting data)
It also has a lot of machine learning modeling tools built in
It relies on numpy and is relatively easy to install: pip install scikit-learn

Breast Cancer Dataset

There is a well-known (small) dataset of tumor biopsy measurements for diagnosing breast cancer
Class: malignant, benign
Thirty real-valued attributes, including measures for the average radius of a tumor, average area, etc.
SciKit-Learn gives it to us in Python:

import sklearn.datasets as skds
cancer = skds.load_breast_cancer()
print("Target Values:       ", cancer.target_names)
print("Shape of data:       ", cancer.data.shape)
print("Attribute Variables: ", cancer.feature_names)

Breast Cancer Data Summaries

## Target Values:        ['malignant' 'benign']

## Shape of data:        (569, 30)

## Attribute Variables:  ['mean radius' 'mean texture' 'mean perimeter' 'mean area'
##  'mean smoothness' 'mean compactness' 'mean concavity'
##  'mean concave points' 'mean symmetry' 'mean fractal dimension'
##  'radius error' 'texture error' 'perimeter error' 'area error'
##  'smoothness error' 'compactness error' 'concavity error'
##  'concave points error' 'symmetry error' 'fractal dimension error'
##  'worst radius' 'worst texture' 'worst perimeter' 'worst area'
##  'worst smoothness' 'worst compactness' 'worst concavity'
##  'worst concave points' 'worst symmetry' 'worst fractal dimension']

Splitting Up The Data

Recall that we’d like to divide our data into training and testing:
Each set will have both the input data (X) and the target data (y)
So there will be four data sets
SciKit-Learn gives us a way to randomly assign these:

from sklearn.model_selection import train_test_split
trainX, testX, trainY, testY = train_test_split(cancer.data, cancer.target, test_size=0.4, random_state=1)
print("Training data shape:    ", trainX.shape)
print("Training target shape:  ", trainY.shape)
print("Testing data shape:     ", testX.shape)
print("Testing target shape:   ", testY.shape)
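To make the idea of a random split concrete, here is a minimal sketch using only the Python standard library. The helper name simple_split and the toy data are hypothetical; this is not how SciKit-Learn implements train_test_split, only an illustration of the same idea (shuffle the row indices, then cut them into two parts).

```python
import math
import random

def simple_split(rows, labels, test_size=0.4, seed=1):
    """Shuffle row indices, then cut them into a training part and a testing part."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)          # reproducible shuffle
    n_test = math.ceil(test_size * len(rows))     # number of rows held out for testing
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    train_x = [rows[i] for i in train_idx]
    test_x = [rows[i] for i in test_idx]
    train_y = [labels[i] for i in train_idx]
    test_y = [labels[i] for i in test_idx]
    return train_x, test_x, train_y, test_y

# Toy stand-in for the data: 10 rows with one attribute each
rows = [[float(i)] for i in range(10)]
labels = [i % 2 for i in range(10)]
train_x, test_x, train_y, test_y = simple_split(rows, labels)
print(len(train_x), len(test_x))   # 6 training rows, 4 testing rows
```

Each input row stays paired with its target because the same shuffled indices are used for both.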

Classifying The Data With Support Vector Machines

Support vector machines are machine learning models that try to find the optimal decision surface between positive and negative points. This one is a linear surface:

import sklearn.svm as svm
import sklearn.metrics as metrics

# Build then fit the model
model = svm.SVC(kernel='linear', C=1000)
fit = model.fit(trainX, trainY)

# Predict with the model
model.predict(testX)

# Evaluate the model:
metrics.accuracy_score(testY, model.predict(testX))  # Arguments: true labels, then predictions
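As a rough illustration of what a linear decision surface and an accuracy score mean, here is a small pure-Python sketch. The weights, bias, and points are made up for illustration; a fitted SVM would learn its own values from the training data, and the function names are hypothetical.

```python
def linear_decision(weights, bias, x):
    """Return class 1 if the point lies on the positive side of the surface, else 0."""
    score = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1 if score > 0 else 0

def accuracy(true_labels, predicted_labels):
    """Fraction of predictions that match the true labels."""
    matches = sum(1 for t, p in zip(true_labels, predicted_labels) if t == p)
    return matches / len(true_labels)

# Hypothetical weights and bias; a fitted SVM would learn these from training data
weights, bias = [1.0, -1.0], 0.0
points = [[2.0, 1.0], [0.5, 3.0], [4.0, 0.0], [1.0, 2.0]]
true_labels = [1, 0, 1, 1]

predicted = [linear_decision(weights, bias, p) for p in points]
print(predicted)                          # [1, 0, 1, 0]
print(accuracy(true_labels, predicted))   # 0.75
```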

Classifying The Data With Naive Bayes

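Naive Bayes classifiers apply Bayes' rule while treating the attributes as independent given the class; GaussianNB models each attribute with a per-class normal distribution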

import sklearn.naive_bayes as nb

# Build then fit the model
model = nb.GaussianNB()
fit = model.fit(trainX, trainY)

# Predict with the model
model.predict(testX)

# Evaluate the model:
metrics.accuracy_score(testY, model.predict(testX))  # Arguments: true labels, then predictions
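The following is a minimal sketch of the idea behind a Gaussian naive Bayes classifier, written in plain Python with toy data. It is not SciKit-Learn's implementation; the function names are hypothetical, and the small constant added to the variance is an assumption made here to avoid dividing by zero.

```python
import math
from collections import defaultdict

def fit_gaussian_nb(rows, labels):
    """Estimate a prior plus a per-attribute mean and variance for each class."""
    grouped = defaultdict(list)
    for row, label in zip(rows, labels):
        grouped[label].append(row)
    params = {}
    for label, members in grouped.items():
        columns = list(zip(*members))
        means = [sum(c) / len(c) for c in columns]
        # Small constant keeps the variance from being exactly zero
        variances = [sum((v - m) ** 2 for v in c) / len(c) + 1e-9
                     for c, m in zip(columns, means)]
        params[label] = (len(members) / len(rows), means, variances)
    return params

def predict_gaussian_nb(params, row):
    """Pick the class with the largest log prior plus summed log likelihoods."""
    def log_score(prior, means, variances):
        total = math.log(prior)
        for v, m, var in zip(row, means, variances):
            total += -0.5 * math.log(2 * math.pi * var) - (v - m) ** 2 / (2 * var)
        return total
    return max(params, key=lambda label: log_score(*params[label]))

# Toy data: class 0 clusters near (1, 1), class 1 clusters near (5, 5)
rows = [[1.0, 1.2], [0.8, 1.0], [1.2, 0.9], [5.0, 5.1], [4.8, 5.3], [5.2, 4.9]]
labels = [0, 0, 0, 1, 1, 1]
params = fit_gaussian_nb(rows, labels)
print(predict_gaussian_nb(params, [1.1, 1.0]))   # 0
print(predict_gaussian_nb(params, [4.9, 5.0]))   # 1
```

Summing the per-attribute log likelihoods is where the "naive" independence assumption shows up.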

Classifying The Data With Decision Trees

Decision Trees use information theory to build a tree that decides how to classify based on the most informative variable values

import sklearn.tree as dt

# Build then fit the model
model = dt.DecisionTreeClassifier()
fit = model.fit(trainX, trainY)

# Predict with the model
model.predict(testX)

# Evaluate the model:
metrics.accuracy_score(testY, model.predict(testX))  # Arguments: true labels, then predictions
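To show what "most informative" can mean, here is a small sketch of entropy and information gain in plain Python. The attribute values and labels are made up, and the function names are hypothetical. Note that DecisionTreeClassifier uses Gini impurity by default; passing criterion="entropy" selects the information-theoretic measure sketched here.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def information_gain(values, labels, threshold):
    """Entropy reduction from splitting on `value <= threshold`."""
    left = [y for v, y in zip(values, labels) if v <= threshold]
    right = [y for v, y in zip(values, labels) if v > threshold]
    weighted = (len(left) * entropy(left) + len(right) * entropy(right)) / len(labels)
    return entropy(labels) - weighted

# Toy attribute (think "mean radius") with made-up values and class labels
radius = [10.0, 11.0, 12.0, 18.0, 19.0, 20.0]
labels = ["benign", "benign", "benign", "malignant", "malignant", "malignant"]

print(entropy(labels))                         # 1.0 bit for a 50/50 mix
print(information_gain(radius, labels, 15.0))  # 1.0: the split separates the classes
print(information_gain(radius, labels, 10.5))  # smaller gain for a worse split
```

A tree builder would try many candidate thresholds across all attributes and keep the split with the largest gain.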

Using TensorFlow for Neural Networks

Setting Up TensorFlow on Hopper

SciKit-Learn doesn’t have good neural network modeling tools
For neural network models, we’ll use TensorFlow 2
TensorFlow 2 can be difficult to set up, but Hopper has most of what you need
I have a pre-built venv you can use; activate it before you start Python:

source /data/shared-venvs/tensorflow-standard/bin/activate

Revisiting / Reshaping the Breast Cancer Data

Let’s send our 30 variables into a dense, feed-forward neural network
We’ll want the target values as “one-hot”: A distribution over possible target values
So the target value “malignant” becomes [1,0], and “benign” becomes [0,1]

import tensorflow as tf  # Needed for tf.keras below; ignore the warnings it will spew
import sklearn.datasets as skds
from sklearn.model_selection import train_test_split
cancer = skds.load_breast_cancer()
trainX, testX, trainY, testY = train_test_split(cancer.data, cancer.target, test_size=0.4, random_state=1)
trainY = tf.keras.utils.to_categorical(trainY)  # Turn the targets into one-hot representation
testY  = tf.keras.utils.to_categorical(testY)   # Turn the targets into one-hot representation
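For intuition, here is a minimal pure-Python sketch of what a one-hot conversion does. The helper name to_one_hot is hypothetical; it only mimics the idea behind to_categorical and is not the Keras implementation.

```python
def to_one_hot(labels, num_classes):
    """Turn integer class labels into one-hot rows."""
    rows = []
    for label in labels:
        row = [0.0] * num_classes
        row[label] = 1.0          # put all probability mass on the true class
        rows.append(row)
    return rows

# In this dataset, 0 encodes "malignant" and 1 encodes "benign"
print(to_one_hot([0, 1, 1], num_classes=2))
# [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
```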

Building a Neural Network for the Cancer Data

Neural network architectures are constructed in layers, starting with the input layer
We’ll use three dense layers, each with 50 nodes
The output layer will have 2 different outputs (one for “malignant”, one for “benign”)
We interpret that result as a probability distribution over those two possibilities using softmax:

model = tf.keras.models.Sequential()
model.add( tf.keras.layers.Input(shape=(30,)) )
model.add( tf.keras.layers.Dense(50, activation="relu") )
model.add( tf.keras.layers.Dense(50, activation="relu") )
model.add( tf.keras.layers.Dense(50, activation="relu") )
model.add( tf.keras.layers.Dense(2, activation="softmax") )
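As a small illustration of how softmax turns the two raw output values into a probability distribution, here is a sketch in plain Python. The input scores are hypothetical; in the real network they come from the last dense layer.

```python
import math

def softmax(scores):
    """Convert raw output scores into a probability distribution."""
    shift = max(scores)                       # subtracting the max avoids overflow
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical raw scores for ("malignant", "benign")
probs = softmax([2.0, 0.0])
print(probs)        # roughly [0.88, 0.12]
print(sum(probs))   # 1.0 (up to floating-point rounding)
```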

Fitting the Model

Neural networks are trained using gradient descent (here: the Adam variant)
That is, they climb down a surface called a loss function
We’re learning probability distributions, so we’ll use cross entropy loss, which punishes confidently wrong answers the most
But humans don’t understand cross entropy values well, so we’ll report performance in terms of accuracy
TF/Keras models have to be compiled (for potential GPU deployment), then fit:

model.compile(optimizer="adam", loss="BinaryCrossentropy", metrics=["accuracy"])
trainingHistory = model.fit(trainX, trainY, epochs=50)    # Learn the model
model.evaluate(testX, testY)                              # Evaluate test-set performance
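To see why cross entropy punishes confidently wrong answers the most, here is a short sketch in plain Python. The predicted distributions are made-up examples, and the function is a hypothetical stand-in for what the Keras loss computes on a single example.

```python
import math

def cross_entropy(target, predicted):
    """Cross entropy between a one-hot target and a predicted distribution."""
    return -sum(t * math.log(p) for t, p in zip(target, predicted) if t > 0)

target = [1.0, 0.0]   # the true class is the first one

print(cross_entropy(target, [0.9, 0.1]))   # confident and right: about 0.105
print(cross_entropy(target, [0.5, 0.5]))   # unsure: about 0.693
print(cross_entropy(target, [0.1, 0.9]))   # confident and wrong: about 2.303
```

The loss grows without bound as the probability assigned to the true class approaches zero.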

The MNIST Dataset

TensorFlow/Keras includes the MNIST dataset
Set of 28x28 pixel images of hand-written numeric digits
Class Values: 0, …, 9
Attributes: 28x28 greyscale images

import tensorflow as tf # Ignore the warnings it will spew

# Get the MNIST data, convert them to float and scale the attribute data to be between 0 and 1
(trainX, trainY), (testX, testY) = tf.keras.datasets.mnist.load_data()
trainX = trainX.astype('float32') / 255.0
testX = testX.astype('float32') / 255.0

Reshaping the Data

Let’s use convolutional neural networks, a common image-analysis technique
We’ll need to convert the images to tensors – cubes of data
Also, we’ll want the target values as “one-hot”: A distribution over possible target values
So the target value 2 becomes [0, 0, 1, 0, 0, 0, 0, 0, 0, 0], etc.

trainX = trainX.reshape( (-1, 28, 28, 1))  # Make each image a 28x28x1 cube
testX  = testX.reshape( (-1, 28, 28, 1))   # Make each image a 28x28x1 cube
trainY = tf.keras.utils.to_categorical(trainY)  # Turn the targets into one-hot representation
testY  = tf.keras.utils.to_categorical(testY)   # Turn the targets into one-hot representation

Building the Model

Start with the input layer taking 28x28x1 tensors
Use a layer to extract image features with 70 different convolutional filters of size 3x3
Then down-sample the resulting images by half in width and height using something called Max Pooling
After that, we flatten it and turn it into a traditional neural network
The output layer will have 10 different outputs (one for each digit), and we’ll turn the result into a probability distribution using softmax:

model = tf.keras.models.Sequential()
model.add( tf.keras.layers.Input(shape=(28,28,1)) )
model.add( tf.keras.layers.Conv2D(70, (3,3), activation="relu") )
model.add( tf.keras.layers.MaxPooling2D((2, 2)) )
model.add( tf.keras.layers.Flatten() )
model.add( tf.keras.layers.Dense(70, activation="relu") )
model.add( tf.keras.layers.Dense(10, activation="softmax") )
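The shape arithmetic for this architecture can be worked through by hand. The sketch below assumes the Keras defaults for these layers (stride 1 and no padding for Conv2D, non-overlapping windows for MaxPooling2D); the helper function names are hypothetical.

```python
def conv_output(height, width, kernel, filters):
    """Output shape of a convolution with stride 1 and no padding."""
    return height - kernel + 1, width - kernel + 1, filters

def pool_output(height, width, channels, pool):
    """Output shape of non-overlapping max pooling."""
    return height // pool, width // pool, channels

h, w, c = 28, 28, 1
conv_shape = conv_output(h, w, kernel=3, filters=70)   # (26, 26, 70)
pool_shape = pool_output(*conv_shape, pool=2)          # (13, 13, 70)
flat_size = pool_shape[0] * pool_shape[1] * pool_shape[2]

# Trainable parameters: weights plus one bias per output unit or filter
conv_params = 3 * 3 * c * 70 + 70
dense_params = flat_size * 70 + 70
output_params = 70 * 10 + 10

print(conv_shape, pool_shape, flat_size)                # (26, 26, 70) (13, 13, 70) 11830
print(conv_params + dense_params + output_params)       # 829580
```

Almost all of the parameters sit in the first dense layer after flattening, which is typical for small convolutional networks like this one.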

Fitting the Model

Neural networks are trained using gradient descent (here: the Adam variant)
That is, they climb down a surface called a loss function
We’re learning probability distributions, so we’ll use cross entropy loss, which punishes confidently wrong answers the most
But humans don’t understand cross entropy values well, so we’ll report performance in terms of accuracy
TF/Keras models have to be compiled (for potential GPU deployment), then fit:

model.compile(optimizer="adam", loss="CategoricalCrossentropy", metrics=["accuracy"])
trainingHistory = model.fit(trainX, trainY, epochs=10)    # Learn the model
model.evaluate(testX, testY)                              # Evaluate test-set performance
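To illustrate "climbing down" a loss function, here is a minimal sketch of plain gradient descent on a one-parameter toy loss. This is not Adam, which additionally adapts the step size per parameter; the toy loss, learning rate, and function names are assumptions chosen for illustration.

```python
def gradient_descent(gradient, start, learning_rate=0.1, steps=50):
    """Repeatedly step against the gradient to move downhill on the loss."""
    w = start
    for _ in range(steps):
        w = w - learning_rate * gradient(w)
    return w

# Toy loss: loss(w) = (w - 3)^2, which is smallest at w = 3
def loss_gradient(w):
    return 2 * (w - 3)

w_final = gradient_descent(loss_gradient, start=0.0)
print(w_final)   # very close to 3.0
```

In a real network, w is replaced by all of the layer weights and the gradient is computed from the cross entropy loss on a batch of training examples.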

Machine Learning in Julia

Machine Learning Packages for Julia

MLJ – A general framework for machine learning
ScikitLearn – A wrapper around the Python sklearn package
Flux – A native Julia Neural Network library
TensorFlow – A wrapper around the Python TensorFlow package

More Info: https://juliapackages.com/c/machine-learning

MLJ

The MLJ package is a framework of many libraries for machine learning
It also comes with some built-in data
And convenient ways to split the data

using MLJ
using DataFrames

iris = DataFrames.DataFrame(load_iris());
y, X = unpack(iris, ==(:target); rng=123);   # Strip the target column off
(trainX, testX), (trainY, testY) = MLJ.partition( (X,y), 0.75, multi=true);  # Test/train split
models(matching(X,y))  # Show all models that might be applied to this data

More Info: https://juliaai.github.io/MLJ.jl/dev/getting_started/#Getting-Started

Loading MLJ Libraries

To load a specific library from MLJ:
- Add the library's interface package (typically named MLJ<LIB>Interface, e.g., MLJLIBSVMInterface)
- Use the @load annotation to load a specific component from that library

using Pkg
Pkg.add("MLJLIBSVMInterface")  # Just need to do this once for anything in LIBSVM
SVC = @load SVC pkg=LIBSVM     # To get access to the SVM classifier modeling tool
import LIBSVM # To have access to LIBSVM mechanisms, such as different kernels, etc

Building & Evaluating an SVM Classifier

model = SVC(kernel=LIBSVM.Kernel.Polynomial)        # Create a specific SVM model
fit_model = machine(model, trainX, trainY) |> fit!  # Using training data to learn
predict(fit_model, testX)                           # Predict classes for test set
evaluate(model, testX, testY; resampling=CV(nfolds=2, rng=888), measure=[accuracy])
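The evaluate call above uses CV(nfolds=2), i.e., two-fold cross-validation: the rows are divided into two folds, and each fold takes a turn as the held-out set while the model is refit on the other. Here is a rough sketch of the fold bookkeeping, written in Python. The function name is hypothetical, and the shuffling will not reproduce MLJ's own random assignment for rng=888.

```python
import random

def k_fold_indices(n_rows, n_folds=2, seed=888):
    """Yield (train_indices, test_indices) pairs, one pair per fold."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::n_folds] for i in range(n_folds)]   # deal rows out to folds
    for held_out in range(n_folds):
        test_idx = folds[held_out]
        train_idx = [i for f, fold in enumerate(folds) if f != held_out for i in fold]
        yield train_idx, test_idx

for train_idx, test_idx in k_fold_indices(6):
    print(sorted(train_idx), sorted(test_idx))
```

The reported accuracy is then aggregated over the folds, so every row is used for testing exactly once.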

Building & Evaluating a Decision Tree Classifier

DecisionTreeClassifier = @load DecisionTreeClassifier pkg=DecisionTree
model = DecisionTreeClassifier()                    # Create a decision tree model
fit_model = machine(model, trainX, trainY) |> fit!  # Using training data to learn
predict_mode(fit_model, testX)                      # Predict most likely class for test set
evaluate(model, testX, testY; resampling=CV(nfolds=2, rng=888), measure=[accuracy])

Using Flux for Deep Learning

You’ll need to install a few packages: Flux, Statistics
You might want these too: ProgressMeter, CUDA

using Flux, Statistics
using ProgressMeter, CUDA  # optional
device = gpu_device()      # If using the GPU

# Make some fake data:
trainX = rand(Float32, 2, 500);                                     # 2×500 Matrix
trainY = [xor(col[1]>0.5, col[2]>0.5) for col in eachcol(trainX)];  # XOR t/f for each of those
testX = rand(Float32, 2, 250);                                      # 2×250 Matrix
testY = [xor(col[1]>0.5, col[2]>0.5) for col in eachcol(testX)];    # XOR t/f for each of those

More Info: https://fluxml.ai/Flux.jl/stable/guide/models/quickstart/

Setting Up a Neural Network Model Using Flux

# Set up model: Two-layer MLP:  2 inputs, 3 hidden nodes, 2 outputs
model = Chain( Dense(2 => 3, tanh),   # 2 inputs, 3 hidden nodes, hyperbolic tangent activation
               BatchNorm(3),          # Normalize hidden activations per batch to keep training stable
               Dense(3 => 2)) |> device  # Output layer, 2D signal ... send model to the GPU

# Set up data for Flux's learning system
trainYoh = Flux.onehotbatch(trainY, [true, false]);  # Turn true/false into one-hot rep

# Create the batch loader & the optimizer for learning
loader = Flux.DataLoader( (trainX, trainYoh), batchsize=64, shuffle=true);
optimizer = Flux.setup(Flux.Adam(0.01), model);
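To make the batch loader's job concrete, here is a rough sketch of shuffled mini-batching, written in Python. It assumes the loader keeps the final, smaller batch rather than dropping it; the function name and seed are hypothetical and this is not Flux's implementation.

```python
import random

def batches(n_rows, batch_size=64, seed=0):
    """Shuffle row indices and return them in consecutive mini-batches."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    return [indices[i:i + batch_size] for i in range(0, n_rows, batch_size)]

sizes = [len(b) for b in batches(500)]
print(sizes)   # [64, 64, 64, 64, 64, 64, 64, 52]
```

With 500 training columns and a batch size of 64, one pass over the data (an epoch) consists of eight weight updates.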

ANN Learning

losses = []  # Store loss values as you learn
@showprogress for epoch in 1:1_000      # Omit the showprogress if not using pkg
  for batch_sample in loader
    # Grab samples from batch, put it into the GPU
    x, y = batch_sample |> device
    loss, gradients = Flux.withgradient(model) do midstep_model
      y_hat = midstep_model(x)          # Apply the current model to x
      Flux.logitcrossentropy(y_hat, y)  # Collect crossentropy loss over the batch
    end
    Flux.update!( optimizer,  model, gradients[1]) # Use optimizer to adjust wts
    push!(losses, loss)  # Record this batch's loss
  end
end

Evaluation of the Model

# How accurate is the model on training data?
trainOut = model(trainX |> device);   # Training data to GPU, apply model get output
trainProb = softmax(trainOut) |> cpu; # Convert output to probabilities, put back on CPU
mean( (trainProb[1,:] .> 0.5) .== trainY )  # Compute accuracy of prediction

# How accurate is the model on testing data?
testOut = model(testX |> device);    # Testing data to GPU, apply model get output
testProb = softmax(testOut) |> cpu;  # Convert output to probabilities, put back on CPU
mean( (testProb[1,:] .> 0.5) .== testY )  # Compute accuracy of prediction