Spring 2025

Building Classifiers Using SciKit-Learn

The SciKit-Learn Package

  • A popular general-purpose machine learning package for Python is SciKit-Learn (sklearn)
  • It contains a lot of general-purpose ML functions (e.g., splitting data)
  • It also has a lot of machine learning modeling tools built in
  • It relies on numpy and is relatively easy to install: pip install scikit-learn

Breast Cancer Dataset

  • There is a well-known (small) data set of tumor biopsy measurements for diagnosing breast cancer
  • Class: malignant, benign
  • Thirty real-valued attributes, including measures for the average radius of a tumor, average area, etc.
  • SciKit-Learn gives it to us in Python:
import sklearn.datasets as skds
cancer = skds.load_breast_cancer()
print("Target Values:       ", cancer.target_names)
print("Shape of data:       ", cancer.data.shape)
print("Attribute Variables: ", cancer.feature_names)

Breast Cancer Data Summaries

## Target Values:        ['malignant' 'benign']
## Shape of data:        (569, 30)
## Attribute Variables:  ['mean radius' 'mean texture' 'mean perimeter' 'mean area'
##  'mean smoothness' 'mean compactness' 'mean concavity'
##  'mean concave points' 'mean symmetry' 'mean fractal dimension'
##  'radius error' 'texture error' 'perimeter error' 'area error'
##  'smoothness error' 'compactness error' 'concavity error'
##  'concave points error' 'symmetry error' 'fractal dimension error'
##  'worst radius' 'worst texture' 'worst perimeter' 'worst area'
##  'worst smoothness' 'worst compactness' 'worst concavity'
##  'worst concave points' 'worst symmetry' 'worst fractal dimension']
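Since the targets are stored numerically (index 0 = malignant, 1 = benign), a quick way to sanity-check the data is to count the labels. A minimal sketch, assuming cancer was loaded as above:

import numpy as np

# Count how many biopsies fall into each class (index 0 = malignant, 1 = benign)
counts = np.bincount(cancer.target)
for name, count in zip(cancer.target_names, counts):
    print(name, count)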

Splitting Up the Data

  • Recall that we’d like to divide our data into training and testing sets
  • Each set will have both the input data (X) and the target data (y)
  • So there will be four data sets
  • SciKit-Learn gives us a way to randomly assign these:
from sklearn.model_selection import train_test_split
trainX, testX, trainY, testY = train_test_split(cancer.data, cancer.target, test_size=0.4, random_state=1)
print("Training data shape:    ", trainX.shape)
print("Training target shape:  ", trainY.shape)
print("Testing data shape:     ", testX.shape)
print("Testing target shape:   ", testY.shape)

Classifying The Data With Support Vector Machines

Support vector machines are machine learning models that try to find the optimal decision surface separating positive from negative points. The model below uses a linear decision surface:

import sklearn.svm as svm
import sklearn.metrics as metrics   # Needed below for accuracy_score

# Build then fit the model
model = svm.SVC(kernel='linear', C=1000)
fit = model.fit(trainX, trainY)

# Predict with the model
model.predict(testX)

# Evaluate the model:
metrics.accuracy_score(testY, model.predict(testX))
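Accuracy alone hides which class the model gets wrong. The same metrics module also provides a confusion matrix and a per-class report; a short sketch using the fitted SVM from above:

predY = model.predict(testX)
print(metrics.confusion_matrix(testY, predY))   # Rows are true classes, columns are predictions
print(metrics.classification_report(testY, predY, target_names=cancer.target_names))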

Classifying The Data With Decision Trees

Decision trees use information theory to build a tree that classifies based on the most informative variable values:

import sklearn.tree as dt

# Build then fit the model
model = dt.DecisionTreeClassifier()
fit = model.fit(trainX, trainY)

# Predict with the model
model.predict(testX)

# Evaluate the model:
metrics.accuracy_score(testY, model.predict(testX))
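Every SciKit-Learn classifier follows this same build/fit/predict pattern, so other models can be swapped in with one line. For example, a minimal sketch of a Gaussian Naive Bayes classifier, which models each attribute as a per-class Gaussian, on the same split:

import sklearn.naive_bayes as nb

# Build then fit the model
model = nb.GaussianNB()
fit = model.fit(trainX, trainY)

# Evaluate the model:
metrics.accuracy_score(testY, model.predict(testX))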

Using TensorFlow for Neural Networks

Setting Up TensorFlow on Hopper

  • SciKit-Learn doesn’t have good neural network modeling tools
  • For neural network models, we’ll use TensorFlow 2
  • TensorFlow 2 can be difficult to set up, but Hopper has most of what you need
  • To set it up, you’ll have to create a virtual environment on Hopper:
python3 -m venv --system-site-packages ~/tensorflow
source ~/tensorflow/bin/activate   # Do this every time you want to use TF
pip3 install --upgrade tensorflow  # Do this the first time, it will take a while
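To confirm the install worked, a quick check from inside the activated environment (a sketch; the exact version printed depends on what pip installed):

import tensorflow as tf
print(tf.__version__)                      # Should report a 2.x version
print(tf.config.list_physical_devices())   # Lists the CPUs/GPUs TensorFlow can see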

Revisiting / Reshaping the Breast Cancer Data

  • Let’s send our 30 variables into a dense, feed-forward neural network
  • We’ll want the target values as “one-hot”: A distribution over possible target values
  • So the target value “malignant” becomes [1,0], and “benign” becomes [0,1]
import tensorflow as tf
import sklearn.datasets as skds
from sklearn.model_selection import train_test_split
cancer = skds.load_breast_cancer()
trainX, testX, trainY, testY = train_test_split(cancer.data, cancer.target, test_size=0.4, random_state=1)
trainY = tf.keras.utils.to_categorical(trainY)  # Turn the targets into one-hot representation
testY  = tf.keras.utils.to_categorical(testY)   # Turn the targets into one-hot representation
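To see what to_categorical is doing, here is a tiny sketch on hand-made labels (not part of the cancer data):

import numpy as np
labels = np.array([0, 1, 1])                  # 0 = malignant, 1 = benign
print(tf.keras.utils.to_categorical(labels))  # [[1. 0.], [0. 1.], [0. 1.]]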

Building a Neural Network for the Cancer Data

  • Neural network architectures are constructed in layers, starting with the input layer
  • We’ll use three dense layers, each with 50 nodes
  • The output layer will have 2 different outputs (one for “malignant”, one for “benign”)
  • We interpret that result as a probability distribution over those two possibilities using softmax:
model = tf.keras.models.Sequential()
model.add( tf.keras.layers.Input(shape=(30,)) )
model.add( tf.keras.layers.Dense(50, activation="relu") )
model.add( tf.keras.layers.Dense(50, activation="relu") )
model.add( tf.keras.layers.Dense(50, activation="relu") )
model.add( tf.keras.layers.Dense(2, activation="softmax") )
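Calling model.summary() prints the architecture and parameter counts, which is a good sanity check before training:

model.summary()   # e.g., the first Dense layer has 30*50 weights + 50 biases = 1,550 parameters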

Fitting the Model

  • Neural networks are learned using something called gradient descent (here: the Adam optimizer)
  • That is, they climb down a surface called a loss function
  • We’re learning probability distributions, so we’ll use cross-entropy loss, which punishes confidently wrong answers the most
  • But humans don’t understand cross-entropy values well, so we’ll report performance in terms of accuracy
  • TF/Keras models have to be compiled (for potential GPU deployment), then fit:
model.compile(optimizer="adam", loss="CategoricalCrossentropy", metrics=["accuracy"])
trainingHistory = model.fit(trainX, trainY, epochs=50)    # Learn the model
model.evaluate(testX, testY)                              # Evaluate test-set performance
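The object returned by fit records the loss and accuracy after every epoch, which is useful for spotting under- or over-fitting; a small sketch:

# Per-epoch training metrics, keyed by the names used in compile()
print(trainingHistory.history["loss"][-1])      # Training loss after the final epoch
print(trainingHistory.history["accuracy"][-1])  # Training accuracy after the final epoch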

The MNIST Dataset

  • TensorFlow/Keras include the MNIST dataset
  • Set of 28x28 pixel images of hand-written numeric digits
  • Class Values: 0, …, 9
  • Attributes: 28x28 greyscale images
import tensorflow as tf # Ignore the warnings it will spew

# Get the MNIST data, convert them to float and scale the attribute data to be between 0 and 1
(trainX, trainY), (testX, testY) = tf.keras.datasets.mnist.load_data()
trainX = trainX.astype('float32') / 255.0
testX = testX.astype('float32') / 255.0
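It is worth confirming the shapes: MNIST ships as 60,000 training images and 10,000 testing images, each 28x28:

print(trainX.shape, trainY.shape)   # (60000, 28, 28) (60000,)
print(testX.shape, testY.shape)     # (10000, 28, 28) (10000,)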

Reshaping the Data

  • Let’s use convolutional neural networks, a common image-analysis technique
  • We’ll need to convert the images to tensors – cubes of data
  • Also, we’ll want the target values as “one-hot”: A distribution over possible target values
  • So the target value 2 becomes [0, 0, 1, 0, 0, 0, 0, 0, 0, 0], etc.
trainX = trainX.reshape( (-1, 28, 28, 1))  # Make each image a 28x28x1 cube
testX  = testX.reshape( (-1, 28, 28, 1))   # Make each image a 28x28x1 cube
trainY = tf.keras.utils.to_categorical(trainY)  # Turn the targets into one-hot representation
testY  = tf.keras.utils.to_categorical(testY)   # Turn the targets into one-hot representation
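A quick shape check confirms both transformations:

print(trainX.shape)   # (60000, 28, 28, 1): images are now 28x28x1 cubes
print(trainY.shape)   # (60000, 10): targets are now one-hot vectors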

Building the Model

  • Start with the input layer taking 28x28x1 tensors
  • Use a layer to extract image features with 70 different convolutional filters of size 3x3
  • Then down-sample the resulting images by half in width and height using something called Max Pooling
  • After that, we flatten it and turn it into a traditional neural network
  • The output layer will have 10 different outputs (one for each digit), and we’ll turn the result into a probability distribution using softmax:
model = tf.keras.models.Sequential()
model.add( tf.keras.layers.Input(shape=(28,28,1)) )
model.add( tf.keras.layers.Conv2D(70, (3,3), activation="relu") )
model.add( tf.keras.layers.MaxPooling2D((2, 2)) )
model.add( tf.keras.layers.Flatten() )
model.add( tf.keras.layers.Dense(70, activation="relu") )
model.add( tf.keras.layers.Dense(10, activation="softmax") )

Fitting the Model

  • Neural networks are learned using something called gradient descent (here: the Adam optimizer)
  • That is, they climb down a surface called a loss function
  • We’re learning probability distributions, so we’ll use cross-entropy loss, which punishes confidently wrong answers the most
  • But humans don’t understand cross-entropy values well, so we’ll report performance in terms of accuracy
  • TF/Keras models have to be compiled (for potential GPU deployment), then fit:
model.compile(optimizer="adam", loss="CategoricalCrossentropy", metrics=["accuracy"])
trainingHistory = model.fit(trainX, trainY, epochs=10)    # Learn the model
model.evaluate(testX, testY)                              # Evaluate test-set performance
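After training, the model’s output for a single image is a distribution over the ten digits; taking the argmax recovers the predicted digit. A small sketch, assuming numpy is available as np:

import numpy as np

probs = model.predict(testX[:1])      # Shape (1, 10): one distribution over the digits
print(np.argmax(probs, axis=1))       # The most probable digit for that image
print(np.argmax(testY[:1], axis=1))   # The true digit (undoing the one-hot encoding)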