Spring 2025

Building Classifiers Using SciKit-Learn

The SciKit-Learn Package

  • A popular general-purpose machine learning package for Python is SciKit-Learn (sklearn)
  • It contains a lot of general-purpose ML functions (e.g., splitting data)
  • It also has a lot of machine learning modeling tools built in
  • It relies on numpy and is relatively easy to install: pip install scikit-learn

Breast Cancer Dataset

  • There is a well-known (small) data set of tumor biopsy measurements for diagnosing breast cancer
  • Class: malignant, benign
  • Thirty real-valued attributes, including measures for the average radius of a tumor, average area, etc.
  • SciKit-Learn gives it to us in Python:
import sklearn.datasets as skds
cancer = skds.load_breast_cancer()
print("Target Values:       ", cancer.target_names)
print("Shape of data:       ", cancer.data.shape)
print("Attribute Variables: ", cancer.feature_names)

Breast Cancer Data Summaries

## Target Values:        ['malignant' 'benign']
## Shape of data:        (569, 30)
## Attribute Variables:  ['mean radius' 'mean texture' 'mean perimeter' 'mean area'
##  'mean smoothness' 'mean compactness' 'mean concavity'
##  'mean concave points' 'mean symmetry' 'mean fractal dimension'
##  'radius error' 'texture error' 'perimeter error' 'area error'
##  'smoothness error' 'compactness error' 'concavity error'
##  'concave points error' 'symmetry error' 'fractal dimension error'
##  'worst radius' 'worst texture' 'worst perimeter' 'worst area'
##  'worst smoothness' 'worst compactness' 'worst concavity'
##  'worst concave points' 'worst symmetry' 'worst fractal dimension']
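Since the targets are stored numerically (index 0 = malignant, 1 = benign), a quick way to sanity-check the data is to count the labels. A minimal sketch, assuming cancer was loaded as above:

import numpy as np

# Count how many biopsies fall into each class (index 0 = malignant, 1 = benign)
counts = np.bincount(cancer.target)
for name, count in zip(cancer.target_names, counts):
    print(name, count)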

Splitting Up the Data

  • Recall that we’d like to divide our data into training and testing sets
  • Each set will have both the input data (X) and the target data (y)
  • So there will be four data sets
  • SciKit-Learn gives us a way to randomly assign these:
from sklearn.model_selection import train_test_split
trainX, testX, trainY, testY = train_test_split(cancer.data, cancer.target, test_size=0.4, random_state=1)
print("Training data shape:    ", trainX.shape)
print("Training target shape:  ", trainY.shape)
print("Testing data shape:     ", testX.shape)
print("Testing target shape:   ", testY.shape)

Classifying The Data With Support Vector Machines

Support vector machines are machine learning models that try to find the optimal decision surface separating positive from negative points. The model below uses a linear decision surface:

import sklearn.svm as svm
import sklearn.metrics as metrics   # Needed below for accuracy_score

# Build then fit the model
model = svm.SVC(kernel='linear', C=1000)
fit = model.fit(trainX, trainY)

# Predict with the model
model.predict(testX)

# Evaluate the model:
metrics.accuracy_score(testY, model.predict(testX))
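Accuracy alone hides which class the model gets wrong. The same metrics module also provides a confusion matrix and a per-class report; a short sketch using the fitted SVM from above:

predY = model.predict(testX)
print(metrics.confusion_matrix(testY, predY))   # Rows are true classes, columns are predictions
print(metrics.classification_report(testY, predY, target_names=cancer.target_names))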

Classifying The Data With Decision Trees

Decision trees use information theory to build a tree that classifies based on the most informative variable values:

import sklearn.tree as dt

# Build then fit the model
model = dt.DecisionTreeClassifier()
fit = model.fit(trainX, trainY)

# Predict with the model
model.predict(testX)

# Evaluate the model:
metrics.accuracy_score(testY, model.predict(testX))
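Every SciKit-Learn classifier follows this same build/fit/predict pattern, so other models can be swapped in with one line. For example, a minimal sketch of a Gaussian Naive Bayes classifier, which models each attribute as a per-class Gaussian, on the same split:

import sklearn.naive_bayes as nb

# Build then fit the model
model = nb.GaussianNB()
fit = model.fit(trainX, trainY)

# Evaluate the model:
metrics.accuracy_score(testY, model.predict(testX))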

Using TensorFlow for Neural Networks

Setting Up TensorFlow on Hopper

  • SciKit-Learn doesn’t have good neural network modeling tools
  • For neural network models, we’ll use TensorFlow 2
  • TensorFlow 2 can be difficult to set up, but Hopper has most of what you need
  • To set it up, you’ll have to create a virtual environment on Hopper:
python3 -m venv --system-site-packages ~/tensorflow
source ~/tensorflow/bin/activate   # Do this every time you want to use TF
pip3 install --upgrade tensorflow  # Do this the first time, it will take a while
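To confirm the install worked, a quick check from inside the activated environment (a sketch; the exact version printed depends on what pip installed):

import tensorflow as tf
print(tf.__version__)                      # Should report a 2.x version
print(tf.config.list_physical_devices())   # Lists the CPUs/GPUs TensorFlow can see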

Revisiting / Reshaping the Breast Cancer Data

  • Let’s send our 30 variables into a dense, feed-forward neural network
  • We’ll want the target values as “one-hot”: A distribution over possible target values
  • So the target value “malignant” becomes [1,0], and “benign” becomes [0,1]
import tensorflow as tf
import sklearn.datasets as skds
from sklearn.model_selection import train_test_split
cancer = skds.load_breast_cancer()
trainX, testX, trainY, testY = train_test_split(cancer.data, cancer.target, test_size=0.4, random_state=1)
trainY = tf.keras.utils.to_categorical(trainY)  # Turn the targets into one-hot representation
testY  = tf.keras.utils.to_categorical(testY)   # Turn the targets into one-hot representation
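To see what to_categorical is doing, here is a tiny sketch on hand-made labels (not part of the cancer data):

import numpy as np
labels = np.array([0, 1, 1])                  # 0 = malignant, 1 = benign
print(tf.keras.utils.to_categorical(labels))  # [[1. 0.], [0. 1.], [0. 1.]]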

Building a Neural Network for the Cancer Data

  • Neural network architectures are constructed in layers, starting with the input layer
  • We’ll use three dense layers, each with 50 nodes
  • The output layer will have 2 different outputs (one for “malignant”, one for “benign”)
  • We interpret that result as a probability distribution over those two possibilities using softmax:
model = tf.keras.models.Sequential()
model.add( tf.keras.layers.Input(shape=(30,)) )
model.add( tf.keras.layers.Dense(50, activation="relu") )
model.add( tf.keras.layers.Dense(50, activation="relu") )
model.add( tf.keras.layers.Dense(50, activation="relu") )
model.add( tf.keras.layers.Dense(2, activation="softmax") )
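Calling model.summary() prints the architecture and parameter counts, which is a good sanity check before training:

model.summary()   # e.g., the first Dense layer has 30*50 weights + 50 biases = 1,550 parameters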

Fitting the Model

  • Neural networks are learned using something called gradient descent (here: the Adam optimizer)
  • That is, they climb down a surface called a loss function
  • We’re learning probability distributions, so we’ll use cross-entropy loss, which punishes confidently wrong answers the most
  • But humans don’t understand cross-entropy values well, so we’ll report performance in terms of accuracy
  • TF/Keras models have to be compiled (for potential GPU deployment), then fit:
model.compile(optimizer="adam", loss="CategoricalCrossentropy", metrics=["accuracy"])
trainingHistory = model.fit(trainX, trainY, epochs=50)    # Learn the model
model.evaluate(testX, testY)                              # Evaluate test-set performance
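The object returned by fit records the loss and accuracy after every epoch, which is useful for spotting under- or over-fitting; a small sketch:

# Per-epoch training metrics, keyed by the names used in compile()
print(trainingHistory.history["loss"][-1])      # Training loss after the final epoch
print(trainingHistory.history["accuracy"][-1])  # Training accuracy after the final epoch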

The MNIST Dataset

  • TensorFlow/Keras include the MNIST dataset
  • Set of 28x28 pixel images of hand-written numeric digits
  • Class Values: 0, …, 9
  • Attributes: 28x28 greyscale images
import tensorflow as tf # Ignore the warnings it will spew

# Get the MNIST data, convert them to float and scale the attribute data to be between 0 and 1
(trainX, trainY), (testX, testY) = tf.keras.datasets.mnist.load_data()
trainX = trainX.astype('float32') / 255.0
testX = testX.astype('float32') / 255.0
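It is worth confirming the shapes: MNIST ships as 60,000 training images and 10,000 testing images, each 28x28:

print(trainX.shape, trainY.shape)   # (60000, 28, 28) (60000,)
print(testX.shape, testY.shape)     # (10000, 28, 28) (10000,)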

Reshaping the Data

  • Let’s use convolutional neural networks, a common image-analysis technique
  • We’ll need to convert the images to tensors – cubes of data
  • Also, we’ll want the target values as “one-hot”: A distribution over possible target values
  • So the target value 2 becomes [0, 0, 1, 0, 0, 0, 0, 0, 0, 0], etc.
trainX = trainX.reshape( (-1, 28, 28, 1))  # Make each image a 28x28x1 cube
testX  = testX.reshape( (-1, 28, 28, 1))   # Make each image a 28x28x1 cube
trainY = tf.keras.utils.to_categorical(trainY)  # Turn the targets into one-hot representation
testY  = tf.keras.utils.to_categorical(testY)   # Turn the targets into one-hot representation
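A quick shape check confirms both transformations:

print(trainX.shape)   # (60000, 28, 28, 1): images are now 28x28x1 cubes
print(trainY.shape)   # (60000, 10): targets are now one-hot vectors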

Building the Model

  • Start with the input layer taking 28x28x1 tensors
  • Use a layer to extract image features with 70 different convolutional filters of size 3x3
  • Then down-sample the resulting images by half in width and height using something called Max Pooling
  • After that, we flatten it and turn it into a traditional neural network
  • The output layer will have 10 different outputs (one for each digit), and we’ll turn the result into a probability distribution using softmax:
model = tf.keras.models.Sequential()
model.add( tf.keras.layers.Input(shape=(28,28,1)) )
model.add( tf.keras.layers.Conv2D(70, (3,3), activation="relu") )
model.add( tf.keras.layers.MaxPooling2D((2, 2)) )
model.add( tf.keras.layers.Flatten() )
model.add( tf.keras.layers.Dense(70, activation="relu") )
model.add( tf.keras.layers.Dense(10, activation="softmax") )

Fitting the Model

  • Neural networks are learned using something called gradient descent (here: the Adam optimizer)
  • That is, they climb down a surface called a loss function
  • We’re learning probability distributions, so we’ll use cross-entropy loss, which punishes confidently wrong answers the most
  • But humans don’t understand cross-entropy values well, so we’ll report performance in terms of accuracy
  • TF/Keras models have to be compiled (for potential GPU deployment), then fit:
model.compile(optimizer="adam", loss="CategoricalCrossentropy", metrics=["accuracy"])
trainingHistory = model.fit(trainX, trainY, epochs=10)    # Learn the model
model.evaluate(testX, testY)                              # Evaluate test-set performance
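After training, the model’s output for a single image is a distribution over the ten digits; taking the argmax recovers the predicted digit. A small sketch, assuming numpy is available as np:

import numpy as np

probs = model.predict(testX[:1])      # Shape (1, 10): one distribution over the digits
print(np.argmax(probs, axis=1))       # The most probable digit for that image
print(np.argmax(testY[:1], axis=1))   # The true digit (undoing the one-hot encoding)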