I. Introduction - Iris Dataset
Part 2 of notes for Introduction to Machine Learning with Python by Müller and Guido
Meet the data
Data description
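The description below is produced by loading the dataset with scikit-learn's `load_iris` function and printing its `DESCR` entry (the loading code was not shown in the original notes):

from sklearn.datasets import load_iris
## load_iris returns a Bunch object, which behaves like a dictionary
iris_dataset = load_iris()
print(iris_dataset['DESCR'])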
## .. _iris_dataset:
##
## Iris plants dataset
## --------------------
##
## **Data Set Characteristics:**
##
## :Number of Instances: 150 (50 in each of three classes)
## :Number of Attributes: 4 numeric, predictive attributes and the class
## :Attribute Information:
## - sepal length in cm
## - sepal width in cm
## - petal length in cm
## - petal width in cm
## - class:
## - Iris-Setosa
## - Iris-Versicolour
## - Iris-Virginica
##
## :Summary Statistics:
##
## ============== ==== ==== ======= ===== ====================
## Min Max Mean SD Class Correlation
## ============== ==== ==== ======= ===== ====================
## sepal length: 4.3 7.9 5.84 0.83 0.7826
## sepal width: 2.0 4.4 3.05 0.43 -0.4194
## petal length: 1.0 6.9 3.76 1.76 0.9490 (high!)
## petal width: 0.1 2.5 1.20 0.76 0.9565 (high!)
## ============== ==== ==== ======= ===== ====================
##
## :Missing Attribute Values: None
## :Class Distribution: 33.3% for each of 3 classes.
## :Creator: R.A. Fisher
## :Donor: Michael Marshall (MARSHALL%PLU@io.arc.nasa.gov)
## :Date: July, 1988
##
## The famous Iris database, first used by Sir R.A. Fisher. The dataset is taken
## from Fisher's paper. Note that it's the same as in R, but not as in the UCI
## Machine Learning Repository, which has two wrong data points.
##
## This is perhaps the best known database to be found in the
## pattern recognition literature. Fisher's paper is a classic in the field and
## is referenced frequently to this day. (See Duda & Hart, for example.) The
## data set contains 3 classes of 50 instances each, where each class refers to a
## type of iris plant. One class is linearly separable from the other 2; the
## latter are NOT linearly separable from each other.
##
## .. topic:: References
##
## - Fisher, R.A. "The use of multiple measurements in taxonomic problems"
## Annual Eugenics, 7, Part II, 179-188 (1936); also in "Contributions to
## Mathematical Statistics" (John Wiley, NY, 1950).
## - Duda, R.O., & Hart, P.E. (1973) Pattern Classification and Scene Analysis.
## (Q327.D83) John Wiley & Sons. ISBN 0-471-22361-1. See page 218.
## - Dasarathy, B.V. (1980) "Nosing Around the Neighborhood: A New System
## Structure and Classification Rule for Recognition in Partially Exposed
## Environments". IEEE Transactions on Pattern Analysis and Machine
## Intelligence, Vol. PAMI-2, No. 1, 67-71.
## - Gates, G.W. (1972) "The Reduced Nearest Neighbor Rule". IEEE Transactions
## on Information Theory, May 1972, 431-433.
## - See also: 1988 MLC Proceedings, 54-64. Cheeseman et al"s AUTOCLASS II
## conceptual clustering system finds 3 classes in the data.
## - Many, many more ...
Keys
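The keys of the Bunch object can be listed directly:

print(iris_dataset.keys())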
## dict_keys(['data', 'target', 'target_names', 'DESCR', 'feature_names', 'filename'])
Features
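The feature names are stored under the `feature_names` key:

print(iris_dataset['feature_names'])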
## ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
Target variable
There are three classes in this classification problem: “setosa”, “versicolor” and “virginica”. In the `target` array, they are encoded as the integers 0, 1 and 2.
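Both the class names and the encoded labels shown below come from the corresponding keys:

print(iris_dataset['target_names'])
print(iris_dataset['target'])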
## ['setosa' 'versicolor' 'virginica']
## [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2
## 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## 2 2]
Data preview
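The shape and the first ten rows shown below come from the `data` array:

## One row per flower, one column per feature
print(iris_dataset['data'].shape)
print(iris_dataset['data'][:10])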
## (150, 4)
## [[5.1 3.5 1.4 0.2]
## [4.9 3. 1.4 0.2]
## [4.7 3.2 1.3 0.2]
## [4.6 3.1 1.5 0.2]
## [5. 3.6 1.4 0.2]
## [5.4 3.9 1.7 0.4]
## [4.6 3.4 1.4 0.3]
## [5. 3.4 1.5 0.2]
## [4.4 2.9 1.4 0.2]
## [4.9 3.1 1.5 0.1]]
First Things First, Look at Your Data
- Before building a model, you should always visualize the data to check whether:
  - There are problems with the data
  - The problem is amenable to machine learning
  - The dataset contains the information needed to solve the problem
- A pair plot creates a scatterplot for every pair of features, showing their pairwise relationships
  - This is feasible only for datasets with a small number of features
  - A pair plot cannot show the interaction of all features at once, however, so some interesting aspects of the data may not be revealed
## Import libraries
import pandas as pd
import mglearn
import matplotlib.pyplot as plt
## Create a dataframe from the NumPy array, labeled using feature names
iris_dataframe = pd.DataFrame(iris_dataset["data"],
columns=iris_dataset["feature_names"])
## Create a scatter matrix from the dataframe, colored by the target variable
grr = pd.plotting.scatter_matrix(iris_dataframe,
                                 c=iris_dataset["target"],
                                 figsize=(15, 15),
                                 marker='o',
                                 hist_kwds={'bins': 20},
                                 s=60,
                                 alpha=.8,
                                 cmap=mglearn.cm3)
plt.show()
Measuring Success: Training and Testing Data
- Need to split the dataset into a training set and a test set
  - Data used to build the model cannot also be used to evaluate it
Split the dataset into 75% training set and 25% test set:
- The `train_test_split` function randomly extracts 75% of the samples as the training set and keeps the remaining 25% as the test set
- In `scikit-learn`, `X` denotes the input data, while `y` denotes the class labels
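The split itself, which also appears in the workflow summary at the end of these notes:

from sklearn.model_selection import train_test_split
## random_state=0 fixes the pseudorandom shuffle so the split is reproducible
X_train, X_test, y_train, y_test = train_test_split(iris_dataset['data'],
                                                    iris_dataset['target'],
                                                    random_state=0)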
Check the shapes of the training and test set dataframes
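With the default 25% test split, 150 samples divide into 112 training and 38 test samples:

## Expected shapes: (112, 4) for X_train and (38, 4) for X_test
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)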
Building Your First Model: k-Nearest Neighbours
- The entire training set is stored in memory when constructing a kNN model
- To make a prediction for a new data point, the algorithm finds the point in the training set that is closest to the new point and assigns the label of this training point to the new data point.
- k denotes the number of nearest neighbours that are considered when making a prediction
Create a kNN model object and fit it to the training set
- All machine learning models in `scikit-learn` are implemented in their own Estimator classes
- The `knn` object:
  - Holds the information that the algorithm has extracted from the training data
  - Encapsulates the algorithm that will build the model from the training data and make predictions on new data points
- `n_neighbors` specifies the number of neighbouring points considered
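Creating and fitting the model, mirroring the summary code at the end of these notes:

from sklearn.neighbors import KNeighborsClassifier
## Instantiate the estimator, considering only the single nearest neighbour
knn = KNeighborsClassifier(n_neighbors=1)
## Build the model on the training set
knn.fit(X_train, y_train)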
Inspect the kNN model object to see parameters
- The `fit` method returns the `knn` object itself (modified in place); its printed representation shows which parameters were used in creating the model
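One way to display the representation shown below is to print the fitted object (in a Jupyter notebook, evaluating `knn` on its own has the same effect):

print(knn)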
## KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
## metric_params=None, n_jobs=None, n_neighbors=1, p=2,
## weights='uniform')
Making Predictions
Input a new set of measurements to the model to predict the species:
import numpy as np
## Measurements for a single new flower; note the 2D shape (1, 4)
X_new = np.array([[5, 2.9, 1, 0.2]])
prediction = knn.predict(X_new)
print(iris_dataset['target_names'][prediction])
## ['setosa']
Evaluating the Model
Calculate the mean accuracy of the model on the test set:
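The accuracy below comes from the `score` method, which compares the model's predictions on `X_test` against the true labels (the same call appears in the summary):

## score computes the mean accuracy on the given test data and labels
print(knn.score(X_test, y_test))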
## 0.9736842105263158
Summary of workflow
The iris dataset workflow presented in this chapter contains the core code for applying supervised machine learning algorithms using scikit-learn:
## Imports needed for the full workflow
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

## Load the dataset
iris_dataset = load_iris()
## Split the data into a training and a test set
X_train, X_test, y_train, y_test = train_test_split(iris_dataset['data'],
                                                    iris_dataset['target'],
                                                    random_state=0)
## Create an instance of the kNN model
knn = KNeighborsClassifier(n_neighbors=1)
## Fit the model to the training set
knn.fit(X_train, y_train)
## Measure model performance on the test set
print(knn.score(X_test, y_test))
Further reading
- Nearest Neighbors - scikit-learn
- k - Nearest Neighbor Classifier - Stanford CS231n Notes
- KNN Classification using Scikit-learn - DataCamp