I. Introduction - Iris Dataset

Meet the data

Data description

## .. _iris_dataset:
## 
## Iris plants dataset
## --------------------
## 
## **Data Set Characteristics:**
## 
##     :Number of Instances: 150 (50 in each of three classes)
##     :Number of Attributes: 4 numeric, predictive attributes and the class
##     :Attribute Information:
##         - sepal length in cm
##         - sepal width in cm
##         - petal length in cm
##         - petal width in cm
##         - class:
##                 - Iris-Setosa
##                 - Iris-Versicolour
##                 - Iris-Virginica
##                 
##     :Summary Statistics:
## 
##     ============== ==== ==== ======= ===== ====================
##                     Min  Max   Mean    SD   Class Correlation
##     ============== ==== ==== ======= ===== ====================
##     sepal length:   4.3  7.9   5.84   0.83    0.7826
##     sepal width:    2.0  4.4   3.05   0.43   -0.4194
##     petal length:   1.0  6.9   3.76   1.76    0.9490  (high!)
##     petal width:    0.1  2.5   1.20   0.76    0.9565  (high!)
##     ============== ==== ==== ======= ===== ====================
## 
##     :Missing Attribute Values: None
##     :Class Distribution: 33.3% for each of 3 classes.
##     :Creator: R.A. Fisher
##     :Donor: Michael Marshall (MARSHALL%PLU@io.arc.nasa.gov)
##     :Date: July, 1988
## 
## The famous Iris database, first used by Sir R.A. Fisher. The dataset is taken
## from Fisher's paper. Note that it's the same as in R, but not as in the UCI
## Machine Learning Repository, which has two wrong data points.
## 
## This is perhaps the best known database to be found in the
## pattern recognition literature.  Fisher's paper is a classic in the field and
## is referenced frequently to this day.  (See Duda & Hart, for example.)  The
## data set contains 3 classes of 50 instances each, where each class refers to a
## type of iris plant.  One class is linearly separable from the other 2; the
## latter are NOT linearly separable from each other.
## 
## .. topic:: References
## 
##    - Fisher, R.A. "The use of multiple measurements in taxonomic problems"
##      Annual Eugenics, 7, Part II, 179-188 (1936); also in "Contributions to
##      Mathematical Statistics" (John Wiley, NY, 1950).
##    - Duda, R.O., & Hart, P.E. (1973) Pattern Classification and Scene Analysis.
##      (Q327.D83) John Wiley & Sons.  ISBN 0-471-22361-1.  See page 218.
##    - Dasarathy, B.V. (1980) "Nosing Around the Neighborhood: A New System
##      Structure and Classification Rule for Recognition in Partially Exposed
##      Environments".  IEEE Transactions on Pattern Analysis and Machine
##      Intelligence, Vol. PAMI-2, No. 1, 67-71.
##    - Gates, G.W. (1972) "The Reduced Nearest Neighbor Rule".  IEEE Transactions
##      on Information Theory, May 1972, 431-433.
##    - See also: 1988 MLC Proceedings, 54-64.  Cheeseman et al"s AUTOCLASS II
##      conceptual clustering system finds 3 classes in the data.
##    - Many, many more ...

Data type

## <class 'sklearn.utils.Bunch'>

Keys

## dict_keys(['data', 'target', 'target_names', 'DESCR', 'feature_names', 'filename'])

Features

## ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
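
The outputs above show no code, so the following is a minimal sketch of how they could be produced, assuming the dataset is loaded with scikit-learn's `load_iris`. The variable name `iris_dataset` is an assumption, and the exact set of keys depends on the installed scikit-learn version.

```python
from sklearn.datasets import load_iris

# Load the dataset bundled with scikit-learn; the result is a
# dictionary-like Bunch object.
iris_dataset = load_iris()

print(type(iris_dataset))             # the Bunch class
print(iris_dataset.keys())            # available fields (set varies by version)
print(iris_dataset["feature_names"])  # names of the four measurements
print(iris_dataset["DESCR"][:200])    # beginning of the description text
```

A Bunch supports both dictionary-style access (`iris_dataset["data"]`) and attribute access (`iris_dataset.data`).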

Target variable

There are three classes in this classification problem, with “setosa”, “versicolor” and “virginica” being the class labels.

## ['setosa' 'versicolor' 'virginica']
## [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
##  0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
##  1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2
##  2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
##  2 2]

Data preview

## (150, 4)
## [[5.1 3.5 1.4 0.2]
##  [4.9 3.  1.4 0.2]
##  [4.7 3.2 1.3 0.2]
##  [4.6 3.1 1.5 0.2]
##  [5.  3.6 1.4 0.2]
##  [5.4 3.9 1.7 0.4]
##  [4.6 3.4 1.4 0.3]
##  [5.  3.4 1.5 0.2]
##  [4.4 2.9 1.4 0.2]
##  [4.9 3.1 1.5 0.1]]
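
As a sketch of how the target and data previews above might be obtained (again assuming `load_iris`; the names `X`, `y` and `class_counts` are chosen for illustration):

```python
import numpy as np
from sklearn.datasets import load_iris

iris_dataset = load_iris()
X = iris_dataset["data"]      # shape (150, 4): one row per flower
y = iris_dataset["target"]    # integer class codes 0, 1, 2

print(iris_dataset["target_names"])  # species names indexed by class code
print(y)
print(X.shape)
print(X[:10])                        # first ten rows of measurements

# Count the samples per class; the description states 50 in each.
class_counts = np.bincount(y)
print(dict(zip(iris_dataset["target_names"], class_counts)))
```

The integer codes in `target` index into `target_names`, so code 0 corresponds to setosa, 1 to versicolor and 2 to virginica.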

First Things First, Look at Your Data

  • Before building a model, always visualize the data to check whether
    • There are problems with the data
    • The problem is amenable to machine learning
    • The dataset contains the information needed to solve the problem
  • A pair plot draws a scatter plot for every pair of features to show their relationships
    • This is feasible only for datasets with a small number of features
    • However, a pair plot cannot show the interaction of all features at once, so some interesting aspects of the data may not be revealed
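
One possible way to draw such a pair plot is pandas' `scatter_matrix`. This is a sketch under assumptions: the source does not show which plotting call was used, and the figure size, marker and bin count here are arbitrary choices.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the script also runs headless
import pandas as pd
from sklearn.datasets import load_iris

iris_dataset = load_iris()

# Put the measurements in a DataFrame so the columns carry feature names.
iris_df = pd.DataFrame(iris_dataset["data"], columns=iris_dataset["feature_names"])

# One scatter plot per feature pair, histograms on the diagonal,
# points coloured by integer class code.
axes = pd.plotting.scatter_matrix(
    iris_df,
    c=iris_dataset["target"],
    figsize=(10, 10),
    marker="o",
    hist_kwds={"bins": 20},
    alpha=0.8,
)

# The figure can then be saved or shown, e.g. axes[0, 0].get_figure().savefig(...)
print(axes.shape)
```

With four features the result is a 4 x 4 grid of panels, which is why this approach stops being practical as the number of features grows.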

Measuring Success: Training and Testing Data

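  • The dataset needs to be split into a training set and a test set
    • The data used to build the model cannot also be used to evaluate it, because a model that has simply memorized the training data would score well on it without necessarily generalizing to new data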

Split the dataset into 75% training set and 25% test set:

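  • The train_test_split function shuffles the dataset and, by default, extracts 75% of the samples as the training set; the remaining 25% form the test set
  • By scikit-learn convention, X denotes the input data (a two-dimensional array) and y denotes the class labels (a one-dimensional array)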
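
A minimal sketch of the split, assuming the data come from `load_iris`. The seed `random_state=0` is an assumption made here for reproducibility; the source does not state which seed, if any, was used.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris_dataset = load_iris()

# Shuffle and split; by default 25% of the rows go to the test set.
# random_state=0 is an assumed seed that makes the split reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    iris_dataset["data"], iris_dataset["target"], random_state=0
)

print(X_train.shape, X_test.shape)
print(y_train.shape, y_test.shape)
```

With 150 samples, the default split yields 112 training rows and 38 test rows.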

Check the shapes of the training and test set arrays

X_train

## (112, 4)

X_test

## (38, 4)

y_train

## (112,)

y_test

## (38,)

Building Your First Model: k-Nearest Neighbours

  • Building a kNN model consists only of storing the entire training set in memory
    • To make a prediction for a new data point in the simplest case (k = 1), the algorithm finds the training point closest to the new point and assigns that training point's label to the new point
    • k denotes the number of nearest neighbours considered; for k > 1, the prediction is the majority class among those neighbours
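
The rule described above can be sketched in a few lines of NumPy. This is an illustration of the idea only, not scikit-learn's implementation; the function name `knn_predict` and the toy data are invented for the example, and Euclidean distance is assumed.

```python
import numpy as np

def knn_predict(X_train, y_train, x_new, k=1):
    """Predict the label of x_new by majority vote among its k nearest training points."""
    # Euclidean distance from x_new to every stored training point.
    distances = np.linalg.norm(X_train - x_new, axis=1)
    # Indices of the k smallest distances.
    nearest = np.argsort(distances)[:k]
    # Majority vote over the neighbours' labels (ties go to the smallest label).
    return np.bincount(y_train[nearest]).argmax()

# Toy data invented for illustration: two well-separated groups.
X_toy = np.array([[1.0, 1.0], [1.2, 0.9], [5.0, 5.0], [5.1, 4.8]])
y_toy = np.array([0, 0, 1, 1])

pred_a = knn_predict(X_toy, y_toy, np.array([1.1, 1.0]), k=1)
pred_b = knn_predict(X_toy, y_toy, np.array([4.9, 5.2]), k=3)
print(pred_a, pred_b)
```

Note that there is no training step beyond keeping `X_train` and `y_train` around; all the work happens at prediction time.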

Create a kNN model object and fit it to the training set

  • All machine learning models in scikit-learn are implemented in their own classes, called Estimator classes
  • The knn object
    • Holds the information that the algorithm has extracted from the training data
    • Encapsulates the algorithms that will build the model from the training data and make predictions on new data points
    • n_neighbors specifies the number of neighbouring points considered
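
A minimal sketch of creating and fitting the model, assuming the same `load_iris` data and a split seeded with `random_state=0` (the seed is an assumption). `n_neighbors=1` matches the parameter listing shown in this section.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris_dataset = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris_dataset["data"], iris_dataset["target"], random_state=0  # assumed seed
)

# Instantiate the estimator with a single neighbour.
knn = KNeighborsClassifier(n_neighbors=1)

# fit stores the training data inside the object and returns the object itself.
fitted = knn.fit(X_train, y_train)

print(fitted is knn)
print(knn.get_params())  # parameters used to build the model
```

Because `fit` returns the estimator, construction and fitting can also be chained as `KNeighborsClassifier(n_neighbors=1).fit(X_train, y_train)`.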

Inspect the kNN model object to see parameters

  • The fit method returns the knn object itself (modified in place); its string representation, shown below, lists the parameters used in creating the model.
## KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
##            metric_params=None, n_jobs=None, n_neighbors=1, p=2,
##            weights='uniform')

Making Predictions

Input a new set of measurements to the model to predict the species:

## ['setosa']
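
The source does not show which measurements were used, so the values below are hypothetical. The sketch assumes the same 1-nearest-neighbour model and an assumed split seed of `random_state=0`.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris_dataset = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris_dataset["data"], iris_dataset["target"], random_state=0  # assumed seed
)
knn = KNeighborsClassifier(n_neighbors=1).fit(X_train, y_train)

# Hypothetical measurements in cm: sepal length, sepal width, petal length, petal width.
# scikit-learn expects a 2-D array, so the single sample is wrapped in an outer list.
X_new = np.array([[5.0, 2.9, 1.0, 0.2]])

prediction = knn.predict(X_new)                            # integer class code
predicted_name = iris_dataset["target_names"][prediction]  # map code to species name
print(prediction, predicted_name)
```

The short petal (1.0 cm) places this hypothetical flower well inside the setosa range, so the nearest training point is a setosa.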

Evaluating the Model

Calculate the mean accuracy of the model on the test set:

## 0.9736842105263158
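
Accuracy here is the fraction of test samples whose predicted label equals the true label. The sketch below computes it both by hand and with the estimator's `score` method, assuming a split seeded with `random_state=0`. Since the source does not state its seed, the printed value may differ from the one shown above.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris_dataset = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris_dataset["data"], iris_dataset["target"], random_state=0  # assumed seed
)
knn = KNeighborsClassifier(n_neighbors=1).fit(X_train, y_train)

# Predict a label for every test sample and compare with the true labels.
y_pred = knn.predict(X_test)
manual_accuracy = np.mean(y_pred == y_test)

# score performs the prediction and the comparison in one call.
score = knn.score(X_test, y_test)
print(manual_accuracy, score)
```

Both numbers are the same quantity; `score` is simply the more convenient form.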

Further reading

2019-01-06

 

A work by Nan Dong @ Intelligence Refinery

nandong823@gmail.com