I. Introduction - Iris Dataset

Meet the data
First Things First, Look at Your Data
Measuring Success: Training and Testing Data
- Split the dataset into 75% training set and 25% test set:
- Check the shapes of the training and test set arrays
  - X_train
  - X_test
  - y_train
  - y_test
Building Your First Model: k-Nearest Neighbours
- Create a kNN model object and fit it to the training set
- Inspect the kNN model object to see parameters
Making Predictions
Evaluating the Model
Summary of workflow
Further reading

Meet the data

## Import data
from sklearn.datasets import load_iris 
iris_dataset = load_iris()

Data description

print(iris_dataset['DESCR'])

## .. _iris_dataset:
## 
## Iris plants dataset
## --------------------
## 
## **Data Set Characteristics:**
## 
##     :Number of Instances: 150 (50 in each of three classes)
##     :Number of Attributes: 4 numeric, predictive attributes and the class
##     :Attribute Information:
##         - sepal length in cm
##         - sepal width in cm
##         - petal length in cm
##         - petal width in cm
##         - class:
##                 - Iris-Setosa
##                 - Iris-Versicolour
##                 - Iris-Virginica
##                 
##     :Summary Statistics:
## 
##     ============== ==== ==== ======= ===== ====================
##                     Min  Max   Mean    SD   Class Correlation
##     ============== ==== ==== ======= ===== ====================
##     sepal length:   4.3  7.9   5.84   0.83    0.7826
##     sepal width:    2.0  4.4   3.05   0.43   -0.4194
##     petal length:   1.0  6.9   3.76   1.76    0.9490  (high!)
##     petal width:    0.1  2.5   1.20   0.76    0.9565  (high!)
##     ============== ==== ==== ======= ===== ====================
## 
##     :Missing Attribute Values: None
##     :Class Distribution: 33.3% for each of 3 classes.
##     :Creator: R.A. Fisher
##     :Donor: Michael Marshall (MARSHALL%PLU@io.arc.nasa.gov)
##     :Date: July, 1988
## 
## The famous Iris database, first used by Sir R.A. Fisher. The dataset is taken
## from Fisher's paper. Note that it's the same as in R, but not as in the UCI
## Machine Learning Repository, which has two wrong data points.
## 
## This is perhaps the best known database to be found in the
## pattern recognition literature.  Fisher's paper is a classic in the field and
## is referenced frequently to this day.  (See Duda & Hart, for example.)  The
## data set contains 3 classes of 50 instances each, where each class refers to a
## type of iris plant.  One class is linearly separable from the other 2; the
## latter are NOT linearly separable from each other.
## 
## .. topic:: References
## 
##    - Fisher, R.A. "The use of multiple measurements in taxonomic problems"
##      Annual Eugenics, 7, Part II, 179-188 (1936); also in "Contributions to
##      Mathematical Statistics" (John Wiley, NY, 1950).
##    - Duda, R.O., & Hart, P.E. (1973) Pattern Classification and Scene Analysis.
##      (Q327.D83) John Wiley & Sons.  ISBN 0-471-22361-1.  See page 218.
##    - Dasarathy, B.V. (1980) "Nosing Around the Neighborhood: A New System
##      Structure and Classification Rule for Recognition in Partially Exposed
##      Environments".  IEEE Transactions on Pattern Analysis and Machine
##      Intelligence, Vol. PAMI-2, No. 1, 67-71.
##    - Gates, G.W. (1972) "The Reduced Nearest Neighbor Rule".  IEEE Transactions
##      on Information Theory, May 1972, 431-433.
##    - See also: 1988 MLC Proceedings, 54-64.  Cheeseman et al"s AUTOCLASS II
##      conceptual clustering system finds 3 classes in the data.
##    - Many, many more ...

Data type

print(type(iris_dataset))

## <class 'sklearn.utils.Bunch'>

A Bunch object is very similar to a dictionary, with key-value pairs.
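As a small illustration of the dictionary-like behaviour, the sketch below reads the same entry by key and by attribute. It reloads the dataset so that it runs on its own; the variable names `data_by_key` and `data_by_attr` are made up for this example.

```python
from sklearn.datasets import load_iris

iris_dataset = load_iris()

# Key-style access, as used throughout these notes
data_by_key = iris_dataset['data']

# Attribute-style access, which a Bunch also supports
data_by_attr = iris_dataset.data

# Both names refer to the same underlying NumPy array
print(data_by_key is data_by_attr)
print(data_by_key.shape)
```

Key-style access is used in the rest of these notes, so either form can be read interchangeably.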

Keys

print(iris_dataset.keys())

## dict_keys(['data', 'target', 'target_names', 'DESCR', 'feature_names', 'filename'])

Features

print(iris_dataset['feature_names'])

## ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']

Target variable

There are three classes in this classification problem, with “setosa”, “versicolor” and “virginica” being the class labels.

print(iris_dataset['target_names'])

## ['setosa' 'versicolor' 'virginica']

print(iris_dataset['target'])

## [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
##  0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
##  1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2
##  2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
##  2 2]
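The integer codes in `target` index into `target_names`. A minimal sketch of how to count the samples per class and translate the codes back to species names is shown here; `class_counts` and `species` are names chosen for this example only.

```python
import numpy as np
from sklearn.datasets import load_iris

iris_dataset = load_iris()

# Number of samples for each encoded class (0, 1, 2)
class_counts = np.bincount(iris_dataset['target'])

# Translate every integer code into its species name via array indexing
species = iris_dataset['target_names'][iris_dataset['target']]

for name, count in zip(iris_dataset['target_names'], class_counts):
    print(name, count)
```

Note that the samples are ordered by class, which is one reason the data needs to be shuffled before it is split later on.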

Data preview

## Data shape
print(iris_dataset['data'].shape)

## (150, 4)

## Preview of the first 10 elements (samples) in the array
print(iris_dataset['data'][:10])

## [[5.1 3.5 1.4 0.2]
##  [4.9 3.  1.4 0.2]
##  [4.7 3.2 1.3 0.2]
##  [4.6 3.1 1.5 0.2]
##  [5.  3.6 1.4 0.2]
##  [5.4 3.9 1.7 0.4]
##  [4.6 3.4 1.4 0.3]
##  [5.  3.4 1.5 0.2]
##  [4.4 2.9 1.4 0.2]
##  [4.9 3.1 1.5 0.1]]

First Things First, Look at Your Data

Before building a model, always visualize the data to check whether:
- There are problems with the data
- The problem is amenable to machine learning
- The dataset contains the information needed to solve the problem

A pair plot draws a scatter plot for every pair of features to show how they relate:
- This is feasible for datasets with a small number of features
- However, a pair plot cannot show the interaction of all features at once, so some interesting aspects of the data may not be revealed

## Import libraries
import pandas as pd
import mglearn
import matplotlib.pyplot as plt
  
## Create a dataframe from the NumPy array, labeled using feature names
iris_dataframe = pd.DataFrame(iris_dataset["data"], 
                              columns=iris_dataset["feature_names"])
  
## Create a scatter matrix from the dataframe, colored by the target variable
## (scatter_matrix creates its own figure, so no plt.figure() call is needed)
grr = pd.plotting.scatter_matrix(iris_dataframe, 
                        c=iris_dataset["target"], 
                        figsize=(15, 15), 
                        marker='o', 
                        hist_kwds={'bins': 20}, 
                        s=60, 
                        alpha=.8, 
                        cmap=mglearn.cm3)
plt.show()
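Alongside the plot, a quick numeric look can also help spot problems with the data. The following is only a sketch of one possible check using standard pandas methods; it is not part of the book's workflow, and `missing_per_feature` is a name introduced here.

```python
import pandas as pd
from sklearn.datasets import load_iris

iris_dataset = load_iris()
iris_dataframe = pd.DataFrame(iris_dataset['data'],
                              columns=iris_dataset['feature_names'])

# Count missing values in each feature column
missing_per_feature = iris_dataframe.isna().sum()
print(missing_per_feature)

# Summary statistics (count, mean, std, min, quartiles, max) per feature
print(iris_dataframe.describe())
```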

Measuring Success: Training and Testing Data

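The dataset must be split into a training set and a test set:
- The data used to build the model cannot also be used to evaluate it, because the model could simply memorize the training set and would then look more accurate than it really is on new data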

Split the dataset into 75% training set and 25% test set:

The train_test_split function shuffles the samples and, by default, assigns 75% of them to the training set and the remaining 25% to the test set.
In scikit-learn, X denotes the input data, while y denotes the class labels.

from sklearn.model_selection import train_test_split
  
X_train, X_test, y_train, y_test = train_test_split(iris_dataset['data'], 
                                                    iris_dataset['target'], 
                                                    random_state=0)
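The 75/25 proportion comes from the function's defaults. As a sketch, the split can also be written with the test fraction spelled out through the `test_size` parameter, which should give the same result for the same `random_state`; the `_b` suffixed names exist only for this comparison.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris_dataset = load_iris()
X, y = iris_dataset['data'], iris_dataset['target']

# Default split: no test_size given
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Same split with the test fraction made explicit
X_train_b, X_test_b, y_train_b, y_test_b = train_test_split(
    X, y, test_size=0.25, random_state=0)

print(X_train.shape, X_test.shape)
print(np.array_equal(X_test, X_test_b))
```

Fixing `random_state` makes the shuffle reproducible, so the same rows land in the test set on every run.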

Check the shapes of the training and test set arrays

`X_train`

print(X_train.shape)

## (112, 4)

`X_test`

print(X_test.shape)

## (38, 4)

`y_train`

print(y_train.shape)

## (112,)

`y_test`

print(y_test.shape)

## (38,)

Building Your First Model: k-Nearest Neighbours

A kNN model is built by simply storing the entire training set in memory
- In the simplest case (k = 1), to make a prediction for a new data point, the algorithm finds the closest point in the training set and assigns that point's label to the new data point
- More generally, k denotes the number of nearest neighbours considered; for k > 1, the prediction is the majority class among those neighbours
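To make the idea concrete, here is a minimal from-scratch sketch of a kNN prediction in NumPy. It assumes Euclidean distance and simple majority voting, and it is not how scikit-learn implements the algorithm internally. The function `knn_predict` and the toy arrays are hypothetical, invented for this illustration.

```python
from collections import Counter

import numpy as np


def knn_predict(X_train, y_train, x_new, k=1):
    """Predict the label of x_new by majority vote among its k nearest training points."""
    # Euclidean distance from x_new to every training point
    distances = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
    # Indices of the k closest training points
    nearest = np.argsort(distances)[:k]
    # Most common label among those neighbours
    votes = Counter(y_train[nearest].tolist())
    return votes.most_common(1)[0][0]


# Tiny made-up dataset: two points of class 0, three points of class 1
toy_X = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.2, 4.9], [4.8, 5.1]])
toy_y = np.array([0, 0, 1, 1, 1])

print(knn_predict(toy_X, toy_y, np.array([0.2, 0.1]), k=1))  # near the class 0 points
print(knn_predict(toy_X, toy_y, np.array([4.0, 4.0]), k=3))  # near the class 1 points
```

With an even k, votes can tie; this sketch then returns whichever tied label appears first among the neighbours, which is an arbitrary choice.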

Create a kNN model object and fit it to the training set

All machine learning models in scikit-learn are implemented in their own classes, called Estimator classes.
The knn object:
- Holds the information that the algorithm has extracted from the training data
- Encapsulates the algorithms that will build the model from the training data and make predictions on new data points
- n_neighbors specifies the number of neighbouring points considered

from sklearn.neighbors import KNeighborsClassifier
  
## Instantiate an object of the KNeighborsClassifier class 
knn = KNeighborsClassifier(n_neighbors=1)
  
## Call the fit method of the knn object, passing in the training data and training labels
knn.fit(X_train, y_train)

Inspect the kNN model object to see parameters

The fit method returns the knn object itself, modified in place. Printing the object shows its string representation, which lists the parameters used to create the model.

print(knn)

## KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
##            metric_params=None, n_jobs=None, n_neighbors=1, p=2,
##            weights='uniform')
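Depending on the scikit-learn version, `print(knn)` may list only the parameters that differ from their defaults. As a sketch, the full set can also be read as a dictionary with the estimator's `get_params` method; `params` is a name chosen for this example.

```python
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(n_neighbors=1)

# get_params returns every constructor parameter as a plain dictionary
params = knn.get_params()

print(params['n_neighbors'])
print(params['weights'])
print(params['metric'])
```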

Making Predictions

Input a new set of measurements to the model to predict the species:

import numpy as np
  
X_new = np.array([[5, 2.9, 1, 0.2]])
  
prediction = knn.predict(X_new)
  
print(iris_dataset['target_names'][prediction])

## ['setosa']
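Note that `predict` expects a two-dimensional array of shape (n_samples, n_features), which is why the single flower above is wrapped in double brackets. The sketch below rebuilds the model so that it runs on its own, reshapes a flat array of measurements, and then predicts for the whole test set in one call; `single_flower` and `y_pred` are names introduced here.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris_dataset = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris_dataset['data'], iris_dataset['target'], random_state=0)
knn = KNeighborsClassifier(n_neighbors=1).fit(X_train, y_train)

# A single flower given as a flat 1-D array of four measurements
single_flower = np.array([5, 2.9, 1, 0.2])

# Reshape to (1, 4): one sample with four features
X_new = single_flower.reshape(1, -1)
prediction = knn.predict(X_new)
print(iris_dataset['target_names'][prediction])

# Several samples can be predicted in a single call
y_pred = knn.predict(X_test)
print(y_pred.shape)
```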

Evaluating the Model

Calculate the mean accuracy of the model on the test set:

print(knn.score(X_test, y_test))

## 0.9736842105263158
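Accuracy here is the fraction of test samples whose predicted label matches the true label. As a sketch, the same number can be computed by hand from the predictions and compared with `knn.score`; the setup is repeated so that the block runs on its own, and `manual_accuracy` is a name chosen for this example.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris_dataset = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris_dataset['data'], iris_dataset['target'], random_state=0)
knn = KNeighborsClassifier(n_neighbors=1).fit(X_train, y_train)

# Predict a label for every test sample
y_pred = knn.predict(X_test)

# Fraction of predictions that match the true labels
manual_accuracy = np.mean(y_pred == y_test)

print(manual_accuracy)
print(knn.score(X_test, y_test))
```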

Summary of workflow

The iris dataset workflow presented in this chapter contains the core code for applying supervised machine learning algorithms using scikit-learn:

## Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(iris_dataset['data'], 
                                                    iris_dataset['target'], 
                                                    random_state=0)
                                                       
## Create an instance of the kNN model
knn = KNeighborsClassifier(n_neighbors=1)
   
## Fit the model to the training set
knn.fit(X_train, y_train)
  
## Measure model performance on the test set
print(knn.score(X_test, y_test))

Part 2 of notes for Introduction to Machine Learning with Python by Muller and Guido