Chapter 1 - Introduction

Why Machine Learning?
- Problems Machine Learning Can Solve
- Knowing Your Task and Knowing Your Data
Why Python?
scikit-learn
Essential Libraries and Tools
- Jupyter notebook
- NumPy
- SciPy
- matplotlib
- pandas
- mglearn
The Iris Dataset
Summary and Outlook
Further Reading
- Sparse matrices
- kNN models

“In this chapter, we will explain why machine learning has become so popular and discuss what kinds of problems can be solved using machine learning. Then, we will show you how to build your first machine learning model, introducing important concepts along the way.” (pg.1)

Why Machine Learning?

“Machine Learning is about extracting knowledge from data.” (pg.1)

“Intelligent applications” that use expert-designed, hardcoded rules have two major disadvantages:
- Designing the rules requires a deep understanding of how a human expert makes the decisions
- The decision logic is specific to a single domain and task, so the system does not generalize; changing the task even slightly may require rewriting the whole system

Problems Machine Learning Can Solve

“The most successful kinds of machine learning algorithms are those that automate decision-making processes by generalizing from known examples.” (pg.2)

	Supervised learning	Unsupervised learning
Description	The user provides the algorithm with pairs of inputs and desired outputs, and the algorithm finds a way to produce the desired output given an input.	Only the input data is known, with no known output data given to the algorithm.
Advantages	The algorithms are well-understood and performance is easy to measure	Detect previously unknown or uncertain patterns
Disadvantages	Creating a dataset of inputs and outputs can be a laborious manual process	The algorithms are usually harder to understand and evaluate
Examples	- Identifying the zip code from handwritten digits on an envelope - Determining whether a tumor is benign based on a medical image - Detecting fraudulent activity in credit card transactions	- Identifying topics in a set of blog posts - Segmenting customers into groups with similar preferences - Detecting abnormal access patterns to a website

In both cases, the data needs to be represented in a form that the algorithm can understand, usually a table
- Each row, a “sample”, represents a data entity
- Each column, a “feature”, represents a property that describes these data entities
Feature extraction/engineering, that is, building a good representation of the data, is a key part of the process
- An ML algorithm cannot make predictions from data that contains no relevant information; for example, no algorithm can predict a patient's gender from their last name alone
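
As a minimal sketch of the tabular layout described above, the following builds a tiny dataset in which each row is a sample and each column is a feature. The feature names and values are hypothetical, invented purely for illustration, and do not come from the book.

```python
import numpy as np

# Hypothetical feature names and values, invented purely for illustration
feature_names = ["age", "purchases_per_year", "is_member"]

# Each row is one sample (a customer); each column is one feature
X = np.array([
    [34, 12, 1],
    [51,  3, 0],
    [27, 20, 1],
    [45,  7, 0],
])

n_samples, n_features = X.shape
print("Number of samples:", n_samples)
print("Number of features:", n_features)

# All features of the first sample (one row)
print("First sample:", X[0])

# One feature across all samples (one column)
ages = X[:, feature_names.index("age")]
print("Ages:", ages)
```

This (samples, features) layout is the one scikit-learn expects for its input data.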

Knowing Your Task and Knowing Your Data

One of the most important parts of the ML process is understanding how the data relates to the problem at hand
- Each algorithm differs in the kinds of data and problems it is best suited for
- Also keep in mind all the explicit and implicit assumptions that you might be making
When building an ML solution, keep the big picture in mind by asking these questions:
- What question(s) am I trying to answer? Do I think the data collected can answer that question?
- What is the best way to phrase my question(s) as a machine learning problem?
- Have I collected enough data to represent the problem I want to solve?
- What features of the data did I extract, and will these enable the right predictions?
- How will I measure success in my application?
- How will the machine learning solution interact with other parts of my research or business product?

Why Python?

Python has a wide range of libraries for data science
You can interact directly with the code using a terminal or the Jupyter Notebook
- This is important because ML and data analysis are iterative processes that require easy interaction
Python can also be used to create graphical user interfaces and web services

`scikit-learn`

An open-source project that contains a number of state-of-the-art ML algorithms
pip is a quick and convenient option to install scikit-learn and its dependencies:

pip install numpy scipy matplotlib ipython scikit-learn pandas

Essential Libraries and Tools

Jupyter notebook

An interactive environment for running code in many programming languages in the browser
A great tool for exploratory data analysis

NumPy

One of the fundamental packages of scientific computing in Python
The core functionality of NumPy is the ndarray class, an n-dimensional array whose elements must all be of the same type
scikit-learn takes in data in the form of NumPy arrays

import numpy as np
x = np.array([[1, 2, 3], [4, 5, 6]])
print(x)

## [[1 2 3]
##  [4 5 6]]
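
To make the "n-dimensional array of elements of the same type" description concrete, here is a small sketch that inspects the array created above. The `mixed` array is an extra illustration that is not in the original notes.

```python
import numpy as np

x = np.array([[1, 2, 3], [4, 5, 6]])

print(x.ndim)   # number of dimensions
print(x.shape)  # size along each dimension: (rows, columns)
print(x.dtype)  # the single element type shared by the whole array

# Mixing integers and floats still yields one common element type
mixed = np.array([1, 2.5, 3])
print(mixed.dtype)
```

Because all elements share one type, the integers in `mixed` are stored as floating-point numbers.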

SciPy

A collection of functions for scientific computing in Python, including
- advanced linear algebra routines
- mathematical function optimization
- signal processing
- special mathematical functions
- statistical distributions

Create a 2D NumPy array with a diagonal of ones, and zeros everywhere else:

from scipy import sparse
  
eye = np.eye(4) 
  
print(eye)

## [[1. 0. 0. 0.]
##  [0. 1. 0. 0.]
##  [0. 0. 1. 0.]
##  [0. 0. 0. 1.]]

Convert the NumPy array to a SciPy sparse matrix in compressed sparse row (CSR) format:

sparse_matrix = sparse.csr_matrix(eye) 
  
print(sparse_matrix)

##   (0, 0) 1.0
##   (1, 1) 1.0
##   (2, 2) 1.0
##   (3, 3) 1.0

Create a sparse representation directly:

data = np.ones(4) 
  
row_indices = np.arange(4) 
  
col_indices = np.arange(4) 
  
eye_coo = sparse.coo_matrix((data, (row_indices, col_indices))) 
  
print(eye_coo)

##   (0, 0) 1.0
##   (1, 1) 1.0
##   (2, 2) 1.0
##   (3, 3) 1.0
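
As a quick check, and only a sketch rather than part of the original notes, the following confirms that the CSR and COO objects describe the same matrix and that only the non-zero entries are stored. It rebuilds both objects so that it can run on its own.

```python
import numpy as np
from scipy import sparse

eye = np.eye(4)
sparse_csr = sparse.csr_matrix(eye)

data = np.ones(4)
row_indices = np.arange(4)
col_indices = np.arange(4)
eye_coo = sparse.coo_matrix((data, (row_indices, col_indices)))

# Only the non-zero entries are stored in the sparse representation
print("Stored entries:", sparse_csr.nnz, "of", eye.size)

# Converting back to dense arrays shows that both formats hold the same matrix
print(np.array_equal(sparse_csr.toarray(), eye))
print(np.array_equal(eye_coo.toarray(), eye))
```

For a 4 x 4 identity matrix the saving is trivial, but for large matrices that are mostly zeros the dense version may not fit in memory at all.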

`matplotlib`

The primary scientific plotting library in Python
Visualizing the data and different aspects of the analysis can provide important insights.

import matplotlib.pyplot as plt

plt.figure(1)

# Generate 100 evenly spaced numbers from -10 to 10
x = np.linspace(-10, 10, 100)

# Create a second array using sine
y = np.sin(x)

# The plot function makes a line chart of one array against another
plt.plot(x, y, marker="x")

# Display the figure (plt.show does not take a figure number)
plt.show()

`pandas`

A Python library for data wrangling and analysis
A pandas DataFrame is a table, similar to an Excel spreadsheet, on which a wide range of modifications and operations can be performed
Each column of the dataframe can be a different data type
Data from many different file types and databases can be loaded into a pandas dataframe, such as CSV files, Excel files, and SQL databases

Creating a dataframe:

import pandas as pd 
import mglearn
  
# Create a simple dataset of people 
data = {'Name': ["John", "Anna", "Peter", "Linda"], 
  'Location' : ["New York", "Paris", "Berlin", "London"],         
  'Age' : [24, 13, 53, 33]        
  } 
  
# Convert data into a dataframe
data_pandas = pd.DataFrame(data) 
  
# Print the dataframe
print(data_pandas)

##     Name  Location  Age
## 0   John  New York   24
## 1   Anna     Paris   13
## 2  Peter    Berlin   53
## 3  Linda    London   33

Data in the dataframe can be easily selected/filtered:

print(data_pandas[data_pandas["Age"] > 30])

##     Name Location  Age
## 2  Peter   Berlin   53
## 3  Linda   London   33
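
A few further selection patterns, sketched here on the same example data. The combined condition and the variable names `over_30_outside_berlin` and `ages` are illustrative additions, not taken from the book.

```python
import pandas as pd

data = {'Name': ["John", "Anna", "Peter", "Linda"],
        'Location': ["New York", "Paris", "Berlin", "London"],
        'Age': [24, 13, 53, 33]}
data_pandas = pd.DataFrame(data)

# Select a single column (returns a pandas Series)
print(data_pandas["Name"])

# Combine conditions with & (and) or | (or); each condition needs parentheses
over_30_outside_berlin = data_pandas[(data_pandas["Age"] > 30) &
                                     (data_pandas["Location"] != "Berlin")]
print(over_30_outside_berlin)

# Convert a numeric column to a NumPy array, the format scikit-learn works with
ages = data_pandas["Age"].to_numpy()
print(ages)
```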

`mglearn`

A library of utility functions written for this book to quickly load data or plot graphs

## Install
pip install mglearn

The Iris Dataset

Meet the data

## Import data
from sklearn.datasets import load_iris 
iris_dataset = load_iris()

Data description

print(iris_dataset['DESCR'])

## .. _iris_dataset:
## 
## Iris plants dataset
## --------------------
## 
## **Data Set Characteristics:**
## 
##     :Number of Instances: 150 (50 in each of three classes)
##     :Number of Attributes: 4 numeric, predictive attributes and the class
##     :Attribute Information:
##         - sepal length in cm
##         - sepal width in cm
##         - petal length in cm
##         - petal width in cm
##         - class:
##                 - Iris-Setosa
##                 - Iris-Versicolour
##                 - Iris-Virginica
##                 
##     :Summary Statistics:
## 
##     ============== ==== ==== ======= ===== ====================
##                     Min  Max   Mean    SD   Class Correlation
##     ============== ==== ==== ======= ===== ====================
##     sepal length:   4.3  7.9   5.84   0.83    0.7826
##     sepal width:    2.0  4.4   3.05   0.43   -0.4194
##     petal length:   1.0  6.9   3.76   1.76    0.9490  (high!)
##     petal width:    0.1  2.5   1.20   0.76    0.9565  (high!)
##     ============== ==== ==== ======= ===== ====================
## 
##     :Missing Attribute Values: None
##     :Class Distribution: 33.3% for each of 3 classes.
##     :Creator: R.A. Fisher
##     :Donor: Michael Marshall (MARSHALL%PLU@io.arc.nasa.gov)
##     :Date: July, 1988
## 
## The famous Iris database, first used by Sir R.A. Fisher. The dataset is taken
## from Fisher's paper. Note that it's the same as in R, but not as in the UCI
## Machine Learning Repository, which has two wrong data points.
## 
## This is perhaps the best known database to be found in the
## pattern recognition literature.  Fisher's paper is a classic in the field and
## is referenced frequently to this day.  (See Duda & Hart, for example.)  The
## data set contains 3 classes of 50 instances each, where each class refers to a
## type of iris plant.  One class is linearly separable from the other 2; the
## latter are NOT linearly separable from each other.
## 
## .. topic:: References
## 
##    - Fisher, R.A. "The use of multiple measurements in taxonomic problems"
##      Annual Eugenics, 7, Part II, 179-188 (1936); also in "Contributions to
##      Mathematical Statistics" (John Wiley, NY, 1950).
##    - Duda, R.O., & Hart, P.E. (1973) Pattern Classification and Scene Analysis.
##      (Q327.D83) John Wiley & Sons.  ISBN 0-471-22361-1.  See page 218.
##    - Dasarathy, B.V. (1980) "Nosing Around the Neighborhood: A New System
##      Structure and Classification Rule for Recognition in Partially Exposed
##      Environments".  IEEE Transactions on Pattern Analysis and Machine
##      Intelligence, Vol. PAMI-2, No. 1, 67-71.
##    - Gates, G.W. (1972) "The Reduced Nearest Neighbor Rule".  IEEE Transactions
##      on Information Theory, May 1972, 431-433.
##    - See also: 1988 MLC Proceedings, 54-64.  Cheeseman et al"s AUTOCLASS II
##      conceptual clustering system finds 3 classes in the data.
##    - Many, many more ...

Data type

print(type(iris_dataset))

## <class 'sklearn.utils.Bunch'>

A Bunch object is very similar to a dictionary, with key-value pairs.

Keys

print(iris_dataset.keys())

## dict_keys(['data', 'target', 'target_names', 'DESCR', 'feature_names', 'filename'])
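
Besides dictionary-style access, a Bunch also exposes its keys as attributes. The short sketch below illustrates both styles; the exact set of keys may differ between scikit-learn versions, so it is printed rather than assumed.

```python
from sklearn.datasets import load_iris

iris_dataset = load_iris()

# Dictionary-style and attribute-style access return the same object
print(iris_dataset['data'] is iris_dataset.data)

# Keys can be tested and listed like those of a dictionary
print('target_names' in iris_dataset)
print(sorted(iris_dataset.keys()))
```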

Features

print(iris_dataset['feature_names'])

## ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']

Target variable

There are three classes in this classification problem, with “setosa”, “versicolor” and “virginica” being the class labels.

print(iris_dataset['target_names'])

## ['setosa' 'versicolor' 'virginica']

print(iris_dataset['target'])

## [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
##  0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
##  1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2
##  2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
##  2 2]
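
The integer labels index directly into the `target_names` array. As a small sketch (the variable names `counts` and `species` are ours), this counts the samples per class and translates the labels back to species names:

```python
import numpy as np
from sklearn.datasets import load_iris

iris_dataset = load_iris()
target = iris_dataset['target']
target_names = iris_dataset['target_names']

# Count how many samples carry each integer label
counts = np.bincount(target)
for name, count in zip(target_names, counts):
    print(name, count)

# Indexing the names array with the integer labels yields the species per sample
species = target_names[target]
print(species[:3], species[-3:])
```

This agrees with the data description above, which reports 50 instances in each of the three classes.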

Data preview

## Data shape
print(iris_dataset['data'].shape)

## (150, 4)

## Preview of the first 10 elements (samples) in the array
print(iris_dataset['data'][:10])

## [[5.1 3.5 1.4 0.2]
##  [4.9 3.  1.4 0.2]
##  [4.7 3.2 1.3 0.2]
##  [4.6 3.1 1.5 0.2]
##  [5.  3.6 1.4 0.2]
##  [5.4 3.9 1.7 0.4]
##  [4.6 3.4 1.4 0.3]
##  [5.  3.4 1.5 0.2]
##  [4.4 2.9 1.4 0.2]
##  [4.9 3.1 1.5 0.1]]

First Things First, Look at Your Data

Before building the model, you should always visualize the data to check whether
- There are problems with the data
- The problem is amenable to machine learning
- The dataset contains the information needed to solve the problem
A pair plot creates a scatterplot for every pair of features, to show their relationships
- This is feasible for datasets with a small number of features
- However, a pair plot cannot show the interaction of all of the features at once, so some interesting aspects of the data may not be revealed

## Import libraries
import pandas as pd
import mglearn
import matplotlib.pyplot as plt

## Create a dataframe from the NumPy array, labeled using feature names
iris_dataframe = pd.DataFrame(iris_dataset["data"], 
                              columns=iris_dataset["feature_names"])

## Create a scatter matrix from the dataframe, colored by the target variable
## (scatter_matrix creates its own figure, sized by figsize)
grr = pd.plotting.scatter_matrix(iris_dataframe, 
                        c=iris_dataset["target"], 
                        figsize=(15, 15), 
                        marker='o', 
                        hist_kwds={'bins': 20}, 
                        s=60, 
                        alpha=.8, 
                        cmap=mglearn.cm3)
plt.show()

Measuring Success: Training and Testing Data

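The dataset needs to be split into a training set and a test set
- The data used to build the model cannot be used to evaluate it, because the model could simply remember the training set and would always predict the correct labels for it

Split the dataset into 75% training set and 25% test set:

The train_test_split function shuffles the dataset and, by default, extracts 75% of the samples as the training set and the remaining 25% as the test set
Setting random_state fixes the seed of the shuffle, so the split is reproducible
In scikit-learn, the data is usually denoted X, while the labels are denoted y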

from sklearn.model_selection import train_test_split
  
X_train, X_test, y_train, y_test = train_test_split(iris_dataset['data'], 
                                                    iris_dataset['target'], 
                                                    random_state=0)

Check the shapes of the training and test set arrays

`X_train`

print(X_train.shape)

## (112, 4)

`X_test`

print(X_test.shape)

## (38, 4)

`y_train`

print(y_train.shape)

## (112,)

`y_test`

print(y_test.shape)

## (38,)
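
The sketch below spells out the default split ratio with the `test_size` parameter and checks that a fixed `random_state` gives the same split on every call. The names `split_a`, `split_b` and `same` are ours, not the book's.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris_dataset = load_iris()
X, y = iris_dataset['data'], iris_dataset['target']

# test_size=0.25 spells out the default 75%/25% split
split_a = train_test_split(X, y, test_size=0.25, random_state=0)
split_b = train_test_split(X, y, test_size=0.25, random_state=0)

# The same random_state produces the same shuffled split
same = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))
print("Identical splits:", same)

X_train, X_test, y_train, y_test = split_a
print(X_train.shape, X_test.shape)
```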

Building Your First Model: k-Nearest Neighbours

Building a kNN model consists only of storing the entire training set in memory
- To make a prediction for a new data point, the algorithm finds the point in the training set that is closest to the new point and assigns the label of this training point to the new data point
- k denotes the number of nearest neighbours considered; with k > 1, the prediction is the majority class among those neighbours
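
To illustrate the idea, here is a minimal from-scratch sketch of kNN prediction using Euclidean distance and a majority vote. It is not how scikit-learn implements the algorithm; the function name `knn_predict` and the toy data are hypothetical.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=1):
    """Predict the label of x_new by majority vote among its k nearest training points."""
    # Euclidean distance from x_new to every training point
    distances = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
    # Indices of the k smallest distances
    nearest = np.argsort(distances)[:k]
    # Most common label among the k nearest neighbours
    votes = Counter(y_train[nearest].tolist())
    return votes.most_common(1)[0][0]

# Tiny made-up training set: two features, two classes
X_toy = np.array([[1.0, 1.0], [1.5, 2.0], [5.0, 5.0], [6.0, 5.5]])
y_toy = np.array([0, 0, 1, 1])

label_a = knn_predict(X_toy, y_toy, np.array([1.2, 1.1]), k=1)
label_b = knn_predict(X_toy, y_toy, np.array([5.5, 5.0]), k=3)
print(label_a, label_b)
```

With k=3, the second query point has two neighbours of class 1 and one of class 0, so the majority vote returns class 1. This sketch does not handle tied votes in any principled way.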

Create a kNN model object and fit it to the training set

All machine learning models in scikit-learn are implemented in their own classes, called Estimator classes
The knn object
- Encapsulates the algorithm that builds the model from the training data and the algorithm that makes predictions on new data points
- Holds the information that the algorithm has extracted from the training data
- Is configured through parameters such as n_neighbors, the number of neighbouring points considered

from sklearn.neighbors import KNeighborsClassifier
  
## Instantiate an object of the KNeighborsClassifier class
knn = KNeighborsClassifier(n_neighbors=1)

## Call the fit method of the knn object, passing in the training data and labels
knn.fit(X_train, y_train)

Inspect the kNN model object to see parameters

The fit method returns the knn object itself, which it modifies in place. Printing the object shows its string representation, which lists the parameters used in creating the model.

print(knn)

## KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
##            metric_params=None, n_jobs=None, n_neighbors=1, p=2,
##            weights='uniform')

Making Predictions

Input a new set of measurements to the model to predict the species

X_new = np.array([[5, 2.9, 1, 0.2]])
  
prediction = knn.predict(X_new)
  
print(iris_dataset['target_names'][prediction])

## ['setosa']
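
scikit-learn expects the input to `predict` as a two-dimensional array of shape (samples, features), even for a single flower. The sketch below retrains the same model so that it runs on its own, then predicts two samples at once; the second measurement row is made up for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris_dataset = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris_dataset['data'], iris_dataset['target'], random_state=0)
knn = KNeighborsClassifier(n_neighbors=1).fit(X_train, y_train)

# A single flower still needs a 2D array: 1 sample (row) by 4 features (columns)
X_new = np.array([[5, 2.9, 1, 0.2]])
print(X_new.shape)

# Several samples are predicted at once by stacking them as rows
# (the second row is a made-up measurement for illustration)
X_several = np.array([[5, 2.9, 1, 0.2],
                      [6.5, 3.0, 5.5, 2.0]])
predictions = knn.predict(X_several)
print(predictions.shape)
print(iris_dataset['target_names'][predictions])
```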

Evaluating the Model

Calculate the mean accuracy of the model on the test set

print(knn.score(X_test, y_test))

## 0.9736842105263158
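
For a classifier, the mean accuracy is the fraction of test samples whose predicted label matches the true label. As a sketch, it can be computed by hand and compared with `score`; the variable names `y_pred` and `manual_accuracy` are ours.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris_dataset = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris_dataset['data'], iris_dataset['target'], random_state=0)
knn = KNeighborsClassifier(n_neighbors=1).fit(X_train, y_train)

# Predict a label for every sample in the test set
y_pred = knn.predict(X_test)

# Fraction of predictions that match the true labels
manual_accuracy = np.mean(y_pred == y_test)
print(manual_accuracy)

# The score method computes the same quantity
print(knn.score(X_test, y_test))
```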

Summary and Outlook

The iris dataset workflow presented in this chapter contains the core code for applying supervised machine learning algorithms using scikit-learn:

## Split the training and test set
X_train, X_test, y_train, y_test = train_test_split(iris_dataset['data'], 
                                                    iris_dataset['target'], 
                                                    random_state=0)
                                                       
## Create an instance of the kNN model
knn = KNeighborsClassifier(n_neighbors=1)
   
## Fit the model to the training set
knn.fit(X_train, y_train)
  
## Measure model performance on the test set
print(knn.score(X_test, y_test))

Part 1 of notes for Introduction to Machine Learning with Python by Müller and Guido
