I. Introduction - Background
Part 1 of notes for Introduction to Machine Learning with Python by Muller and Guido
“In this chapter, we will explain why machine learning has become so popular and discuss what kinds of problems can be solved using machine learning. Then, we will show you how to build your first machine learning model, introducing important concepts along the way.” (pg.1)
1 Why Machine Learning?
“Machine Learning is about extracting knowledge from data.” (pg.1)
- “Intelligent applications” that use expert-designed hardcoded rules have two major disadvantages:
- Designing the rules requires a deep understanding of how a human expert makes the decisions
- The system does not generalize, because the logic behind the decisions is specific to a single domain and task
1.1 Problems Machine Learning Can Solve
“The most successful kinds of machine learning algorithms are those that automate decision-making processes by generalizing from known examples.” (pg.2)
| | Supervised learning | Unsupervised learning |
|---|---|---|
| Description | The user provides the algorithm with pairs of inputs and desired outputs, and the algorithm finds a way to produce the desired output given an input | Only the input data is known, with no known output data given to the algorithm |
| Advantages | The algorithms are well understood and performance is easy to measure | Can detect previously unknown or uncertain patterns |
| Disadvantages | Creating a dataset of inputs and outputs can be a laborious manual process | The algorithms are usually harder to understand and evaluate |
| Examples | Identifying the zip code from handwritten digits on an envelope; determining whether a tumor is benign based on a medical image; detecting fraudulent activity in credit card transactions | Identifying topics in a set of blog posts; segmenting customers into groups with similar preferences; detecting abnormal access patterns to a website |
- In both cases, the data need to be represented in a form understandable by the algorithm, namely in tabular form
- Each row, a “sample”, represents a data entity
- Each column, a “feature”, represents a property that describes these data entities
- Feature extraction/engineering is a key part of building a good representation of the dataset
- An ML algorithm cannot make predictions from data that carries no relevant information; for example, it cannot predict a person's gender if the only feature available is their last name
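The tabular representation and the supervised/unsupervised distinction can be sketched in a few lines. This is a minimal illustration, not an example from the book: the toy data, the variable names, and the choice of `KNeighborsClassifier` and `KMeans` are all assumptions made for the sketch.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical toy dataset: 6 samples (rows) described by 2 features (columns)
X = np.array([[1.0, 1.1], [1.2, 0.9], [0.8, 1.0],
              [5.0, 5.1], [5.2, 4.9], [4.8, 5.0]])
# Desired outputs, available only in the supervised setting
y = np.array([0, 0, 0, 1, 1, 1])

n_samples, n_features = X.shape
print("samples:", n_samples, "features:", n_features)

# Supervised: learn from (input, output) pairs, then predict for new inputs
clf = KNeighborsClassifier(n_neighbors=1).fit(X, y)
predicted = clf.predict(np.array([[1.1, 1.0], [5.1, 5.0]]))
print("predicted labels:", predicted)

# Unsupervised: only the inputs are given; the algorithm groups similar samples
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
cluster_ids = km.labels_
print("cluster assignments:", cluster_ids)
```

The cluster numbers returned by `KMeans` are arbitrary identifiers, so only the grouping is meaningful, not which group is called 0 or 1.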
1.2 Knowing Your Task and Knowing Your Data
- One of the most important parts of the ML process is understanding how the data relates to the problem at hand
- Each algorithm differs in the type of data and problem it is best suited for
- Also keep in mind all the explicit and implicit assumptions that you might be making
- When building an ML solution, keep the big picture in mind by asking these questions:
- What question(s) am I trying to answer? Do I think the data collected can answer that question?
- What is the best way to phrase my question(s) as a machine learning problem?
- Have I collected enough data to represent the problem I want to solve?
- What features of the data did I extract, and will these enable the right predictions?
- How will I measure success in my application?
- How will the machine learning solution interact with other parts of my research or business product?
2 Why Python?
- Python has a wide range of libraries for data science
- You can interact directly with the code using a terminal or the Jupyter Notebook
- This is important because ML and data analysis are iterative processes that require easy interaction
- Python can also be used to create graphical user interfaces and web services
3 scikit-learn
- An open-source project that contains a number of state-of-the-art ML algorithms
pip is a quick and convenient option to install scikit-learn and its dependencies:
pip install scikit-learn
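After installing, it can be useful to confirm which of the libraries used in these notes are available in the current environment. The sketch below is one possible check using only the standard library; the package list is an assumption based on the tools covered in this part, and note that scikit-learn is imported under the name `sklearn`.

```python
import importlib.util

# Import names of the libraries discussed in these notes (assumed list)
packages = ["numpy", "scipy", "matplotlib", "pandas", "sklearn"]

# find_spec returns None when a top-level package cannot be found
status = {name: importlib.util.find_spec(name) is not None for name in packages}

for name, available in status.items():
    print(f"{name}: {'installed' if available else 'missing'}")
```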
4 Essential Libraries and Tools
4.1 Jupyter
- An interactive environment for running code in many programming languages in the browser
- A great tool for exploratory data analysis
4.2 NumPy
- One of the fundamental packages of scientific computing in Python
- The core functionality of NumPy is the ndarray class, an n-dimensional array of elements of the same type
- scikit-learn takes in data in the form of NumPy arrays
## [[1 2 3]
## [4 5 6]]
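A minimal sketch that would produce an array like the one printed above, assuming a 2-row by 3-column array of integers (the variable name `x` is chosen here for illustration):

```python
import numpy as np

# Build a 2D ndarray from a nested list: 2 rows, 3 columns
x = np.array([[1, 2, 3], [4, 5, 6]])
print(x)

# shape gives (rows, columns); ndim gives the number of dimensions
print(x.shape, x.ndim)
```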
4.3 SciPy
- A collection of functions for scientific computing in Python, including
- advanced linear algebra routines
- mathematical function optimization
- signal processing
- special mathematical functions
- statistical distributions
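To make a few of these areas concrete, here is a small sketch touching linear algebra, optimization, and statistical distributions. The specific system of equations, the function being minimized, and the variable names are illustrative assumptions, not examples from the book.

```python
import numpy as np
from scipy import linalg, optimize, stats

# Linear algebra: solve the system A @ solution = b
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([9.0, 8.0])
solution = linalg.solve(A, b)
print("solution:", solution)

# Optimization: find the minimum of a one-dimensional function
result = optimize.minimize_scalar(lambda x: (x - 2.0) ** 2)
print("minimizer:", result.x)

# Statistical distributions: cumulative probability of a standard normal at 0
p = stats.norm.cdf(0.0)
print("P(Z <= 0):", p)
```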
4.3.1 Create a 2D NumPy array with a diagonal of ones, and zeros everywhere else:
## [[1. 0. 0. 0.]
## [0. 1. 0. 0.]
## [0. 0. 1. 0.]
## [0. 0. 0. 1.]]
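The code for this step is not shown above; one way to build such an array, assuming a 4 x 4 identity matrix is intended, is `np.eye`:

```python
import numpy as np

# 4 x 4 array with ones on the main diagonal and zeros elsewhere
eye = np.eye(4)
print(eye)
```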
4.3.2 Convert the NumPy array to a SciPy sparse matrix in compressed sparse row (CSR) format:
## (0, 0) 1.0
## (1, 1) 1.0
## (2, 2) 1.0
## (3, 3) 1.0
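The conversion itself is not shown above; a minimal sketch, assuming the dense array comes from `np.eye(4)` and using `scipy.sparse.csr_matrix` (the variable names are chosen for illustration):

```python
import numpy as np
from scipy import sparse

eye = np.eye(4)

# Only the non-zero entries are stored in the CSR representation
sparse_matrix = sparse.csr_matrix(eye)
print(sparse_matrix)

# Number of explicitly stored values
print("stored entries:", sparse_matrix.nnz)
```

The exact text printed for a sparse matrix can vary between SciPy versions, but it lists the coordinates and values of the stored entries.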
4.3.3 Create a sparse representation directly:
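import numpy as np
from scipy import sparse
# Values of the non-zero entries
data = np.ones(4)
# Row and column index of each non-zero entry
row_indices = np.arange(4)
col_indices = np.arange(4)
eye_coo = sparse.coo_matrix((data, (row_indices, col_indices)))
print(eye_coo)
## (0, 0) 1.0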
## (1, 1) 1.0
## (2, 2) 1.0
## (3, 3) 1.0
4.4 matplotlib
- The primary scientific plotting library in Python
- Visualizing the data and different aspects of the analysis can provide important insights.
import matplotlib.pyplot as plt
import numpy as np
plt.figure(1)
# Generate 100 evenly spaced numbers from -10 to 10
x = np.linspace(-10, 10, 100)
# Create a second array using sine
y = np.sin(x)
# The plot function makes a line chart of one array against another
plt.plot(x, y, marker="x")
plt.show()
4.5 pandas
- A Python library for data wrangling and analysis
- A pandas DataFrame is a table, similar to an Excel spreadsheet, on which a wide range of modifications and operations can be performed
- Each column of the dataframe can be a different data type
- Files/databases of many different types can be imported into a pandas dataframe, such as CSV files, Excel files, and SQL databases
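As a small illustration of importing data and of per-column data types, the sketch below reads CSV text with `pd.read_csv`. To keep it self-contained it reads from an in-memory string rather than a file on disk; the CSV contents and variable names are made up for the example.

```python
import io

import pandas as pd

# Hypothetical CSV content; in practice this would usually be a file path
csv_text = "Name,Location,Age\nJohn,New York,24\nAnna,Paris,13\n"

df = pd.read_csv(io.StringIO(csv_text))
print(df)

# Each column has its own data type: text for Name/Location, integer for Age
print(df.dtypes)
```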
4.5.1 Creating a dataframe:
import pandas as pd
# Create a simple dataset of people
data = {'Name': ["John", "Anna", "Peter", "Linda"],
        'Location': ["New York", "Paris", "Berlin", "London"],
        'Age': [24, 13, 53, 33]
        }
# Convert data into a dataframe
data_pandas = pd.DataFrame(data)
# Print the dataframe
print(data_pandas)
## Name Location Age
## 0 John New York 24
## 1 Anna Paris 13
## 2 Peter Berlin 53
## 3 Linda London 33
4.5.2 Data in the dataframe can be easily selected/filtered:
## Name Location Age
## 2 Peter Berlin 53
## 3 Linda London 33
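The selection code is not shown above. Output like this can be obtained with a boolean mask; the sketch below assumes the filter is "Age greater than 30", which matches the two rows printed, and rebuilds the dataframe so that it runs on its own.

```python
import pandas as pd

data = {'Name': ["John", "Anna", "Peter", "Linda"],
        'Location': ["New York", "Paris", "Berlin", "London"],
        'Age': [24, 13, 53, 33]}
data_pandas = pd.DataFrame(data)

# The comparison yields a boolean Series; indexing with it keeps the True rows
older_than_30 = data_pandas[data_pandas["Age"] > 30]
print(older_than_30)
```

The original row labels (2 and 3) are preserved in the filtered result.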
5 Further Reading
Sparse matrices
- Sparse Matrices in SciPy - SciPy Lecture Notes
- A Gentle Introduction to Sparse Matrices for Machine Learning - Machine Learning Mastery by Jason Brownlee
- Sparse Matrices For Efficient Machine Learning - Standard Deviations by David Ziganto