Chapter 1 - The Machine Learning Pipeline

- Data
- Tasks
- Models
- Features

“Before diving into feature engineering, let’s take a moment to take a look at the overall machine learning pipeline. This will help us get situated in the larger picture of the application.” (p. 1)

Data

Data are observations of real-world phenomena, such as
- Daily stock prices
- Heart rate
- Customer purchases
However, data always contain measurement noise and missing information
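
To make the point about noise and missing information concrete, here is a minimal sketch using made-up heart rate readings. The values, the `None` marker for missing readings, and the plausible range are all assumptions chosen for illustration; they do not come from the book.

```python
from statistics import mean

# Hypothetical heart rate readings (beats per minute); None marks a missing reading.
readings = [72, 75, None, 71, 180, None, 74]

# Separate the observed values from the missing ones.
observed = [r for r in readings if r is not None]
n_missing = len(readings) - len(observed)

# A crude noise check: flag readings outside an assumed plausible resting range.
PLAUSIBLE_RANGE = (40, 120)  # assumption for illustration only
low, high = PLAUSIBLE_RANGE
suspect = [r for r in observed if not low <= r <= high]
clean = [r for r in observed if low <= r <= high]

print("missing:", n_missing)
print("suspect:", suspect)
print("mean of remaining readings:", mean(clean))
```

Real data cleaning depends on the domain; a fixed range check like this is only one simple way to surface suspicious measurements.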

Tasks

The machine learning workflow is a circuitous, multistage and iterative process
Models and features sit between raw data and the desired insights

Figure 1-2. The place of feature engineering in the machine learning workflow

Models

A model is essentially a formula that describes the relationships between different aspects of the data
- For example, a model uses a company’s earnings history, past stock prices, and industry to predict stock price.
As mathematical formulas relate numeric quantities to each other, numeric features must be created from raw data
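
As a rough illustration of "a model is a formula over numeric quantities", the sketch below uses a linear formula for the stock price example. The feature names, values, weights, and bias are all hypothetical; in practice the weights would be learned from data, and the book does not prescribe this particular model.

```python
# Hypothetical numeric features for one company (values are illustrative only).
features = {
    "earnings_growth": 0.05,   # last year's earnings growth as a fraction
    "last_price": 100.0,       # most recent stock price
    "is_tech_industry": 1.0,   # industry encoded as a 0/1 indicator
}

# Hypothetical weights and bias; a real model would learn these from data.
weights = {"earnings_growth": 40.0, "last_price": 1.0, "is_tech_industry": 2.0}
bias = 0.5


def predict(features, weights, bias):
    """Linear model: a weighted sum of numeric features plus a bias term."""
    return bias + sum(weights[name] * value for name, value in features.items())


predicted_price = predict(features, weights, bias)
print(predicted_price)
```

Note that the industry, which is not naturally a number, had to be encoded as an indicator before the formula could use it.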

Features

A feature is a numeric representation of raw data
Feature engineering is the process of formulating the most appropriate features given the data, the model, and the task
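Ideally, the features are relevant to the task and appropriate for the model that will be built, and their number is right: too few informative features leave the model unable to perform the task, while too many or irrelevant ones make it more expensive and harder to train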
- Bad features may require a much more complicated model to achieve the same performance as a model built with good features
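
The idea of a feature as a numeric representation of raw data can be sketched as follows. The record layout, the `CATEGORIES` vocabulary, and the `to_features` helper are hypothetical names made up for this example; which features are actually appropriate depends on the data, the model, and the task.

```python
from datetime import date

# A hypothetical raw purchase record: text, a date, and a list of item prices.
raw = {
    "customer": "alice",
    "purchase_date": date(2019, 1, 19),
    "category": "books",
    "item_prices": [12.5, 7.5, 30.0],
}

CATEGORIES = ["books", "electronics", "grocery"]  # assumed fixed vocabulary


def to_features(record):
    """Turn a raw record into a flat list of numbers a model can consume."""
    prices = record["item_prices"]
    # One indicator value per known category (one-hot encoding).
    one_hot = [1.0 if record["category"] == c else 0.0 for c in CATEGORIES]
    # weekday() returns 5 for Saturday and 6 for Sunday.
    is_weekend = float(record["purchase_date"].weekday() >= 5)
    return [float(len(prices)), sum(prices), is_weekend, *one_hot]


feature_vector = to_features(raw)
print(feature_vector)
```

The customer name is deliberately left out here: whether an identifier is a useful feature is a judgment call that depends on the task.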

2019-01-17