Chapter 1 - The Machine Learning Pipeline
Part I of study notes for “Feature Engineering for Machine Learning” by Zheng and Casari (2018)
“Before diving into feature engineering, let’s take a moment to take a look at the overall machine learning pipeline. This will help us get situated in the larger picture of the application.” (p. 1)
Data
- Data are observations of real-world phenomena, such as
  - Daily stock prices
  - Heart rate
  - Customer purchases
- However, data always contain measurement noise and missing information
Tasks
- The machine learning workflow is a circuitous, multistage and iterative process
- Models and features sit between raw data and the desired insights
Figure 1-2. The place of feature engineering in the machine learning workflow
Models
- A model is essentially a formula that describes the relationships between different aspects of the data
- For example, a model might use a company’s earnings history, past stock prices, and industry to predict its stock price
- Because mathematical formulas relate numeric quantities to each other, numeric features must be created from the raw data
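As a rough illustration of the stock-price example above, the sketch below turns one raw record into numeric features and feeds them to a simple linear formula. It is not taken from the book: the record fields, the `INDUSTRIES` list, the `featurize` and `linear_model` helpers, and the weights are all hypothetical and chosen only to show the idea.

```python
# Hypothetical raw record for one company; all field names and values are made up.
raw_record = {
    "earnings_history": [1.2, 1.5, 1.9],  # e.g. yearly earnings per share
    "past_prices": [10.0, 11.0, 12.5],    # recent closing prices
    "industry": "tech",                   # categorical, not numeric
}

# Assumed, fixed list of categories used to one-hot encode the industry.
INDUSTRIES = ["energy", "finance", "tech"]


def featurize(record):
    """Turn one raw record into a flat list of numeric features."""
    earnings = record["earnings_history"]
    mean_earnings = sum(earnings) / len(earnings)
    last_price = record["past_prices"][-1]
    industry_one_hot = [
        1.0 if record["industry"] == name else 0.0 for name in INDUSTRIES
    ]
    return [mean_earnings, last_price] + industry_one_hot


def linear_model(features, weights, bias):
    """A model as a formula: bias plus the weighted sum of the features."""
    return bias + sum(w * x for w, x in zip(weights, features))


features = featurize(raw_record)
# Illustrative weights only; a real model would learn them from data.
weights = [0.5, 1.0, 0.0, 0.0, 0.2]
prediction = linear_model(features, weights, bias=0.0)
print(features, prediction)
```

The industry string cannot enter the formula directly, so it is one-hot encoded here; other encodings would work too, and which one is appropriate depends on the model and the task.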
Features
- A feature is a numeric representation of raw data
- Feature engineering is the process of formulating the most appropriate features given the data, the model, and the task
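- Ideally, the features are relevant to the task and appropriate for the model that will be built, and there are neither too few of them (the model cannot perform the task) nor too many (the model becomes more expensive and harder to train)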
- Bad features may require a much more complicated model to achieve the same performance as a model built with good features
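The last point can be seen in a toy setting, sketched below under made-up assumptions: the target is exactly the square of a single raw input. A straight-line model on the raw input fits poorly and would need to be replaced by something more complicated, while the same straight-line model on a squared feature fits exactly. The data and the helper names `fit_line` and `mean_squared_error` are illustrative, not from the book.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one feature: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    var_x = sum((x - mean_x) ** 2 for x in xs)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = cov_xy / var_x
    return slope, mean_y - slope * mean_x


def mean_squared_error(xs, ys, slope, intercept):
    """Average squared gap between the line's predictions and the targets."""
    return sum((slope * x + intercept - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Toy data: the target depends on the square of the raw input.
raw_x = [-3, -2, -1, 0, 1, 2, 3]
target = [x * x for x in raw_x]

# Poorly suited feature: the raw input itself.
slope_raw, intercept_raw = fit_line(raw_x, target)
mse_raw = mean_squared_error(raw_x, target, slope_raw, intercept_raw)

# Better feature: the squared input, which the same simple model can use directly.
squared_x = [x * x for x in raw_x]
slope_sq, intercept_sq = fit_line(squared_x, target)
mse_sq = mean_squared_error(squared_x, target, slope_sq, intercept_sq)

print("MSE with raw feature:", mse_raw)
print("MSE with squared feature:", mse_sq)
```

The model class is identical in both fits; only the feature changes. That is the sense in which a good feature can stand in for extra model complexity.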