Chapter 1 - The Machine Learning Pipeline


“Before diving into feature engineering, let’s take a moment to take a look at the overall machine learning pipeline. This will help us get situated in the larger picture of the application.” (pg.1)


Data

  • Data are observations of real-world phenomena, such as
    • Daily stock prices
    • Heart rate
    • Customer purchases
  • However, data always contain some measurement noise and missing information
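
To make the point about noise and missing information concrete, here is a minimal sketch using only the Python standard library. The heart-rate readings, the use of `None` for a missing day, and the helper name `summarize` are all assumptions made up for illustration; they do not come from the book.

```python
from statistics import mean, median

# Hypothetical daily resting heart-rate readings (beats per minute).
# None marks a day with no recorded measurement; 180 stands in for
# a noisy reading (for example, a sensor glitch).
heart_rate = [62, 64, None, 61, 180, 63, None, 65]


def summarize(readings):
    """Count missing entries and summarize the observed values."""
    observed = [r for r in readings if r is not None]
    return {
        "n_total": len(readings),
        "n_missing": len(readings) - len(observed),
        "mean_observed": mean(observed),
        "median_observed": median(observed),
    }


summary = summarize(heart_rate)
print(summary)
```

In this made-up series the single noisy reading pulls the mean well above the median, which is one reason raw observations are usually inspected before features are built from them.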

Tasks

  • The machine learning workflow is a circuitous, multistage and iterative process
  • Models and features sit between raw data and the desired insights
Figure 1-2. The place of feature engineering in the machine learning workflow


Models

  • A model is essentially a formula that describes the relationships between different aspects of the data
    • For example, a model might use a company’s earnings history, past stock prices, and industry to predict its stock price.
  • As mathematical formulas relate numeric quantities to each other, numeric features must be created from raw data
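
The stock-price example can be sketched as a simple linear formula. This is only an illustration under stated assumptions: the industry list, the weights, the bias, and the function names `one_hot` and `predict_price` are hypothetical, and a linear form is just one possible choice of model. Note that the industry, being a category rather than a number, has to be turned into numeric features before the formula can use it.

```python
# Hypothetical set of industries; a real data set would define its own.
INDUSTRIES = ["tech", "energy", "retail"]


def one_hot(industry):
    """Encode a category as a list of 0/1 values, one per known industry."""
    return [1.0 if industry == name else 0.0 for name in INDUSTRIES]


def predict_price(earnings_per_share, last_price, industry, weights, bias):
    """Linear formula: bias plus a weighted sum of the numeric features."""
    features = [earnings_per_share, last_price] + one_hot(industry)
    return bias + sum(w * x for w, x in zip(weights, features))


# Made-up parameters; in practice they would be fitted to historical data.
weights = [2.0, 0.5, 1.5, -0.5, 0.0]
bias = 1.0

prediction = predict_price(3.0, 100.0, "tech", weights, bias)
print(prediction)
```

The formula itself only ever sees numbers: two measured quantities and three 0/1 indicators derived from the industry label.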

Features

  • A feature is a numeric representation of raw data
  • Feature engineering is the process of formulating the most appropriate features given the data, the model, and the task
  • Ideally, the feature set is neither too small nor too large, and every feature is relevant to the task and appropriate for the model that will be built
    • Bad features may require a much more complicated model to achieve the same performance as a model built with good features
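
The last point, that good features allow a simpler model, can be illustrated with a toy example. The points, labels, and function names below are invented for this sketch. The task is to decide whether a point lies inside the unit circle. No straight line in the raw (x, y) coordinates separates these two classes, because the outside points surround the inside ones, but a single threshold on the engineered feature x² + y² does.

```python
# Hypothetical labeled points: label 1 if inside the unit circle, else 0.
points = [
    (0.0, 0.0), (0.5, 0.5), (-0.5, 0.3),               # inside
    (2.0, 0.0), (0.0, -2.0), (-1.5, 1.5), (1.5, 1.5),  # outside
]
labels = [1, 1, 1, 0, 0, 0, 0]


def radius_squared(x, y):
    """Engineered feature: squared distance from the origin."""
    return x * x + y * y


def classify(x, y, threshold=1.0):
    """One-parameter model: a threshold on the engineered feature."""
    return 1 if radius_squared(x, y) < threshold else 0


predictions = [classify(x, y) for x, y in points]
accuracy = sum(p == t for p, t in zip(predictions, labels)) / len(labels)
print(accuracy)
```

With the raw coordinates, matching this result would require a more complicated, non-linear model; with the engineered feature, one comparison is enough.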

2019-01-17