Scikit-Learn Pipeline Basic ‘how-to’

One of the biggest advantages I found when using a Pipeline is that it cuts down the steps needed to pass your data through transformers, imputers, feature selectors, and even models. In a normal workflow, you would need to fit each method on the training set and then transform both the train and test sets, one method at a time. The Pipeline class lets you chain those steps together and fit and transform through all of them in one shot.

The steps are the same as for most other sklearn classes: after importing the class, you instantiate it. The difference here is that when you instantiate, you pass in whatever estimators or transformers you want to run your data through. To do this, you set it up as a list of tuples where the first element in each tuple is a string (the name of the step) and the second is the transformer or estimator you would like to use. The string can be almost anything, as long as each name is unique and does not contain a double underscore. I would recommend using the same names you would in your usual workflow.
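As a minimal sketch of that setup, here is a two-step Pipeline. The choice of SimpleImputer and StandardScaler and the step names "imputer" and "scaler" are my own assumptions for illustration; any transformers and names would work the same way.

```python
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Each step is a (name, transformer) tuple; the names are illustrative.
pipe = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
])

# The names double as handles for looking up a step later on.
step_names = [name for name, _ in pipe.steps]
print(step_names)
print(pipe.named_steps["scaler"])
```

The names matter more than they seem to at first, because they are how you refer to a step's parameters later.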

Keep in mind that order does matter here: when you run the pipeline, it manipulates the data in the order the steps are passed in. For example, if you want to impute missing values before scaling, you would list the imputer first in the Pipeline.

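Now you fit and transform as usual: call fit_transform on your training data, then call only transform on your test data, so the test set is processed with what was learned from the training set.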
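Here is a rough sketch of that fit and transform step. The data is randomly generated with some values knocked out, purely as a stand-in for a real dataset, and the variable names are hypothetical.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Made-up data for illustration: 100 rows, 3 features, ~10% missing values.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
X[rng.random(X.shape) < 0.1] = np.nan
y = rng.normal(size=100)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

pipe = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
])

# Fit every step on the training data and transform it in one call.
X_train_prepped = pipe.fit_transform(X_train)

# Only transform the test data, reusing the medians, means, and
# standard deviations learned from the training data.
X_test_prepped = pipe.transform(X_test)
```

Two calls replace what would otherwise be a fit_transform and a transform for each transformer.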

You can then take the transformed training and test sets and pass them to any sklearn regression class you would like.
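To sketch how that might look, the example below feeds the transformed data into LinearRegression. The synthetic data and its coefficients are invented for illustration, and LinearRegression is just one possible choice of regressor.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Made-up regression data: y depends linearly on X, then ~10% of X goes missing.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=200)
X[rng.random(X.shape) < 0.1] = np.nan

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

prep = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
])
X_train_prepped = prep.fit_transform(X_train)
X_test_prepped = prep.transform(X_test)

# The model is fit separately, on the output of the pipeline.
model = LinearRegression()
model.fit(X_train_prepped, y_train)

# For regressors, score() returns R-squared.
test_r2 = model.score(X_test_prepped, y_test)
print(test_r2)
```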

You can even include the model as the final step of your Pipeline, then use GridSearchCV to search over various parameters of the model with the goal of tuning it to improve performance.
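A minimal sketch of a Pipeline that ends in a model could look like this. The step name "model" and the synthetic data are assumptions made for the example.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Made-up regression data with ~10% missing values.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=200)
X[rng.random(X.shape) < 0.1] = np.nan

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# The model goes last; every step before it must be a transformer.
pipe = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
    ("model", LinearRegression()),
])

# fit() runs fit_transform on each transformer, then fits the model.
pipe.fit(X_train, y_train)

# predict() and score() push the raw test data through the same steps.
preds = pipe.predict(X_test)
test_r2 = pipe.score(X_test, y_test)
print(test_r2)
```

With the model inside the Pipeline, the raw train and test sets are all you need to keep track of.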

If you are not familiar with GridSearchCV, it fits a version of your model for every combination of the parameters you pass in, evaluates each one with cross-validation, and keeps the combination with the best score. If you do not specify a scoring metric, it uses the estimator's own score method, which for linear regression is r-squared.

The big difference in syntax when using Pipeline in conjunction with GridSearchCV is in the parameters you pass in to search over. This is where the reference name you used when instantiating the Pipeline comes into play. To avoid having an error kicked back, reference the name of the step, followed by two underscores, and finally the parameter. A really useful side effect is that you can check how different parameters of each estimator interact with each other on your data. If you notice trends, they can often provide insight that otherwise may have gone unnoticed.
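Here is a rough sketch of that naming convention. I swapped in Ridge as the model so there is a parameter (alpha) to tune; the step names, the grid values, and the synthetic data are all assumptions made for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Made-up regression data with ~10% missing values.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=200)
X[rng.random(X.shape) < 0.1] = np.nan

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

pipe = Pipeline([
    ("imputer", SimpleImputer()),
    ("scaler", StandardScaler()),
    ("model", Ridge()),
])

# Keys are <step name>__<parameter name>, using the names given above.
param_grid = {
    "imputer__strategy": ["mean", "median"],
    "model__alpha": [0.1, 1.0, 10.0],
}

# With no scoring argument, Ridge's own score method (R-squared) is used.
search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(X_train, y_train)

print(search.best_params_)
print(search.best_score_)

# One row per parameter combination, handy for spotting interactions.
for params, mean_score in zip(search.cv_results_["params"],
                              search.cv_results_["mean_test_score"]):
    print(params, round(mean_score, 4))
```

Because parameters from different steps sit in the same grid, every imputer strategy gets tried with every alpha, which is what lets you see how the two interact.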

References:

Hands-On Machine Learning, Aurélien Géron

Kevin Potter