Ch. 1 - Light My Fire: Starting To Use Spark With dplyr Syntax

Getting Started

Made for each other

Here be dragons

The connect-work-disconnect pattern

Copying data into Spark

Big data, tiny tibble

Exploring the structure of tibbles

Selecting columns

Filtering rows

Arranging rows

Mutating columns

Summarizing columns


Ch. 2 - Tools of the Trade: Advanced dplyr Usage

Levelling up

Mother’s little helper (1)

Mother’s little helper (2)

Selecting unique rows

Common people

Collecting data back from Spark

Storing intermediate results

Groups: great for music, great for data

Groups of mutants

Advanced Selection II: The SQL

Left joins

Anti joins

Semi joins


Ch. 3 - Going Native: Use The Native Interface to Manipulate Spark DataFrames

Two new interfaces

Popcorn double feature

Transforming continuous variables to logical

Transforming continuous variables into categorical (1)

Transforming continuous variables into categorical (2)

More than words: tokenization (1)

More than words: tokenization (2)

More than words: tokenization (3)

Sorting vs. arranging

Exploring Spark data types

Shrinking the data by sampling

Training/testing partitions


Ch. 4 - Case Study: Learning to be a Machine: Running Machine Learning Models on Spark

Machine Learning on Spark

Machine learning functions

(Hey you) What’s that sound?

Working with parquet files

Come together

Partitioning data with a group effect

Gradient boosted trees: modeling

Gradient boosted trees: prediction

Gradient boosted trees: visualization

Random Forest: modeling

Random Forest: prediction

Random Forest: visualization

Comparing model performance

An interview with Javier Luraschi and Kevin Ushey


About Michael Mallari

Michael is a hybrid thinker and doer, a byproduct of being a StrengthsFinder “Learner” over time. With 20+ years of engineering, design, and product experience, he helps organizations identify market needs, mobilize internal and external resources, and deliver delightful digital customer experiences that align with business goals. He has been entrusted with problem-solving for brands ranging from Fortune 500 companies to early-stage startups to not-for-profit organizations.

Michael earned his BS in Computer Science from New York Institute of Technology and his MBA from the University of Maryland, College Park. He is also a candidate to receive his MS in Applied Analytics from Columbia University.

LinkedIn | Twitter | michaelmallari.com