August 2, 2018

Overview

  • Analytics toolkits:
    • Business Intelligence
    • Data Analysis
    • Data Science
  • Statistical programming
    • Transparency
    • Reproducibility
    • Practical considerations
  • Take a break

Overview, part 2

  • Predictive modeling
    • What, how, why, who, when?
    • Basic algorithm types
    • Design, tuning, maintenance
  • Demonstration
    • Download R (optional)
    • Walk through a simple example or two

Part 1: The analytics toolkit

One possible interpretation of analytics roles and toolkits

Business Intelligence

Common tasks
  • Data retrieval, cleaning, and validation
  • Data visualization
  • Report generation
  • Dashboard creation
Toolkit
  • Excel
  • Tableau
  • Business Objects
  • SQL

Data Analysis

Common tasks
  • Data retrieval, cleaning, and validation
  • Data visualization
  • Exploratory data analysis
  • Report generation and dashboard creation
  • Statistical analysis and inference
Toolkit
  • Excel
  • SQL
  • Tableau/Business Objects
  • R/Python

Data Science

Common tasks
  • Data retrieval, cleaning, and validation
  • Data visualization
  • Exploratory data analysis
  • Feature engineering
  • Statistical analysis and inference
  • Predictive model design, creation, and evaluation
Toolkit
  • SQL
  • Tableau
  • R/Python
  • Big data tools (Hadoop, Hive, Spark, TensorFlow, etc.)

Pause to talk about specific tools

Because a hammer is a hammer, but we all know some hammers just feel better than others.

Statistical Programming

Why statistical programming is valuable

Transparency

transparent adj

  1. allowing light to pass through so that objects behind can be distinctly seen
  2. easy to perceive or detect
  3. open to scrutiny
  • Why is this important?
  • What are the downsides of lacking transparency?


  • Imagine some scenarios.

Reproducibility

Related to transparency, reproducibility is important for 2 reasons:

  1. It ensures that the same analysis on the same data yields the same results.
    • Even if it's done by someone else!
  2. By extension, it allows for the repeated use of a script for multiple analyses.


  • What are some dangers or downsides of not using reproducible methods?
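
To make point 1 concrete, here is a minimal sketch in Python (standard library only). The function name `bootstrap_mean`, the seed value, and the data are made up for illustration. The idea is that a script with a fixed random seed gives the same answer every time it is run, no matter who runs it.

```python
import random
import statistics


def bootstrap_mean(values, n_resamples=1000, seed=2018):
    """Average of the means of resamples drawn with replacement."""
    rng = random.Random(seed)  # private, seeded generator: same seed, same draws
    means = []
    for _ in range(n_resamples):
        resample = [rng.choice(values) for _ in values]
        means.append(statistics.mean(resample))
    return statistics.mean(means)


data = [12.1, 9.8, 11.4, 10.7, 13.2, 9.5]  # made-up numbers for illustration

first_run = bootstrap_mean(data)
second_run = bootstrap_mean(data)
print(first_run == second_run)  # True: same script, same data, same result
```

Because the whole analysis lives in one function, it can also be pointed at a different data set without redoing any steps by hand, which is point 2.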

Practical Considerations

There are also extremely practical reasons to use R.

Limitations
  • What are the limitations of Excel?
  • Of Tableau?
  • Of SQL?
  • Of R/Python?
Benefits not yet mentioned
  • Collaboration
  • Version control

Summary of pros and cons

Pros

  • Transparency
  • Reproducibility
  • Faster processing
  • Bigger data sets
  • Full suite of math and stat tools

Cons

  • Learning curve
  • Have to program everything
  • Need to translate work efforts to non-programming associates

Break

Predictive modeling

What is a predictive model?

predictive adj

  • relating to or having the effect of predicting an event or result

model noun

  • a system or thing used as an example to follow or imitate

Simply put, a predictive model is a system that tries to predict an outcome.

Predictive modeling, continued

  • What is the simplest predictive model you can think of?
  • What are some other types of predictive models?



Examples:

  • Use the average/most common value
  • Statistical methods (linear/logistic regression, clustering)
  • Machine Learning
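
As a rough illustration of the first two examples, the sketch below builds an "always predict the average" model and a one-variable linear regression by hand. The data (hours studied vs. exam score) and the function names are invented for this example; in practice you would likely use a library rather than writing the formula yourself.

```python
import statistics

# Made-up data for illustration: hours studied vs. exam score
hours = [1, 2, 3, 4, 5]
scores = [52, 58, 61, 70, 74]


def mean_baseline(y):
    """Simplest model: always predict the average of what we have seen."""
    avg = statistics.mean(y)
    return lambda new_x: avg


def simple_linear_regression(x, y):
    """Least-squares fit of y = intercept + slope * x for one predictor."""
    x_bar, y_bar = statistics.mean(x), statistics.mean(y)
    covariation = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    variation = sum((xi - x_bar) ** 2 for xi in x)
    slope = covariation / variation
    intercept = y_bar - slope * x_bar
    return lambda new_x: intercept + slope * new_x


baseline = mean_baseline(scores)
regression = simple_linear_regression(hours, scores)

# Predict the score for someone who studied 6 hours
print("baseline:  ", baseline(6))
print("regression:", round(regression(6), 1))
```

The baseline ignores the input entirely, which is exactly why it is useful: any fancier model should at least beat it.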

Predictive modeling, Machine Learning

Interviewer: What is your biggest strength?
Me: I'm an expert in machine learning.
Interviewer: What's 6+10?
Me: Zero.
Interviewer: That's not even close. It's 16.
Me: Okay, it's 16.
Interviewer: What is 10+20?
Me: It's 16.

  • Machine learning is just one kind of predictive modeling, but in the analytics world the two terms are often used interchangeably. Let's talk through the basic process.

Let's assume we have a perfectly clean and complete data set.

We separate the data set into 2-3 chunks:

  • Training data
  • Testing data
  • Validation data (sometimes optional)
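
One way the split might look in Python is sketched below. The 60/20/20 proportions, the seed, and the name `split_data` are assumptions chosen for illustration, not a recommendation; the right proportions depend on how much data you have.

```python
import random


def split_data(rows, train_frac=0.6, test_frac=0.2, seed=42):
    """Shuffle rows and split into training, testing, and validation chunks."""
    shuffled = list(rows)                  # copy so the caller's data is untouched
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible
    n_train = round(len(shuffled) * train_frac)
    n_test = round(len(shuffled) * test_frac)
    train = shuffled[:n_train]
    test = shuffled[n_train:n_train + n_test]
    validation = shuffled[n_train + n_test:]  # whatever is left over
    return train, test, validation


rows = list(range(100))  # stand-in for 100 records
train, test, validation = split_data(rows)
print(len(train), len(test), len(validation))
```

Shuffling before splitting matters when the rows arrive in a meaningful order (by date or by customer, for example). Setting `train_frac + test_frac` to 1.0 leaves the validation chunk empty, which covers the "sometimes optional" case.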

Building a model

What are some things to consider when building a predictive model?

We'll discuss, and I'll present a far-from-exhaustive list.

  • What kind of prediction are we making?
  • What is our tolerance for error?
  • How quickly do we need the model to perform?
  • How transparent/interpretable does the model need to be?
  • Will the model need to be changed often?
  • How big is the data?
  • What data types are we dealing with?
  • What did I miss??

Evaluating models

We have a model. Now what?


What makes a model "good"?


  • ROC, RMSE, AUC, Gini, CV, Confusion Matrix, and others.
  • Accuracy is important, but what else?
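
To ground a few of these terms, here is a small sketch that computes RMSE for a numeric prediction, then a confusion matrix and accuracy for a yes/no prediction. All of the numbers are made up, and the helper names are only for this example. The second half hints at why accuracy alone can mislead.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error: typical size of a numeric prediction error."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def confusion_matrix(actual, predicted):
    """Counts for a binary prediction (1 = positive, 0 = negative)."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for a, p in zip(actual, predicted):
        if p == 1:
            counts["tp" if a == 1 else "fp"] += 1
        else:
            counts["tn" if a == 0 else "fn"] += 1
    return counts


# Numeric example (made-up values)
print("RMSE:", round(rmse([3, 5, 7], [2, 5, 9]), 2))

# Binary example: only 2 of 10 cases are positive,
# and a lazy model predicts "negative" every time.
actual = [1, 0, 0, 0, 0, 0, 0, 0, 1, 0]
lazy_predictions = [0] * len(actual)

cm = confusion_matrix(actual, lazy_predictions)
accuracy = (cm["tp"] + cm["tn"]) / len(actual)
recall = cm["tp"] / (cm["tp"] + cm["fn"])  # share of true positives we caught

print(cm)
print("accuracy:", accuracy, "recall:", recall)
```

The lazy model is right 80% of the time and still never finds a single positive case, which is one answer to "what else?".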

Machine Learning flow chart

Model Maintenance

There are no hard-and-fast rules. Some things to think about:

  • How frequently the data updates
  • How often the model is used
  • Changes in requirements
  • Environment/conditions change
  • The newness of the model
  • Potential downsides/risks

Model Recalibration vs Model Rebuild??

Examples and/or questions

Depending on how much time is left