Introduction to R, or Why you should learn to stop worrying and love statistical programming.

August 2, 2018

Overview

Analytics toolkits:
- Business Intelligence
- Data Analysis
- Data Science
Statistical programming
- Transparency
- Reproducibility
- Practical considerations
Take a break

Overview, part 2

Predictive modeling
- What, how, why, who, when?
- Basic algorithm types
- Design, tuning, maintenance
Demonstration
- Download R (optional)
- Walk through a simple example or two

Part 1: The analytics toolkit

One possible interpretation of analytics roles and toolkits

Business Intelligence

Common tasks

Data retrieval, cleaning, and validation
Data visualization
Report generation
Dashboard creation

Toolkit

Excel
Tableau
Business Objects
SQL

Data Analysis

Common tasks

Data retrieval, cleaning, and validation
Data visualization
Exploratory data analysis
Report generation and dashboard creation
Statistical analysis and inference

Toolkit

Excel
SQL
Tableau/Business Objects
R/Python

Data Science

Common tasks

Data retrieval, cleaning, and validation
Data visualization
Exploratory data analysis
Feature engineering
Statistical analysis and inference
Predictive model design, creation, and evaluation

Toolkit

SQL
Tableau
R/Python
Big data tools (Hadoop, Hive, Spark, TensorFlow, etc.)

Pause to talk about specific tools

Because a hammer is a hammer, but we all know some hammers just feel better than others.

Statistical Programming

Why statistical programming is valuable

Transparency

transparent adj

~~allowing light to pass through so that objects behind can be distinctly seen~~
easy to perceive or detect
open to scrutiny

Why is this important?

What are the downsides of lacking transparency?

Imagine some scenarios.

Reproducibility

Related to transparency, reproducibility is important for 2 reasons:

It ensures that the same analysis on the same data yields the same results.
- Even if it's done by someone else!
By extension, it allows for the repeated use of a script for multiple analyses.
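
To make the first point concrete, here is a minimal sketch in base R. The function name, the seed value, and the simulated data are all made up for this illustration; the only idea is that a script with a fixed random seed gives identical numbers every time it is rerun, whoever runs it.

```r
# Hypothetical analysis: summarize a simulated sample.
# The seed (2018) and the sample size are arbitrary choices for illustration.
run_analysis <- function(seed = 2018) {
  set.seed(seed)                       # fix the random number generator
  x <- rnorm(100, mean = 50, sd = 10)  # simulated data stands in for a real data set
  c(mean = mean(x), sd = sd(x))
}

first_run  <- run_analysis()
second_run <- run_analysis()

# Same script, same seed: the two results match exactly.
identical(first_run, second_run)
```

Because the whole analysis lives in one function, pointing it at a different data set or seed is a one-line change, which is the "repeated use" benefit in practice.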

What are some dangers or downsides of not using reproducible methods?

Practical Considerations

There are also extremely practical reasons to use R.

Limitations

What are the limitations of Excel?
Of Tableau?
Of SQL?
Of R/Python?

Benefits not yet mentioned

Collaboration
Version control

Summary of pros and cons

Pros

Transparency
Reproducibility
Faster processing
Bigger data sets
Full suite of math and stat tools

Cons

Learning curve
Have to program everything
Need to translate your work for colleagues who don't program

Break

Predictive modeling

What is a predictive model?

predictive adj

relating to or having the effect of predicting an event or result

model noun

a system or thing used as an example to follow or imitate

Put simply, a predictive model is a system that tries to predict an outcome.

Predictive modeling, continued

What is the simplest predictive model you can think of?
What are some other types of predictive models?

Examples:

Use the average/most common value
Statistical methods (linear/logistic regression, clustering)
Machine Learning
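
The first two examples can be sketched in a few lines of base R. This is only an illustration: the choice of the built-in `mtcars` data set, the `mpg ~ wt` formula, and the helper name `rmse` are assumptions made for the sketch, not part of the talk's material.

```r
# mtcars ships with base R, so nothing needs to be downloaded.
data(mtcars)

# Simplest model: always predict the average fuel economy.
baseline_prediction <- mean(mtcars$mpg)

# A simple statistical model: linear regression of mpg on car weight.
fit <- lm(mpg ~ wt, data = mtcars)

# Root mean squared error: how far off the predictions are, on average.
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

rmse_baseline <- rmse(mtcars$mpg, baseline_prediction)
rmse_linear   <- rmse(mtcars$mpg, predict(fit, newdata = mtcars))

rmse_baseline
rmse_linear
```

Both errors are measured on the same rows used to fit the models, which flatters them. That is one reason to hold some data back for testing.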

Predictive modeling, Machine Learning

Interviewer: What is your biggest strength?
Me: I'm an expert in machine learning.
Interviewer: What's 6+10?
Me: Zero.
Interviewer: That's not even close. It's 16.
Me: Okay, it's 16.
Interviewer: What is 10+20?
Me: It's 16.

Machine learning is just one kind of predictive modeling, but in the analytics world the two terms are often used interchangeably. Let's talk through the basic process.

Let's assume we have a perfectly clean and complete data set.

We separate the data set into 2-3 chunks:

- Training data
- Testing data
- Validation data (sometimes optional)
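
One way to do the split, sketched in base R. The 60/20/20 proportions, the seed, and the use of the built-in `mtcars` data are assumptions for illustration; there is no single correct ratio.

```r
set.seed(42)  # arbitrary seed so the split is reproducible

data(mtcars)
n <- nrow(mtcars)

# Shuffle the row numbers, then cut them into three groups.
shuffled <- sample(n)
n_train  <- floor(0.6 * n)
n_test   <- floor(0.2 * n)

train_rows      <- shuffled[1:n_train]
test_rows       <- shuffled[(n_train + 1):(n_train + n_test)]
validation_rows <- shuffled[(n_train + n_test + 1):n]

training   <- mtcars[train_rows, ]
testing    <- mtcars[test_rows, ]
validation <- mtcars[validation_rows, ]

# Every row lands in exactly one chunk.
c(training = nrow(training), testing = nrow(testing), validation = nrow(validation))
```

Shuffling before cutting matters: if the rows are sorted (by date, by customer, by anything), taking the first 60% as-is would give the model a lopsided view of the data.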

Building a model

What are some things to consider when building a predictive model?

We'll discuss, and I'll present a far-from-exhaustive list.

What kind of prediction are we making?
What is our tolerance for error?
How quickly do we need the model to perform?
How transparent/interpretable does the model need to be?
Will the model need to be changed often?
How big is the data?
What data types are we dealing with?
What did I miss??

Evaluating models

We have a model. Now what?

What makes a model "good"?

ROC, RMSE, AUC, Gini, CV, Confusion Matrix, and others.
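
Two of these measures are easy to compute by hand, which helps demystify them. The sketch below uses base R and entirely made-up predictions; it shows a confusion matrix and accuracy for a yes/no prediction, and RMSE for a numeric one.

```r
# Made-up outcomes for ten cases (1 = event happened, 0 = it did not).
actual    <- c(1, 0, 1, 1, 0, 0, 1, 0, 1, 0)
predicted <- c(1, 0, 0, 1, 0, 1, 1, 0, 1, 0)

# Confusion matrix: rows are predictions, columns are what really happened.
confusion <- table(predicted = predicted, actual = actual)
confusion

# Accuracy: share of cases the model got right (8 of 10 here).
accuracy <- mean(predicted == actual)

# RMSE applies to numeric predictions; the values are made up again.
actual_values    <- c(10, 12, 15, 20)
predicted_values <- c(11, 12, 13, 22)
rmse <- sqrt(mean((actual_values - predicted_values)^2))  # 1.5
```

The confusion matrix shows what accuracy hides: the two mistakes here are of different kinds (one missed event, one false alarm), and in most business settings those do not cost the same.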

Accuracy is important, but what else?

Machine Learning flow chart

Model Maintenance

There are no hard-and-fast rules. Some things to think about:

How frequently the data updates
How often the model is used
Changes in requirements
Environment/conditions change
The newness of the model
Potential downsides/risks

Model Recalibration vs Model Rebuild??

Examples and/or questions

Depending on how much time is left