What is it?

"Statistical learning refers to a set of tools for modeling and understanding complex datasets." (An Introduction to Statistical Learning)

There is a lot of hype around it ("big data", "neural networks", "regular statistics is dead" and other BS). For a more balanced view (yes, you need to understand math, statistics and scientific research methods to do statistical/machine learning), see: 50 Years of Data Science

Visual intro

Why do we need it?

  • Lots of data (IoT)
  • Speed (Twitter trends)
  • People have higher demands (Amazon)
  • Intuition fails

Intuition fails - Monty Hall problem
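
The Monty Hall claim (switching doors wins about 2/3 of the time) is easy to check by simulation. The sketch below assumes the standard rules: three doors, one car, and a host who always opens a door that hides no car and is not the player's pick. The function names are illustrative only.

```python
import random

def play_monty_hall(switch, rng):
    """Play one round; return True if the player ends up with the car."""
    doors = [0, 1, 2]
    car = rng.choice(doors)
    pick = rng.choice(doors)
    # The host opens a door that is neither the player's pick nor the car.
    opened = rng.choice([d for d in doors if d != pick and d != car])
    if switch:
        # Switch to the only door that is still closed and not picked.
        pick = next(d for d in doors if d != pick and d != opened)
    return pick == car

def win_rate(switch, rounds=100_000, seed=0):
    """Fraction of rounds won when always switching (or always staying)."""
    rng = random.Random(seed)
    wins = sum(play_monty_hall(switch, rng) for _ in range(rounds))
    return wins / rounds

print("stay:  ", win_rate(switch=False))  # close to 1/3
print("switch:", win_rate(switch=True))   # close to 2/3
```

Most people expect 50/50 after the host opens a door; the simulation shows why counting outcomes beats intuition here.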

How to make models?

First you need:

  • data
  • metrics
  • algorithm(s)
  • machines that can process it

Data

  • Know your data! (How was it gathered? What are its restrictions? Example: why does it matter?)
  • Are the variables numeric, text, or factors?
  • Clean the data (this can take up to 80-90% of the time)

Metrics

  • Set up metrics to evaluate how good your model is (accuracy, recall, precision, ROC curve, etc.)
  • The right metric depends on the problem (what kinds of mistakes are allowed, and what do they cost?)
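
As a rough illustration of why the choice of metric matters, here is a minimal sketch that computes accuracy, precision and recall from raw predictions for a two-class problem. The patient numbers are made up for the example, and returning 0.0 when a ratio is undefined is a convention chosen here, not a standard.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for a binary problem."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        else:
            if truth == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, tn, fn

def accuracy(tp, fp, tn, fn):
    """Share of all predictions that were correct."""
    return (tp + tn) / (tp + fp + tn + fn)

def precision(tp, fp, tn, fn):
    """Of everything predicted positive, how much really was positive."""
    # Assumption: report 0.0 when nothing was predicted positive.
    return tp / (tp + fp) if (tp + fp) else 0.0

def recall(tp, fp, tn, fn):
    """Of everything really positive, how much the model found."""
    return tp / (tp + fn) if (tp + fn) else 0.0

# Made-up data: 1000 patients, 10 are sick (1); the model says "healthy" (0) for all.
y_true = [1] * 10 + [0] * 990
y_pred = [0] * 1000
counts = confusion_counts(y_true, y_pred)

print(accuracy(*counts))  # 0.99
print(recall(*counts))    # 0.0
```

The model looks excellent on accuracy while finding none of the sick patients, which is why the cost of each kind of mistake should drive the metric.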

Algorithm(s)

  • What kind of problem is it?
    • Classification (spam/not spam, healthy/sick; e.g. Naive Bayes)
    • Clustering (which people have similar shopping patterns? e.g. k-means clustering)
    • Regression (what is the price of a house in a specific area of NY? e.g. linear regression)
    • Dimensionality reduction (e.g. Principal Component Analysis)
  • Choose one you know. There are too many algorithms to know them all
  • Help: An Introduction to Statistical Learning, The Elements of Statistical Learning
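
To make one of these concrete, here is a minimal sketch of simple linear regression (one predictor, ordinary least squares) in plain Python. The house sizes and prices are made-up numbers chosen to lie exactly on a line; real data would be noisy, and in practice you would use a library rather than this hand-rolled version.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = intercept + slope * x for one predictor."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # slope = covariance(x, y) / variance(x)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    slope = cov_xy / var_x
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Made-up example: size in square metres, price in thousands.
sizes = [50, 70, 90, 110]
prices = [110, 150, 190, 230]

intercept, slope = fit_line(sizes, prices)
print(intercept, slope)         # 10.0 2.0
print(intercept + slope * 100)  # predicted price for a size of 100: 210.0
```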

Evaluate your model

  • Keep one set of data aside as test data (to test your model)
  • The data you use for modelling is the training data
  • Build the model on the training data, then evaluate it on the test data (for example, 70% of the initial dataset for training, 30% for testing)
  • Cross validation
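
A 70/30 split can be sketched in a few lines of plain Python. This is a minimal version that assumes the rows can be shuffled freely (no time ordering, no groups that must stay together); the function name and defaults are illustrative.

```python
import random

def train_test_split(rows, test_fraction=0.3, seed=42):
    """Shuffle the rows and return (train, test) with no overlap."""
    rows = list(rows)                  # copy, so the caller's data is untouched
    random.Random(seed).shuffle(rows)  # fixed seed makes the split repeatable
    n_test = round(len(rows) * test_fraction)
    return rows[n_test:], rows[:n_test]

data = list(range(100))  # stand-in for 100 observations
train, test = train_test_split(data)
print(len(train), len(test))  # 70 30
```

Shuffling before splitting matters: if the file is sorted (by date, by class), taking the last 30% as test data would give a misleading evaluation.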

Cross validation

  • Split the data into 10 parts; train the model on parts 2-10, test on part 1
  • Next loop: train on parts 1 and 3-10, test on part 2
  • etc. (do it 10 times, so each part is used for testing once)
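
The loop above can be sketched as a generator that yields, for each round, which rows to train on and which to test on. This is a minimal version that splits in file order; in practice the data is usually shuffled first. The function name is illustrative.

```python
def k_fold_indices(n_samples, k=10):
    """Yield (train_indices, test_indices) once for each of the k folds."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size

folds = list(k_fold_indices(20, k=10))
first_train, first_test = folds[0]
print(len(folds))   # 10
print(first_test)   # [0, 1]
```

For each fold you would fit the model on the training indices, score it on the test indices, and report the average of the 10 scores.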

Dangers

  • Accuracy 99%, is it good?
  • Overfitting (the model is not learning but memorizing)
  • Bias-variance tradeoff

Bias-variance tradeoff

Overfitting

Which model is most useful? Why?
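
One way to see "memorizing instead of learning" is to compare a model that stores every training point with a simple one-rule model. The sketch below uses made-up noisy data (the true rule is x > 0.5, with 20% of labels flipped at random) and a 1-nearest-neighbour "memorizer"; the threshold model is given the true rule as a stand-in for a simple model. All names and numbers are assumptions for illustration.

```python
import random

def make_data(n, rng, noise=0.2):
    """Made-up data: label is 1 when x > 0.5, then flipped with probability `noise`."""
    data = []
    for _ in range(n):
        x = rng.random()
        label = int(x > 0.5)
        if rng.random() < noise:
            label = 1 - label
        data.append((x, label))
    return data

def threshold_model(x):
    """Simple model: a single rule, no memory of individual points."""
    return int(x > 0.5)

def make_memorizer(train):
    """1-nearest-neighbour: answer with the label of the closest training point."""
    def predict(x):
        nearest = min(train, key=lambda point: abs(point[0] - x))
        return nearest[1]
    return predict

def accuracy(model, data):
    return sum(model(x) == label for x, label in data) / len(data)

rng = random.Random(1)
train = make_data(200, rng)
test = make_data(500, rng)
memorizer = make_memorizer(train)

print("memorizer train/test:", accuracy(memorizer, train), accuracy(memorizer, test))
print("threshold train/test:", accuracy(threshold_model, train), accuracy(threshold_model, test))
```

The memorizer is perfect on the data it has seen, because it has copied the noise too; the gap between its training and test accuracy is the symptom of overfitting.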

Machines

  • Cloud-based solutions (for example Azure ML)
  • Big data: one computer (or even a bunch of them) can't handle it
  • Hadoop

How to start

  • Kaggle
    • Titanic competition
    • Tutorial
    • Visualize the data from different perspectives: what might be a good predictor of whether a person died or survived?
    • Do feature engineering: create new variables based on existing ones
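
As a small illustration of feature engineering, the sketch below derives three new variables from existing ones. It assumes the column names of the Kaggle Titanic training file (Name, SibSp, Parch) and names written as "Surname, Title. First name"; the two passengers are invented for the example, and the new feature names are only suggestions.

```python
def add_features(passenger):
    """Return a copy of the row with derived variables added."""
    row = dict(passenger)
    # Family size: siblings/spouses + parents/children + the passenger.
    row["FamilySize"] = row["SibSp"] + row["Parch"] + 1
    row["IsAlone"] = int(row["FamilySize"] == 1)
    # Title: the text between the comma and the first period in the name.
    row["Title"] = row["Name"].split(",")[1].split(".")[0].strip()
    return row

# Invented rows, not taken from the real dataset.
passengers = [
    {"Name": "Doe, Mr. John", "SibSp": 0, "Parch": 0},
    {"Name": "Roe, Mrs. Jane", "SibSp": 1, "Parch": 2},
]

engineered = [add_features(p) for p in passengers]
for row in engineered:
    print(row["Title"], row["FamilySize"], row["IsAlone"])
# Mr 1 1
# Mrs 4 0
```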