Predictive Analytics Intro

ML applications

What is ML

  • ML is a ‘suitcase word’
  • ML enables machines to behave more like humans
  • Turing Test was an early test for AI
  • Technical aspects of ML: Autonomy, Adaptivity
  • Don't use ML when there are clear rules

ML vs. traditional programming

ML terminology

  • Attribute/Feature: A quantity describing an instance
  • Ground truth: The true label
  • Training Data: Input data (features) associated with labels
  • Test Data: Input data (features) with the labels held out; does not overlap with the training data (see the split sketch after this list)
  • Learning Algorithm: Given training data, it produces a ML model
  • Model: Automatically generated program based on training data
  • Prediction: Output of a model given input data
  • Accuracy: The rate of correct predictions made by the model over a data set (cf. coverage). Accuracy is usually estimated on an independent test set that was not used at any time during the learning process.
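
A minimal R sketch of the training/test split idea (the iris data set and the 70/30 split are illustrative choices, not part of the demos below):

# split a data set into non-overlapping training and test sets
set.seed(1)                             # for reproducibility
n = nrow(iris)                          # iris ships with base R
trainindex = sample(1:n, size = 0.7*n)  # 70% of rows go to training
trainingdata = iris[trainindex, ]       # features plus labels (ground truth)
testdata = iris[-trainindex, ]          # held-out rows; labels are hidden at prediction time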

ML business considerations

  • Volume
    • e.g. speech recognition, handwriting recognition, image classification
  • Variety
    • e.g., spam & fraud detection, translation
  • Velocity
    • e.g., automated driving, fraud detection (updating models dynamically)

ML algorithms

  • Supervised Learning: Use labels to train a model on features
  • Unsupervised Learning: Find hidden patterns when there is no labeled data
  • Reinforcement Learning: Take actions based on policies and rewards, with some delayed feedback

ML quiz

Which of these uses ML?

  • Self-driving car
  • Content recommendation
  • A spreadsheet formula
  • Big data processing
  • Product pricing
  • Image recognition
  • Text to speech
  • Advertising auction
  • Fraudulent credit card transaction detection

Comparing ML models

  • Is one ML model better than another? Is Google's search model better than IBM Watson's chess-playing model?
  • Unlike temperature, intelligence is not a single dimension. You can compare today's temperature to yesterday's, or the temperature in Helsinki to that in Rome, and tell which one is higher and which is lower. We even have a tendency to think that it is possible to rank people with respect to their intelligence – that's what the intelligence quotient (IQ) is supposed to do. Is a chess-playing algorithm more intelligent than a spam filter, or is a music recommendation system more intelligent than a self-driving car? These questions make no sense, because being able to solve one problem tells us nothing about the ability to solve another problem.

ML considerations

  • What is the problem we are trying to solve with the ML model?
  • What business metrics will we use to evaluate the model's performance?
  • Do we have good quality training data?
  • Do we have good quality test data?
  • What features should we use?

ML model performance

Evaluating a model's accuracy is about reducing false positives (FP) and false negatives (FN):

  • Precision (reduce FP) = tp / (tp + fp). Best case precision is 1.
  • Recall (reduce FN) = tp / (tp + fn). Best case recall is 1.
  • Accuracy = # corrects / total = (tp + tn) / (tp+fp+fn+tn).
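
A small R sketch computing these metrics from confusion-matrix counts (the tp/fp/fn/tn values are made up for illustration):

# hypothetical confusion-matrix counts
tp = 80; fp = 10; fn = 20; tn = 90
precision = tp / (tp + fp)                   # 0.889; penalizes false positives
recall = tp / (tp + fn)                      # 0.8; penalizes false negatives
accuracy = (tp + tn) / (tp + fp + fn + tn)   # 0.85; overall rate of correct predictions
c(precision = precision, recall = recall, accuracy = accuracy)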

ML model demo

Linear Regression

  • Numeric features (x's) predict a numeric label (y).
  • Try to fit a straight line and derive coefficients and an intercept, e.g. y = β0 + β1x1 + β2x2 + β3x3 + β4x4 + … (a multi-feature sketch follows this list).
  • Once we have the model equation, we can use it to predict new y values.
  • Terminology
    • y = label
    • x1,x2,… = features
    • β0,β1,… coefficients.
  • Correlation is typically used in statistics, whereas regression is typically used in ML.
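
A minimal sketch of a multi-feature fit with lm() (the mtcars data set and the chosen features are illustrative, not part of the demo below):

# y = β0 + β1*x1 + β2*x2, with mpg as the label and wt, hp as features
multiModel = lm(mpg ~ wt + hp, data = mtcars)     # mtcars ships with base R
coef(multiModel)                                  # intercept β0 and coefficients β1, β2
predict(multiModel, data.frame(wt = 3, hp = 110)) # predict a new y value from new features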

Extrapolation

Extrapolation is the use of a regression line for prediction far outside the domain of values of the feature variable x that you use to predict y. This can be dangerous!
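
A quick R sketch of the danger, using invented data where the true relationship is quadratic but a straight line is fitted over a narrow range of x:

# fit a straight line on x in [1, 10]; the true relationship is y = x^2
x = 1:10
y = x^2 + rnorm(10, sd = 2)
narrowModel = lm(y ~ x)
predict(narrowModel, data.frame(x = 10))   # inside the training range: reasonable
predict(narrowModel, data.frame(x = 100))  # far outside: severely underestimates 100^2 = 10000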

Central Limit Theorem

If we draw several independent random samples from a population, each of a fixed large sample size, and plot the distribution of their means (or some other point estimate), this sampling distribution of means will approach a normal distribution.

The CLT helps us address sampling variability by constructing a sampling distribution centered around our sample mean, usually using a 95% confidence interval. This is because if we picked another sample, there is a 95% likelihood that the new sample mean's 95% range would contain the population mean.

Estimated 95% range is: x̄ ± 1.96*se.

  • x̄ = mean of the sample.
  • se of the sample = sd / √(sample size).

Proving Central Limit Theorem

library(dplyr)
# proving CLT
population = sample(1:100,size=10000,replace=T) # let's say this is our population; sample() is just generating it here, not taking a statistical sample
populationmean = mean(population)
samplingdistribution = NULL # sampling distribution of means of samples
numberofsamples = 100 # number of samples or sample means in this distribution
lowrange = highrange = NULL # bounds of each sample's 95% confidence interval
samplesize = 900 # less than 10% of the population but > 30 (independence and normality conditions)

for (i in 1:numberofsamples) # run experiment many times
{
  y = sample(population,size=samplesize,replace = T)
  samplingdistribution = append(samplingdistribution,mean(y))

  # check if 95% confidence interval of this sample contains population mean
  lowrange = append(lowrange,psych::describe(y)$mean-1.96*psych::describe(y)$se)
  highrange = append(highrange,psych::describe(y)$mean+1.96*psych::describe(y)$se)
}

# roughly 95% of the 95% ranges should contain the population mean, consistent with the CLT
allranges = data.frame(samplingdistribution, lowrange, highrange, populationmean)
head(allranges) 
  samplingdistribution lowrange highrange populationmean
1             49.90667 48.01250  51.80083        50.3981
2             47.72667 45.81388  49.63945        50.3981
3             51.09111 49.21649  52.96573        50.3981
4             49.58778 47.66128  51.51427        50.3981
5             51.74000 49.83871  53.64129        50.3981
6             50.31000 48.42538  52.19462        50.3981
withinrange = allranges %>% filter( (lowrange <= populationmean) & (populationmean <= highrange))
head(withinrange) 
  samplingdistribution lowrange highrange populationmean
1             49.90667 48.01250  51.80083        50.3981
2             51.09111 49.21649  52.96573        50.3981
3             49.58778 47.66128  51.51427        50.3981
4             51.74000 49.83871  53.64129        50.3981
5             50.31000 48.42538  52.19462        50.3981
6             49.57556 47.74120  51.40991        50.3981
nrow(withinrange)/nrow(allranges) * 100
[1] 98
# the sampling distribution should be approximately normal, as the CLT predicts
samplingdistribution = as.data.frame(samplingdistribution)
head(samplingdistribution)
  samplingdistribution
1             49.90667
2             47.72667
3             51.09111
4             49.58778
5             51.74000
6             50.31000
population = as.data.frame(population)
head(population)
  population
1         93
2         57
3          4
4         35
5         90
6         95

Proving Central Limit Theorem

[Plots of the sampling distribution of means and of the population]

Linear Regression

library(ggplot2)
# linear regression
# y = ax + intercept
x = c(1,2,3,4,5) # training data
y = c(2,4,6,8,10) # training data
lrModel = lm(y~x)
print(lrModel)

Call:
lm(formula = y ~ x)

Coefficients:
(Intercept)            x  
  2.383e-15    2.000e+00  
df = data.frame(x,y)
df
  x  y
1 1  2
2 2  4
3 3  6
4 4  8
5 5 10
ggplot(df,aes(x=x,y=y)) + geom_point(color="blue") + geom_smooth (model="lm",formula=y~x) 

[Plot: training points with the fitted regression line]

Polynomial Regression

library(ggplot2)
# polynomial regression
# y = a*x^3 + intercept
x = c(1,2,3,4,5) # training data
y = c(1,8,27,64,125) # training data
lrModel = lm(y~I(x^3))
print(lrModel)

Call:
lm(formula = y ~ I(x^3))

Coefficients:
(Intercept)       I(x^3)  
          0            1  
df = data.frame(x,y)
ggplot(df,aes(x=x,y=y)) + geom_point(color="blue") +  geom_smooth (model="lm",formula=y~I(x^3)) 

[Plot: training points with the fitted cubic curve]

predict(lrModel,data.frame(x=c(6,7,8,9,10))) # predict on test data
   1    2    3    4    5 
 216  343  512  729 1000