Predictive Analytics Intro

ML applications

What is ML

  • ML is a ‘suitcase word’
  • ML enables machines to behave more like humans
  • Turing Test was an early test for AI
  • Technical aspects of ML: Autonomy, Adaptivity
  • Don't use ML when there are clear rules

ML vs. traditional programming

ML terminology

  • Attribute/Feature: A quantity describing an instance
  • Ground truth: The true label
  • Training Data: Input data (features) associated with labels
  • Test Data: Input data (features) with the labels held out; does not overlap with the training data (see the split sketch after this list)
  • Learning Algorithm: Given training data, it produces a ML model
  • Model: Automatically generated program based on training data
  • Prediction: Output of a model given input data
  • Accuracy: The rate of correct predictions made by the model over a data set (cf. coverage). Accuracy is usually estimated on an independent test set that was not used at any time during the learning process.
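
A minimal R sketch of the training/test split idea (the iris data set and the 70/30 split are illustrative choices, not part of the demos below):

# split a data set into non-overlapping training and test sets
set.seed(1)                             # for reproducibility
n = nrow(iris)                          # iris ships with base R
trainindex = sample(1:n, size = 0.7*n)  # 70% of rows go to training
trainingdata = iris[trainindex, ]       # features plus labels (ground truth)
testdata = iris[-trainindex, ]          # held-out rows; labels are hidden at prediction time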

ML business considerations

  • Volume
    • e.g. speech recognition, handwriting recognition, image classification
  • Variety
    • e.g., spam & fraud detection, translation
  • Velocity
    • e.g., automated driving, fraud detection (updating models dynamically)

ML algorithms

  • Supervised Learning: Use labels to train a model on features
  • Unsupervised Learning: Find hidden patterns when there is no labeled data
  • Reinforcement Learning: Take actions based on policies and rewards, with some delayed feedback

ML quiz

Which of these uses ML?

  • Self-driving car
  • Content recommendation
  • A spreadsheet formula
  • Big data processing
  • Product pricing
  • Image recognition
  • Text to speech
  • Advertising auction
  • Fraudulent credit card transaction detection

Comparing ML models

  • Is one ML model better than another? Is Google's search model better than IBM Watson's chess-playing model?
  • Unlike temperature, intelligence is not a single dimension. You can compare today's temperature to yesterday's, or the temperature in Helsinki to that in Rome, and tell which one is higher and which is lower. We even have a tendency to think that it is possible to rank people with respect to their intelligence – that's what the intelligence quotient (IQ) is supposed to do. Is a chess-playing algorithm more intelligent than a spam filter, or is a music recommendation system more intelligent than a self-driving car? These questions make no sense, because being able to solve one problem tells us nothing about the ability to solve another problem.

ML considerations

  • What is the problem we are trying to solve with the ML model?
  • What business metrics will we use to evaluate the model's performance?
  • Do we have good quality training data?
  • Do we have good quality test data?
  • What features should we use?

ML model performance

Evaluating a model's accuracy is about reducing false positives (FP) and false negatives (FN):

  • Precision (reduce FP) = tp / (tp + fp). Best case precision is 1.
  • Recall (reduce FN) = tp / (tp + fn). Best case recall is 1.
  • Accuracy = # corrects / total = (tp + tn) / (tp+fp+fn+tn).
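
A small R sketch computing these metrics from confusion-matrix counts (the tp/fp/fn/tn values are made up for illustration):

# hypothetical confusion-matrix counts
tp = 80; fp = 10; fn = 20; tn = 90
precision = tp / (tp + fp)                   # 0.889; penalizes false positives
recall = tp / (tp + fn)                      # 0.8; penalizes false negatives
accuracy = (tp + tn) / (tp + fp + fn + tn)   # 0.85; overall rate of correct predictions
c(precision = precision, recall = recall, accuracy = accuracy)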

ML model demo

Linear Regression

  • Numeric features (x's) predict a numeric label (y).
  • Try to fit a straight line and derive coefficients and an intercept, e.g. y = β0 + β1x1 + β2x2 + β3x3 + β4x4 + … (a multi-feature sketch follows this list).
  • Once we have the model equation, we can use it to predict new y values.
  • Terminology
    • y = label
    • x1,x2,… = features
    • β0,β1,… coefficients.
  • Correlation is typically used in statistics, whereas regression is typically used in ML.
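
A minimal sketch of a multi-feature fit with lm() (the mtcars data set and the chosen features are illustrative, not part of the demo below):

# y = β0 + β1*x1 + β2*x2, with mpg as the label and wt, hp as features
multiModel = lm(mpg ~ wt + hp, data = mtcars)     # mtcars ships with base R
coef(multiModel)                                  # intercept β0 and coefficients β1, β2
predict(multiModel, data.frame(wt = 3, hp = 110)) # predict a new y value from new features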

Extrapolation

Extrapolation is the use of a regression line for prediction far outside the domain of values of the feature variable x that you use to predict y. This can be dangerous!
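
A quick R sketch of the danger, using invented data where the true relationship is quadratic but a straight line is fitted over a narrow range of x:

# fit a straight line on x in [1, 10]; the true relationship is y = x^2
x = 1:10
y = x^2 + rnorm(10, sd = 2)
narrowModel = lm(y ~ x)
predict(narrowModel, data.frame(x = 10))   # inside the training range: reasonable
predict(narrowModel, data.frame(x = 100))  # far outside: severely underestimates 100^2 = 10000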

Central Limit Theorem

If we draw several independent random samples from a population, each of a fixed large sample size, and plot the distribution of their means (or some other point estimate), this sampling distribution of means will approach a normal distribution.

The CLT helps us address sampling variability by constructing a sampling distribution centered around our sample mean, usually using a 95% confidence interval. This is because if we picked another sample, there is a 95% likelihood that the new sample mean's 95% range would contain the population mean.

Estimated 95% range is: x̄ ± 1.96*se.

  • x̄ = mean of the sample.
  • se of the sample = sd / √(sample size).

Proving Central Limit Theorem

library(dplyr)
# proving CLT
population = sample(1:100,size=10000,replace=T) # let's say this is our population; sample() is just generating it here, not taking a statistical sample
populationmean = mean(population)
samplingdistribution = NULL # sampling distribution of means of samples
numberofsamples = 100 # number of samples or sample means in this distribution
lowrange = highrange = NULL # bounds of each sample's 95% confidence interval
samplesize = 900 # less than 10% of the population but > 30 (independence and normality conditions)

for (i in 1:numberofsamples) # run experiment many times
{
  y = sample(population,size=samplesize,replace = T)
  samplingdistribution = append(samplingdistribution,mean(y))

  # check if 95% confidence interval of this sample contains population mean
  lowrange = append(lowrange,psych::describe(y)$mean-1.96*psych::describe(y)$se)
  highrange = append(highrange,psych::describe(y)$mean+1.96*psych::describe(y)$se)
}

# roughly 95% of the 95% ranges should contain the population mean, consistent with the CLT
allranges = data.frame(samplingdistribution, lowrange, highrange, populationmean)
head(allranges) 
  samplingdistribution lowrange highrange populationmean
1             49.90667 48.01250  51.80083        50.3981
2             47.72667 45.81388  49.63945        50.3981
3             51.09111 49.21649  52.96573        50.3981
4             49.58778 47.66128  51.51427        50.3981
5             51.74000 49.83871  53.64129        50.3981
6             50.31000 48.42538  52.19462        50.3981
withinrange = allranges %>% filter( (lowrange <= populationmean) & (populationmean <= highrange))
head(withinrange) 
  samplingdistribution lowrange highrange populationmean
1             49.90667 48.01250  51.80083        50.3981
2             51.09111 49.21649  52.96573        50.3981
3             49.58778 47.66128  51.51427        50.3981
4             51.74000 49.83871  53.64129        50.3981
5             50.31000 48.42538  52.19462        50.3981
6             49.57556 47.74120  51.40991        50.3981
nrow(withinrange)/nrow(allranges) * 100
[1] 98
# the sampling distribution should be approximately normal, as the CLT predicts
samplingdistribution = as.data.frame(samplingdistribution)
head(samplingdistribution)
  samplingdistribution
1             49.90667
2             47.72667
3             51.09111
4             49.58778
5             51.74000
6             50.31000
population = as.data.frame(population)
head(population)
  population
1         93
2         57
3          4
4         35
5         90
6         95

Proving Central Limit Theorem

[Plots of the sampling distribution of means and of the population]

Linear Regression

library(ggplot2)
# linear regression
# y = ax + intercept
x = c(1,2,3,4,5) # training data
y = c(2,4,6,8,10) # training data
lrModel = lm(y~x)
print(lrModel)

Call:
lm(formula = y ~ x)

Coefficients:
(Intercept)            x  
  2.383e-15    2.000e+00  
df = data.frame(x,y)
df
  x  y
1 1  2
2 2  4
3 3  6
4 4  8
5 5 10
ggplot(df,aes(x=x,y=y)) + geom_point(color="blue") + geom_smooth (model="lm",formula=y~x) 

[Plot: training points with the fitted regression line]

Polynomial Regression

library(ggplot2)
# polynomial regression
# y = a*x^3 + intercept
x = c(1,2,3,4,5) # training data
y = c(1,8,27,64,125) # training data
lrModel = lm(y~I(x^3))
print(lrModel)

Call:
lm(formula = y ~ I(x^3))

Coefficients:
(Intercept)       I(x^3)  
          0            1  
df = data.frame(x,y)
ggplot(df,aes(x=x,y=y)) + geom_point(color="blue") +  geom_smooth (model="lm",formula=y~I(x^3)) 

[Plot: training points with the fitted cubic curve]

predict(lrModel,data.frame(x=c(6,7,8,9,10))) # predict on test data
   1    2    3    4    5 
 216  343  512  729 1000