Predictive Analytics: Introduction and Overview

Rasim Muzaffer Musal

Elements of Statistical Learning Notes

These lecture notes are based on Chapter 2 of “The Elements of Statistical Learning” by Hastie, Tibshirani and Friedman, and on
“The Akaike information criterion: Background, derivation, properties, application, interpretation, and refinements” (2019) by Cavanaugh and Neath.

Overview

This class assumes you are familiar with Multiple Linear Regression and Logistic Regression.
We will build on this knowledge base to improve our predictions and try to find solutions to other challenges.
First let us get comfortable with notation and terminology.

Supervised Models

We have labeled values of the target/response variable \(Y\).
We have labeled values of the predictor(s) \(X\).
\[ \begin{aligned} Y=f(X) \end{aligned} \]
\(f\) is the function relating \(X\) to \(Y\). In multiple linear regression (M.L.R.), \(f\) is defined as

\[ \begin{aligned} Y=\beta_{0}+\beta_{1} \times X_{1}+\ldots+\beta_{K} \times X_{K} + \epsilon \end{aligned} \]

Learning about the Learner I

\(f\) is an assumed learning function.
In M.L.R., the function is a linear combination of the model parameters (\(\beta\)) and the features (predictors/independent variables) \(X\).
Learning, in this context, refers to using the data to know more about the model parameters.
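To make "learning about the parameters" concrete, here is a minimal sketch on simulated data. The coefficient values, sample size, and seed are illustrative assumptions, not values from the course material. We generate data from a known linear function and check how close the estimates from lm() come to the values we chose.

```r
# Simulate data from a known linear function, then "learn" the parameters.
set.seed(1)                                 # arbitrary seed for reproducibility
n  <- 500
x1 <- rnorm(n)
x2 <- rnorm(n)
beta_true <- c(2, 0.5, -1)                  # illustrative values: intercept, x1, x2
y  <- beta_true[1] + beta_true[2] * x1 + beta_true[3] * x2 + rnorm(n, sd = 1)

fit      <- lm(y ~ x1 + x2)                 # least-squares estimates of the betas
beta_hat <- coef(fit)

# Side-by-side comparison of the chosen values and the estimates.
print(round(cbind(true = beta_true, estimate = beta_hat), 3))
```

In real problems the "true" column is never available; the simulation only illustrates what the estimates are trying to recover.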

Learning about the Learner II

We will never know the true \(f\) that generates the data.
However, we will be able to say whether the data support one model better than another.
There is a difference between inferential models and predictive models.
Inferential: do we mainly care about how changing the features would affect the target variable?
Predictive: do we mainly care about the accuracy and precision of the predictions?

What sort of problems can we solve using f?

Ex_1: Binary Classification
- We would like to know whether a healthcare provider commits fraud, waste, or abuse in Medicare/Medicaid.
- Data is available on whether providers committed fraud, waste, or abuse \(Y\).
- Data is available on the providers' business practices \(X\).

What sort of problems can we solve using f?

Ex_2: Multi-Category Classification
- We would like to find the probability that an athlete's maximal vertical jump is going to increase, decrease, or not change.
- Data is available on how many inches each athlete has jumped vertically over \(T\) weeks \(Y\).
- Data is available on their training history and subjective assessments of their health \(X\).

What sort of problems can we solve using f?

Ex_3: Regression
- We would like to predict house prices.
- Data is available on the prices of houses sold in the last 2 years \(Y\).
- Data is available on the features of the houses that were sold \(X\).

What problems does this class help solve?

We already know how to apply linear and logistic regression.
If we have \(k\) features, where \(k\) is either larger than the number of observations \(n\) or simply a relatively large number, what do we do?
If our purpose is mainly prediction, how do we add complexity to our models for a better fit?
What trade-offs are there between model complexity and different types of errors?
What are the methods to compare models?
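The first question can be illustrated with a small sketch. The numbers (10 observations, 15 features) and the pure-noise data are assumptions made up for the illustration. With more features than observations, lm() cannot identify all coefficients and reports the ones it drops as NA, while still fitting the noise exactly.

```r
set.seed(2)
n <- 10                                    # observations
k <- 15                                    # features, more than n
X <- matrix(rnorm(n * k), nrow = n)
y <- rnorm(n)                              # pure noise, unrelated to X

fit <- lm(y ~ X)

n_estimated   <- sum(!is.na(coef(fit)))    # coefficients lm() could identify
n_dropped     <- sum(is.na(coef(fit)))     # coefficients reported as NA
max_abs_resid <- max(abs(resid(fit)))      # size of the largest residual

c(estimated = n_estimated, dropped = n_dropped, max_abs_resid = max_abs_resid)
```

The residuals are numerically zero even though the response has nothing to do with the features, which is why ordinary least squares alone is not enough in this setting.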

Concepts to go over:

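Probability Distribution: Description of the uncertainty in how the data is generated.
Likelihood: Joint probability density of the observed data, viewed as a function of the model parameters.
Entropy: A measure of the uncertainty in a probability distribution, that is, the average amount of information in an observation drawn from it.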
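A small sketch may help fix these two ideas before they are treated formally. Everything here is an illustrative assumption: the simulated normal sample, the grid of candidate means, and the helper names loglik and entropy_bits. The first part evaluates the log-likelihood of a sample over a grid of candidate means; the second computes the Shannon entropy of a coin flip.

```r
set.seed(3)
y <- rnorm(50, mean = 1, sd = 2)          # simulated data; the mean used here is 1

# Log-likelihood of the sample as a function of the mean (sd held at 2).
loglik  <- function(mu) sum(dnorm(y, mean = mu, sd = 2, log = TRUE))
mu_grid <- seq(-2, 4, by = 0.01)
ll      <- sapply(mu_grid, loglik)
mu_best <- mu_grid[which.max(ll)]         # grid value with the highest log-likelihood

c(mu_best = mu_best, sample_mean = mean(y))

# Shannon entropy (in bits) of a Bernoulli(p) variable.
entropy_bits <- function(p) {
  q <- c(p, 1 - p)
  q <- q[q > 0]                           # treat 0 * log(0) as 0
  -sum(q * log2(q))
}

entropy_bits(0.5)                         # fair coin: most uncertain
entropy_bits(0.9)                         # biased coin: more predictable
```

The grid value that maximizes the log-likelihood sits next to the sample mean, and the fair coin has higher entropy than the biased one.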

What are the two main approaches to predictions?

Global Models
- Ex: Linear Models.
Local Models
- Ex: K-Nearest Neighbor Models.

Linear Models

Look at the linear model equation again:

\[ \begin{aligned} Y=\beta_{0}+\beta_{1} \times X_{1}+\ldots+\beta_{K} \times X_{K} + \epsilon \end{aligned} \]

This model implies:
1. Whatever the value of \(X_{1}\) is, if you increase it by one unit while holding the other features fixed, the expected value of \(Y\) changes by \(\beta_{1}\) units.
2. The implication in (1) is true regardless of the value of \(X_{2}\).
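The two implications can be checked numerically. The coefficients below are hypothetical, and the function g with an interaction term is added only as a contrast; it is not a model from these notes.

```r
b <- c(b0 = 1, b1 = 2, b2 = -3)           # hypothetical coefficients

# Linear model without the error term.
f <- function(x1, x2) b[["b0"]] + b[["b1"]] * x1 + b[["b2"]] * x2

# Effect of a one-unit increase in x1 from different starting points.
f(1, 0)  - f(0, 0)                        # x1: 0 -> 1 with x2 = 0
f(6, 0)  - f(5, 0)                        # x1: 5 -> 6 with x2 = 0
f(1, 10) - f(0, 10)                       # x1: 0 -> 1 with x2 = 10

# Contrast: with an interaction term the effect of x1 depends on x2.
g <- function(x1, x2) f(x1, x2) + 0.5 * x1 * x2
g(1, 0)  - g(0, 0)
g(1, 10) - g(0, 10)
```

Every difference computed with f equals b1, while the differences computed with g change with x2. That is exactly the flexibility the plain linear model gives up.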

Starting out with a Theoretical Error

The following code and data are from: https://rdrr.io/cran/ElemStatLearn/man/mixture.example.html
We gather from the data object the features x and the response/target variable y (stored below as g).

library(ElemStatLearn)  # provides the mixture.example data object
library(ggplot2)
library(tidyr)
library(HelpersMG)
library(ggpubr)
x <- mixture.example$x  # matrix of features
g <- mixture.example$y  # 0/1 responses

The object x has 200 rows and 2 columns, so there are also 200 associated responses.
The response variable has 100 values of 0 and 100 values of 1.

Starting out with a Theoretical Error

(Intercept)          x1          x2 
  0.3290614  -0.0226360   0.2495983

[1] "Number of Predictions"

[1] 200

[1] "Number of correct Predictions"

[1] 146
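The code behind the printed output is not shown in these notes. The coefficient names and counts are consistent with regressing the 0/1 response on the two features and predicting class 1 when the fitted value exceeds 0.5, but that reading is an assumption. Below is a sketch of that procedure on simulated data: the two Gaussian clouds stand in for mixture.example and will not reproduce the numbers above.

```r
set.seed(4)
n_per_class <- 100

# Two Gaussian clouds standing in for the two classes (illustrative only).
x <- rbind(matrix(rnorm(2 * n_per_class, mean = 0), ncol = 2),
           matrix(rnorm(2 * n_per_class, mean = 1), ncol = 2))
g <- rep(c(0, 1), each = n_per_class)

fit  <- lm(g ~ x)                          # linear regression on the 0/1 response
pred <- as.numeric(fitted(fit) > 0.5)      # predict class 1 when the fitted value exceeds 0.5

coef(fit)
n_pred    <- length(pred)                  # number of predictions
n_correct <- sum(pred == g)                # number of correct predictions
c(predictions = n_pred, correct = n_correct)
```

These are in-sample predictions, so the proportion correct is an optimistic estimate of how the rule would do on new data.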

K-Nearest Neighbor Models

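To predict the value of \(Y\) at a point \(x\) (\(\hat{Y}\)), average the \(y_{i}\) of the \(K\) training points whose \(x_{i}\) are closest to \(x\).
\[ \begin{aligned} \hat{Y}(x)=\frac{1}{K} \sum_{x_{i} \in N_{K}(x) }{y_{i}} \end{aligned} \]
Here \(N_{K}(x)\) is the neighborhood of \(x\) made up of the \(K\) closest training points.
If this is a classification problem with 0/1 labels, \(\hat{Y}\) is the proportion of the \(K\) neighbors that belong to class 1.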
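The averaging rule is short enough to write out by hand. This is a minimal sketch, assuming Euclidean distance and a tiny made-up dataset; knn_predict is a helper written for this illustration, not a function from a package.

```r
# Predict at a single query point x0 by averaging the responses of the
# K training points closest to it (Euclidean distance).
knn_predict <- function(x0, X, y, K) {
  d <- sqrt(rowSums(sweep(X, 2, x0)^2))    # distance from each row of X to x0
  neighbors <- order(d)[seq_len(K)]        # indices of the K nearest points
  mean(y[neighbors])
}

# Five training points with 0/1 labels (made up for the example).
X <- matrix(c(0, 0,
              1, 0,
              0, 1,
              5, 5,
              6, 5), ncol = 2, byrow = TRUE)
y <- c(0, 0, 1, 1, 1)

p_near_origin <- knn_predict(c(0.2, 0.1), X, y, K = 3)   # neighbors are rows 1, 2, 3
p_far_corner  <- knn_predict(c(5.5, 5.0), X, y, K = 2)   # neighbors are rows 4 and 5
c(p_near_origin, p_far_corner)
```

Near the origin one of the three neighbors has label 1, so the prediction is the proportion 1/3; at the far corner both neighbors have label 1, so the prediction is 1.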

Rethinking the ideas behind the models I

Look back at the two general model families we discussed.
- Linear models are global in the sense that the same relationship is assumed over the whole range of values of \(X\) when you make inferences on how \(X\) affects \(Y\).
- K-Nearest Neighbors, on the other hand, bases each prediction of \(Y\) only on neighboring observations, where neighbors are determined by the distance between their \(X_{i}\) values and the point of interest.

Pros and Cons of Global and Local

As in most things in life there is no superior solution.
For some problems we use a hammer, and others a wrench.
Global models make generalized inferences but are inflexible: the relationship between \(X\) and \(Y\) is fixed.
Local models are flexible because they establish the relationship between \(X\) and \(Y\) based on neighboring values.
Many models exist between the two extreme cases.
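The hammer-and-wrench point can be sketched with a simulation. All settings are assumptions chosen for illustration: one feature, a straight-line truth and a sine truth, K = 5, and the helper functions knn_reg and compare. Each approach is fit on a training sample and scored by mean squared error on a separate test sample.

```r
set.seed(5)

# One-dimensional K-nearest-neighbor regression (helper written for this sketch).
knn_reg <- function(x_new, x, y, K) {
  sapply(x_new, function(x0) mean(y[order(abs(x - x0))[seq_len(K)]]))
}

# Simulate from a given truth f, then compare the test MSE of the two approaches.
compare <- function(f, n_train = 200, n_test = 2000, K = 5) {
  x_tr <- runif(n_train, 0, 6); y_tr <- f(x_tr) + rnorm(n_train, sd = 0.3)
  x_te <- runif(n_test, 0, 6);  y_te <- f(x_te) + rnorm(n_test, sd = 0.3)
  train <- data.frame(x = x_tr, y = y_tr)
  pred_lm  <- predict(lm(y ~ x, data = train), newdata = data.frame(x = x_te))
  pred_knn <- knn_reg(x_te, x_tr, y_tr, K)
  c(linear = mean((y_te - pred_lm)^2), knn = mean((y_te - pred_knn)^2))
}

mse_linear_truth <- compare(function(x) 1 + 0.5 * x)   # truth is a straight line
mse_sine_truth   <- compare(function(x) sin(x))        # truth is curved
rbind(mse_linear_truth, mse_sine_truth)
```

When the truth really is a straight line, the global model should have the lower test error; when the truth is curved, the local model should win. Neither approach dominates.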

Perfect Prediction

The more complexity you are willing to introduce into your model, the better its fit to the data you already have is going to be.
You can add complexity by constructing new features from the variables you already have: interactions, polynomials, etc.
However, the more complicated the model is, the more likely you are to decrease its generalizability.
The overfitting and underfitting conundrum is referred to as the bias-variance tradeoff.
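A sketch of this effect, assuming a made-up sine-shaped truth, 30 training points, and polynomial features of increasing degree. The training error can only go down as the degree grows, while the error on fresh data does not have to follow.

```r
set.seed(6)
n <- 30
x_tr <- runif(n, -1, 1);    y_tr <- sin(3 * x_tr) + rnorm(n, sd = 0.3)
x_te <- runif(1000, -1, 1); y_te <- sin(3 * x_te) + rnorm(1000, sd = 0.3)
train <- data.frame(x = x_tr, y = y_tr)

degrees <- 1:10
errors <- t(sapply(degrees, function(d) {
  fit <- lm(y ~ poly(x, d), data = train)              # polynomial of degree d
  c(degree    = d,
    train_mse = mean(resid(fit)^2),                    # error on the data used to fit
    test_mse  = mean((y_te - predict(fit, newdata = data.frame(x = x_te)))^2))
}))
round(errors, 3)
```

Comparing the train_mse and test_mse columns row by row shows where added complexity stops helping on new data.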

Overfitting

To understand the bias-variance tradeoff, we will first conduct simulations that demonstrate overfitting and introduce basic concepts.
If you have taken ANYL 3334, you should know that if you have a dataset of N responses, a linear model with an intercept and N-1 linearly independent features (covariates, independent variables) would perfectly predict these N responses in-sample.
How are we to know in the non-extreme cases whether we are overfitting our model?
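The extreme case is easy to simulate. In this sketch both the responses and the features are pure noise (an assumption made so that any fit is overfitting by construction), and we track R-squared as features are added one at a time up to N-1.

```r
set.seed(7)
N <- 20
y <- rnorm(N)                                  # responses: pure noise
X <- matrix(rnorm(N * (N - 1)), nrow = N)      # N - 1 features unrelated to y

# In-sample R-squared using the first k noise features, for k = 1, ..., N - 1.
r2 <- sapply(1:(N - 1), function(k) {
  fit <- lm(y ~ X[, 1:k, drop = FALSE])
  1 - sum(resid(fit)^2) / sum((y - mean(y))^2)
})
round(r2, 3)
```

R-squared never decreases as features are added and reaches 1 (up to rounding error) with N-1 features, even though none of them carries information about the response.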

Overfitting

We could use cross-validation methods, which divide our dataset into training and test sets. However:
1. Not all models can be cross-validated straightforwardly (time-series, spatial models).
2. It would be a waste of resources, and in many cases not feasible, to cross-validate every possible model.
3. This is why we need measures that identify good candidate models.
4. The two options are cross-validation and information criteria.
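For reference, here is a minimal sketch of 5-fold cross-validation in base R. The simulated linear data, the number of folds, the two candidate formulas, and the helper cv_mse are all assumptions for the illustration.

```r
set.seed(8)
n <- 100
dat <- data.frame(x = runif(n, -1, 1))
dat$y <- 1 + 2 * dat$x + rnorm(n, sd = 0.5)             # the simulated truth is linear

n_folds <- 5
fold <- sample(rep(seq_len(n_folds), length.out = n))   # random fold labels

# Average held-out mean squared error for a given model formula.
cv_mse <- function(form) {
  fold_mse <- sapply(seq_len(n_folds), function(j) {
    fit  <- lm(form, data = dat[fold != j, ])           # train on the other folds
    test <- dat[fold == j, ]
    mean((test$y - predict(fit, newdata = test))^2)     # error on the held-out fold
  })
  mean(fold_mse)
}

cv_results <- c(linear   = cv_mse(y ~ x),
                degree_8 = cv_mse(y ~ poly(x, 8)))
cv_results
```

Each candidate model is refit once per fold, which is the resource cost mentioned in point 2: with many candidate models the number of fits grows quickly.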

Informational Criteria

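Akaike’s Information Criterion (AIC)
\[ \begin{aligned} AIC=-2 \times \text{LogLikelihood}_{n}(\hat{\theta}_{k})+(2\times k) \end{aligned} \]
Bayesian Information Criterion (BIC)
\[ \begin{aligned} BIC=-2 \times \text{LogLikelihood}_{n}(\hat{\theta}_{k})+(\log(n)\times k) \end{aligned} \]
Here \(n\) is the number of observations, \(k\) is the number of estimated parameters, and \(\hat{\theta}_{k}\) is the maximum likelihood estimate of those parameters.
Both have the form: Goodness of Fit + Complexity Penalty. Lower values are better.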
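Both criteria can be computed by hand from the maximized log-likelihood and compared with R's built-in AIC() and BIC(). The simulated data below is an assumption for the illustration. Note that for a linear model R counts the error variance as an estimated parameter, so a model with an intercept and one slope has k = 3.

```r
set.seed(9)
n <- 100
x <- rnorm(n)
y <- 1 + 2 * x + rnorm(n)
fit <- lm(y ~ x)

ll <- logLik(fit)                      # maximized log-likelihood
k  <- attr(ll, "df")                   # estimated parameters: 2 coefficients + error variance

aic_manual <- -2 * as.numeric(ll) + 2 * k
bic_manual <- -2 * as.numeric(ll) + log(n) * k

c(aic_manual = aic_manual, AIC = AIC(fit))
c(bic_manual = bic_manual, BIC = BIC(fit))
```

The manual values agree with the built-in functions, which confirms the two formulas term by term.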

Akaike Information Criterion

“Information Theory and an Extension of the Maximum Likelihood Principle.” Hirotugu Akaike (1973)
“The Akaike information criterion: Background, derivation, properties, application, interpretation, and refinements” Cavanaugh, Neath (2019)

Ordinary Least Squares Regression Example

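Recall that minimizing the mean squared error and maximum likelihood estimation (MLE) with normally distributed errors lead to the same estimator \(\hat{\beta}\):
\[ \begin{aligned} \hat{\beta} &= \underset{\beta}{\arg\min}\ \frac{1}{N}\sum_{i=1}^{N}(Y_i-Pred_i)^{2} \\ &= (X'X)^{-1}X'Y \end{aligned} \]
where \(Pred_i\) is the model's prediction for observation \(i\).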
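The equivalence can be checked numerically. This sketch uses simulated data and three routes to the same estimate: the closed form, lm(), and numerical maximization of a normal log-likelihood. Fixing the error standard deviation at 1 inside negloglik is a simplifying assumption; the maximizing coefficients do not depend on that value.

```r
set.seed(10)
n  <- 200
x1 <- rnorm(n)
x2 <- rnorm(n)
Y  <- 1 + 2 * x1 - 0.5 * x2 + rnorm(n)
X  <- cbind(1, x1, x2)                                   # design matrix with an intercept column

beta_closed <- as.numeric(solve(t(X) %*% X, t(X) %*% Y)) # (X'X)^{-1} X'Y via a linear solve
beta_lm     <- as.numeric(coef(lm(Y ~ x1 + x2)))         # least squares via lm()

# Negative normal log-likelihood as a function of the coefficients (sd fixed at 1).
negloglik <- function(beta) -sum(dnorm(Y, mean = X %*% beta, sd = 1, log = TRUE))
beta_mle  <- optim(c(0, 0, 0), negloglik, method = "BFGS")$par

round(rbind(beta_closed, beta_lm, beta_mle), 4)
```

The closed form and lm() agree to machine precision; the numerical optimizer agrees up to its convergence tolerance.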

Ordinary Least Squares Regression Example

Recall also that you can draw a prediction line by minimizing the Sum of Squared Errors without assuming normally distributed errors.
The assumption of normally distributed errors creates a more inferential framework: it gives us confidence and prediction intervals, on both \(Y\) and \(\beta\).
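In R these intervals come from confint() and predict(). The sketch below uses simulated data, and the query point x = 5 is an arbitrary choice for the illustration.

```r
set.seed(11)
n <- 50
x <- runif(n, 0, 10)
y <- 3 + 1.5 * x + rnorm(n, sd = 2)
fit <- lm(y ~ x)

beta_int <- confint(fit, level = 0.95)       # confidence intervals on the betas

new_x    <- data.frame(x = 5)
conf_int <- predict(fit, new_x, interval = "confidence", level = 0.95)  # mean of Y at x = 5
pred_int <- predict(fit, new_x, interval = "prediction", level = 0.95)  # a new Y at x = 5

beta_int
rbind(confidence = conf_int[1, ], prediction = pred_int[1, ])
```

Both intervals at x = 5 are centered on the same fitted value, but the prediction interval is wider because it also accounts for the error of an individual new observation.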

What comes next

Being able to run some new method in the age of automated intelligence is not a skill. The skills are:

1. Comprehending the process that generates the data. In other words, at least being able to communicate with a subject-matter expert (SME).
2. Comprehending the structure behind the models and the methods.

To dive deeper into predictive models, we need to discuss two main topics for item 2: likelihood and entropy.