“The Akaike information criterion: Background, derivation, properties, application, interpretation, and refinements” Joseph E. Cavanaugh, Andrew A. Neath, 2019
Topics of Discussion
An important topic to discuss in predictive analytics is the trade-off between model complexity and model fit.
\(I(\boldsymbol\theta_{k})\) is the Kullback-Leibler information between the true generating density \(g(\boldsymbol y)\) and the candidate model's density \(f(\boldsymbol y|\boldsymbol\theta_{k})\).
The expectation below is the mean of the measure in brackets under repeated realizations \(\boldsymbol y\) generated under the true model \(g(\boldsymbol y)\).
\[\begin{align}
2 I(\boldsymbol\theta_{k})= E \bigg[2 log [g(\boldsymbol y)] - 2 log [f(\boldsymbol y|\boldsymbol\theta_{k})] \bigg]
\end{align}\]
This can be rewritten as
\[\begin{align}
2 I(\boldsymbol\theta_{k})= E \bigg[-2 log [f(\boldsymbol y|\boldsymbol\theta_{k})] \bigg] - E \bigg[-2 log [g(\boldsymbol y)] \bigg]
\end{align}\]
Since it is not symmetric, the Kullback-Leibler information is not a true distance.
Recall that the likelihood evaluates to a single value for the observed data.
\(I(\boldsymbol\theta_{k})\), in contrast, is an expectation over repeated realizations from \(g(\boldsymbol y)\).
As the separation between \(g(\boldsymbol y)\) and \(f(\boldsymbol y|\boldsymbol\theta_{k})\) grows, \(I(\boldsymbol\theta_{k})\) also grows.
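As a small numerical illustration (not from the paper), \(2I(\boldsymbol\theta_{k})\) can be approximated by Monte Carlo; the choice of \(g\) as N(0, 1), \(f\) as N(\(\mu\), 1), and the helper twoI below are illustrative assumptions. For these densities the exact value is \(\mu^2\), so the approximations can be checked directly.
# Monte Carlo approximation of 2 * I(theta) when g = N(0, 1) and f = N(mu, 1).
set.seed(1)
y = rnorm(1e5, mean = 0, sd = 1)   # repeated realizations generated under g
twoI = function(mu) {
  mean(2 * dnorm(y, 0, 1, log = TRUE) - 2 * dnorm(y, mu, 1, log = TRUE))
}
twoI(0)     # candidate equals g: approximately 0
twoI(0.5)   # mild misspecification: approximately 0.25
twoI(2)     # strong misspecification: approximately 4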
AIC Derivation IV
Since the term in \(I(\boldsymbol\theta_{k})\) that involves \(g(\boldsymbol y)\) does not depend on \(\boldsymbol\theta_{k}\), it can be ignored when ranking candidate models.
Therefore \(d(\boldsymbol\theta_{k}) = E\big[-2 log f(\boldsymbol y|\boldsymbol\theta_{k})\big]\) is an index that can stand in for \(I(\boldsymbol\theta_{k})\), and it is referred to as the Kullback discrepancy.
Even though the MLE \(\hat{\boldsymbol\theta}_{k}\) can be plugged into the candidate model's likelihood, the expectation is taken with respect to the unknown \(g(\boldsymbol y)\), and therefore \(d(\boldsymbol\theta_{k})\) is not available.
AIC Derivation V
\(-2 log f(\boldsymbol y|\hat{\boldsymbol\theta}_{k})\) is a negatively biased estimator of \(d(\hat{\boldsymbol\theta}_{k})\) because the MLE \(\hat{\boldsymbol\theta}_{k}\) is substituted for \(\boldsymbol\theta_{k}\) and evaluated on the same data used to estimate it.
The \(2k\) term becomes an adjustment factor that corrects for this bias.
Different distributional forms of \(f(\boldsymbol y| \hat{\boldsymbol \theta}_{k})\) can be compared, but models that involve transformations of \(y\) (e.g., \(y\) versus \(log(y)\)) cannot be compared.
If different forms of \(f\) are compared, we need to make sure to keep the constants in the log-likelihoods.
\(k\) is the expected value of the \(\chi^2_{k}\) distribution that accounts for the bias due to the MLE \(\hat{\boldsymbol\theta}_{k}\) being used instead of \(\boldsymbol\theta_{k}\).
If \(k\) is large and \(n\) is small, AIC remains negatively biased; that is why modified versions such as AICc exist (both criteria are written out below).
AIC will tend to select the model that minimizes out-of-sample prediction error.
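For concreteness, with \(k\) estimated parameters and sample size \(n\), AIC and its standard small-sample correction AICc take the forms
\[\begin{align}
AIC = -2 log f(\boldsymbol y|\hat{\boldsymbol\theta}_{k}) + 2k, \qquad AICc = AIC + \frac{2k(k+1)}{n-k-1}
\end{align}\]
The extra AICc penalty vanishes as \(n\) grows relative to \(k\) and blows up as \(k\) approaches \(n-1\).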
Simulations
library(tidyverse)
library(ggpubr)
library(MuMIn)

# Sample size.
N = 500
# Number of independent variables.
K = N - 1
# Create a target variable with no structure to X.
Y = rnorm(n = N, mean = 0, sd = 1)
X = matrix(nrow = N, ncol = K, data = rnorm(N * K, 0, 1))
dim(X)
[1] 500 499
There is no relationship between any of the variables within X; each of the N*K values is independently generated.
There is no relationship between Y and X.
Recalling measures of fit
\(r^2\) is the proportion of variation in the response variable that the model explains.
\(r^2=\frac{SSR}{SST}=1-\frac{SSE}{SST}\)
In other words, if your \(r^2\) is 1, your sum of squared residuals equals 0, implying that your model makes no in-sample prediction errors.
Fit vs Model Complexity
The problem with \(r^{2}\) is obvious: it can never decrease as variables are added, even when those variables are pure noise.
Measures such as adjusted \(r^{2}\), Akaike's Information Criterion (AIC), and the Bayesian Information Criterion (BIC) are indices that try to balance goodness of fit against model complexity.
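Adjusted \(r^2\), for instance, penalizes model size directly: \(adj\ r^2 = 1-(1-r^2)\frac{n-1}{n-k-1}\), which is undefined when \(k = n-1\) because the denominator is 0. The chunk that fits the full model is not shown above; a minimal sketch consistent with the NaN output below, assuming an ordinary least squares fit of Y on all columns of X, is:
object = lm(Y ~ X)            # assumed full fit: all K = N - 1 features plus intercept
sum_object = summary(object)  # summary.lm stores r.squared and adj.r.squared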
# Adjusted R2 is not defined due to denominator being 0.
sum_object$adj.r.squared
[1] NaN
Simulations
We will demonstrate how \(r^{2}\), adjusted \(r^2\), AIC, and BIC behave as we reduce the number of variables in the model.
Ideally what do we expect for the indices that account for model complexity?
We would like them to show that no model here has real value to speak of; however, we would expect these measures to improve in relative terms as we reduce the number of independent variables.
Simulations
# L is the number of columns to iteratively remove in the for loop.
L = N - 1
# L is also the number of models that are going to be run.
# Iteration information is going to be stored.
# Array sizes take this into account.
MSE_array = array(dim = c(L + 1))
ar2 = array(dim = c(L + 1))
r2 = array(dim = c(L + 1))
aic = array(dim = c(L + 1))
bic = array(dim = c(L + 1))
# We will need the iteration index in plotting functions.
run_ar2 = array(dim = c(L + 1))
Simulations-Explanation
We fit L models in a for loop; a sketch of one possible loop is given below.
Each iteration removes one additional feature from the model.
In every iteration of the loop, we record \(r^2\), adjusted \(r^2\), and the information criteria.
We would expect \(r^2\) to increase with the number of features in the model, while adjusted \(r^{2}\) takes the number of features into account.
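The loop that fills these arrays is not shown here; the sketch below is one possible version. The use of lm, the column ordering, and the run_ic bookkeeping (run_ic is needed by the plotting code that follows) are assumptions, so the original loop's exact indexing may differ.
run_ic = array(dim = c(L + 1))                   # assumed to mirror run_ar2
for (k_l in 1:L) {
  # Model with the first k_l columns of X.
  object = lm(Y ~ X[, 1:k_l, drop = FALSE])
  sum_object_l = summary(object)
  r2[k_l]        = sum_object_l$r.squared
  ar2[k_l]       = sum_object_l$adj.r.squared
  aic[k_l]       = AIC(object)
  bic[k_l]       = BIC(object)
  MSE_array[k_l] = mean(residuals(object)^2)
  run_ar2[k_l]   = k_l
  run_ic[k_l]    = k_l
}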
# Create a dataframe for plots.
aic_plot = as.data.frame(cbind(run_ic, aic, bic))
# Name the columns.
names(aic_plot) = c("run_ic", "aic", "bic")
# For plotting, make the dataframe longer.
vaic_plot = pivot_longer(aic_plot,
                         cols = c("aic", "bic"),
                         names_to = "names",
                         values_to = "values")
ic = ggplot(data = vaic_plot, aes(x = run_ic, y = values)) +
  geom_line() +
  scale_x_continuous("Num. Features",
                     breaks = c(1, 50, 100, 150, 200, 250, 300, 350, 400, 450, L)) +
  facet_wrap(~names, scales = "free")
AIC vs BIC Plot
Putting plots together
What do these plots tell us?
These plots are supposed to give us a set of candidate models that we can inspect further.
What they do not tell us is whether there is a model worth inspecting!
There is no model really worth building in this case!
This is why cross-validation is important, but it cannot be done in every case:
It may simply be impractical (for example, computationally).
CV might destroy the dependence structure in the data (e.g., spatial dependence). Workarounds do exist.
What are our “best” models?
The model that maximizes adjusted \(r^2\) corresponds to the following entry of the run_ar2 object:
run_ar2[which.max(ar2)]
[1] 499
The models that minimize AIC and BIC, on the other hand, correspond to the following entries:
# AIC
run_ic[which.min(aic)]
[1] 400
# BIC
run_ic[which.min(bic)]
[1] 1
Discussing Index Results
Under certain assumptions, AIC should lead to the best predictive model (asymptotic efficiency).
Under certain assumptions, BIC should converge to the true model (consistency).
Neither index is universally superior, and in the case of no relationship between target and features, neither is really appropriate.
Discussing Index Results
The AIC penalty does not depend on the sample size, whereas the per-parameter BIC penalty \(log(n)\) is larger than AIC's 2 as long as \(log(n)>2\); a short numerical check follows below.
If we ignore the case where the number of features equals the number of observations (I still have to work that case out; it probably has something to do with the likelihood not being a proper probabilistic function at that point), we can see that BIC was able to capture the true model by suggesting that the model with the fewest independent variables is the best.
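A quick numerical check of the penalty comparison above (illustrative arithmetic, not from the slides):
n = 500
log(n)      # about 6.21: the per-parameter BIC penalty log(n) exceeds AIC's 2
log(n) > 2  # TRUE whenever n > exp(2), i.e. n larger than about 7.39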
A correlated feature structure - Crux
We will insert a feature, correlated with the target, into the first column of the feature matrix; the loop below then includes it in every non-empty candidate model.
N = 500
K = N - 1
L = N - 250

# Target variable.
Y = rnorm(n = N, mean = 0, sd = 1)

# Features.
X = matrix(nrow = N, ncol = K, data = rnorm(N * K, 0, 1))
X[, 1] = Y / 0.7 + rnorm(N, 0, 1)

# Information arrays.
aic = array(dim = c(N))
bic = array(dim = c(N))
aicc = array(dim = c(N))
run = array(dim = c(N))

# We will not obtain information indices for this full fit.
object = glm(Y ~ X, family = gaussian)
sum_object = summary(object)
A correlated feature structure - Code
for (l in 1:(L - 1)) {
  object = glm(Y ~ X[, (1:l)], family = gaussian)
  sum_object = summary(object)
  aic[l + 1] = sum_object$aic
  aicc[l + 1] = AICc(object)
  bic[l + 1] = extractAIC(object, k = log(nrow(X)))[2]
  run[l + 1] = l + 1
}

object = glm(Y ~ 1, family = gaussian)
sum_object = summary(object)
aic[N] = sum_object$aic
aicc[N] = AICc(object)
bic[N] = extractAIC(object, k = log(nrow(X)))[2]
run[N] = 1

# Create a dataframe for plots.
aic_plot = as.data.frame(cbind(run, aic, aicc, bic))
# Name the columns.
names(aic_plot) = c("run", "aic", "aicc", "bic")
# For plotting, make the dataframe longer.
vaic_plot = pivot_longer(aic_plot,
                         cols = c("aic", "aicc", "bic"),
                         names_to = "names",
                         values_to = "values")
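The follow-up inspection for this correlated setting is not shown in the slides; mirroring the earlier pattern, one could check which stored run value each criterion favors (which.min skips the unfilled NA entries):
# Run value stored for the model favored by each criterion.
run[which.min(aic)]
run[which.min(aicc)]
run[which.min(bic)]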
A correlated feature structure - Code
Simulating AIC vs BIC cross validation
We will have to set certain parameters:
\(n=500\), \(K=N-1\), \(L=N-1\)
\(n^{Train}=333, n^{Test}=167\)
Run all models for each number of features.
Pick the models suggested by AIC and BIC.
Obtain \(MSE^{Test}\) using the coefficients estimated on the training set; a sketch of this procedure follows.
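A minimal sketch of this procedure, assuming lm fits on the first \(k\) columns of X, a single random split of the stated sizes, and the helper functions fit_k and test_mse; the cap of 100 candidate sizes is also an illustrative choice, not from the slides.
set.seed(3)
train_id = sample(N, 333)                      # n^Train = 333
test_id  = setdiff(seq_len(N), train_id)       # n^Test  = 167

# Fit a model with the first k features on the training rows only.
fit_k = function(k) {
  dat = data.frame(Y = Y, X[, 1:k, drop = FALSE])
  list(dat = dat, fit = lm(Y ~ ., data = dat[train_id, ]))
}

# Test MSE for one fitted model, using coefficients estimated on the training set.
test_mse = function(m) {
  preds = predict(m$fit, newdata = m$dat[test_id, ])
  mean((m$dat$Y[test_id] - preds)^2)
}

ks     = 1:100                                 # illustrative cap on candidate sizes
models = lapply(ks, fit_k)
best_a = which.min(sapply(models, function(m) AIC(m$fit)))   # model picked by AIC
best_b = which.min(sapply(models, function(m) BIC(m$fit)))   # model picked by BIC
c(k_AIC = ks[best_a], MSE_test_AIC = test_mse(models[[best_a]]),
  k_BIC = ks[best_b], MSE_test_BIC = test_mse(models[[best_b]]))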