“The Akaike information criterion: Background, derivation, properties, application, interpretation, and refinements” Joseph E. Cavanaugh, Andrew A. Neath, 2019
Topics of Discussion
An important topic to discuss in predictive analytics is the trade-off between model complexity and model fit.
\(I(\boldsymbol\theta_{k})\) is the Kullback-Leibler information between the true generating density \(g(\boldsymbol y)\) and the candidate model's density \(f(\boldsymbol y|\boldsymbol\theta_{k})\).
The expectation below is the mean of the measure in brackets under repeated realizations \(\boldsymbol y\) generated under the true model \(g(\boldsymbol y)\).
\[\begin{align}
2 I(\boldsymbol\theta_{k})= E \bigg[2 log [g(\boldsymbol y)] - 2 log [f(\boldsymbol y|\boldsymbol\theta_{k})] \bigg]
\end{align}\]
This can be rewritten as
\[\begin{align}
2 I(\boldsymbol\theta_{k})= E \bigg[-2 log [f(\boldsymbol y|\boldsymbol\theta_{k})] \bigg] - E \bigg[-2 log [g(\boldsymbol y)] \bigg]
\end{align}\]
Since it is not symmetric, the Kullback-Leibler information is not a true distance.
Recall that the likelihood evaluates to a single value for the observed data.
\(I(\boldsymbol\theta_{k})\), in contrast, is an expectation over repeated realizations from \(g(\boldsymbol y)\).
As the separation between \(g(\boldsymbol y)\) and \(f(\boldsymbol y|\boldsymbol\theta_{k})\) grows, \(I(\boldsymbol\theta_{k})\) also grows.
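As a small numerical illustration (not from the paper), \(2I(\boldsymbol\theta_{k})\) can be approximated by Monte Carlo; the choice of \(g\) as N(0, 1), \(f\) as N(\(\mu\), 1), and the helper twoI below are illustrative assumptions. For these densities the exact value is \(\mu^2\), so the approximations can be checked directly.
# Monte Carlo approximation of 2 * I(theta) when g = N(0, 1) and f = N(mu, 1).
set.seed(1)
y = rnorm(1e5, mean = 0, sd = 1)   # repeated realizations generated under g
twoI = function(mu) {
  mean(2 * dnorm(y, 0, 1, log = TRUE) - 2 * dnorm(y, mu, 1, log = TRUE))
}
twoI(0)     # candidate equals g: approximately 0
twoI(0.5)   # mild misspecification: approximately 0.25
twoI(2)     # strong misspecification: approximately 4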
AIC Derivation IV
Since the term in \(I(\boldsymbol\theta_{k})\) that involves \(g(\boldsymbol y)\) does not depend on \(\boldsymbol\theta_{k}\), it can be ignored when ranking candidate models.
Therefore \(d(\boldsymbol\theta_{k}) = E\big[-2 log f(\boldsymbol y|\boldsymbol\theta_{k})\big]\) is an index that can stand in for \(I(\boldsymbol\theta_{k})\), and it is referred to as the Kullback discrepancy.
Even though the MLE \(\hat{\boldsymbol\theta}_{k}\) can be plugged into the candidate model's likelihood, the expectation is taken with respect to the unknown \(g(\boldsymbol y)\), and therefore \(d(\boldsymbol\theta_{k})\) is not available.
AIC Derivation V
\(-2 log f(\boldsymbol y|\hat{\boldsymbol\theta}_{k})\) is a negatively biased estimator of \(d(\hat{\boldsymbol\theta}_{k})\) because the MLE \(\hat{\boldsymbol\theta}_{k}\) is substituted for \(\boldsymbol\theta_{k}\) and evaluated on the same data used to estimate it.
The \(2k\) term becomes an adjustment factor that corrects for this bias.
Different distributional forms of \(f(\boldsymbol y| \hat{\boldsymbol \theta}_{k})\) can be compared, but models that involve transformations of \(y\) (e.g., \(y\) versus \(log(y)\)) cannot be compared.
If different forms of \(f\) are compared, we need to make sure to keep the constants in the log-likelihoods.
\(k\) is the expected value of the \(\chi^2_{k}\) distribution that accounts for the bias due to the MLE \(\hat{\boldsymbol\theta}_{k}\) being used instead of \(\boldsymbol\theta_{k}\).
If \(k\) is large and \(n\) is small, AIC remains negatively biased; that is why modified versions such as AICc exist (both criteria are written out below).
AIC will tend to select the model that minimizes out-of-sample prediction error.
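For concreteness, with \(k\) estimated parameters and sample size \(n\), AIC and its standard small-sample correction AICc take the forms
\[\begin{align}
AIC = -2 log f(\boldsymbol y|\hat{\boldsymbol\theta}_{k}) + 2k, \qquad AICc = AIC + \frac{2k(k+1)}{n-k-1}
\end{align}\]
The extra AICc penalty vanishes as \(n\) grows relative to \(k\) and blows up as \(k\) approaches \(n-1\).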
Simulations
library(tidyverse)
library(ggpubr)
library(MuMIn)

# Sample size.
N = 500
# Number of independent variables.
K = N - 1
# Create a target variable with no structure to X.
Y = rnorm(n = N, mean = 0, sd = 1)
X = matrix(nrow = N, ncol = K, data = rnorm(N * K, 0, 1))
dim(X)
[1] 500 499
There is no relationship between any of the variables within X; each of the N*K values is independently generated.
There is no relationship between Y and X.
Recalling measures of fit
\(r^2\) is the proportion of variation in the response variable that the model explains.
\(r^2=\frac{SSR}{SST}=1-\frac{SSE}{SST}\)
In other words, if your \(r^2\) is 1, your sum of squared residuals equals 0, implying that your model makes no in-sample prediction errors.
Fit vs Model Complexity
The problem with \(r^{2}\) is obvious: it can never decrease as variables are added, even when those variables are pure noise.
Measures such as adjusted \(r^{2}\), Akaike's Information Criterion (AIC), and the Bayesian Information Criterion (BIC) are indices that try to balance goodness of fit against model complexity.
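Adjusted \(r^2\), for instance, penalizes model size directly: \(adj\ r^2 = 1-(1-r^2)\frac{n-1}{n-k-1}\), which is undefined when \(k = n-1\) because the denominator is 0. The chunk that fits the full model is not shown above; a minimal sketch consistent with the NaN output below, assuming an ordinary least squares fit of Y on all columns of X, is:
object = lm(Y ~ X)            # assumed full fit: all K = N - 1 features plus intercept
sum_object = summary(object)  # summary.lm stores r.squared and adj.r.squared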
# Adjusted R2 is not defined due to denominator being 0.
sum_object$adj.r.squared
[1] NaN
Simulations
We will demonstrate how \(r^{2}\), adjusted \(r^2\), AIC, and BIC behave as we reduce the number of variables in the model.
Ideally what do we expect for the indices that account for model complexity?
We would like them to show that no model here has real value to speak of; however, we would expect these measures to improve in relative terms as we reduce the number of independent variables.
Simulations
# L is the number of columns to iteratively remove in the for loop.
L = N - 1
# L is also the number of models that are going to be run.
# Iteration information is going to be stored.
# Array sizes take this into account.
MSE_array = array(dim = c(L + 1))
ar2 = array(dim = c(L + 1))
r2 = array(dim = c(L + 1))
aic = array(dim = c(L + 1))
bic = array(dim = c(L + 1))
# We will need the iteration index in plotting functions.
run_ar2 = array(dim = c(L + 1))
Simulations-Explanation
We fit L models in a for loop; a sketch of one possible loop is given below.
Each iteration removes one additional feature from the model.
In every iteration of the loop, we record \(r^2\), adjusted \(r^2\), and the information criteria.
We would expect \(r^2\) to increase with the number of features in the model, while adjusted \(r^{2}\) takes the number of features into account.
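The loop that fills these arrays is not shown here; the sketch below is one possible version. The use of lm, the column ordering, and the run_ic bookkeeping (run_ic is needed by the plotting code that follows) are assumptions, so the original loop's exact indexing may differ.
run_ic = array(dim = c(L + 1))                   # assumed to mirror run_ar2
for (k_l in 1:L) {
  # Model with the first k_l columns of X.
  object = lm(Y ~ X[, 1:k_l, drop = FALSE])
  sum_object_l = summary(object)
  r2[k_l]        = sum_object_l$r.squared
  ar2[k_l]       = sum_object_l$adj.r.squared
  aic[k_l]       = AIC(object)
  bic[k_l]       = BIC(object)
  MSE_array[k_l] = mean(residuals(object)^2)
  run_ar2[k_l]   = k_l
  run_ic[k_l]    = k_l
}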
# Create a dataframe for plots.
aic_plot = as.data.frame(cbind(run_ic, aic, bic))
# Name the columns.
names(aic_plot) = c("run_ic", "aic", "bic")
# For plotting, make the dataframe longer.
vaic_plot = pivot_longer(aic_plot,
                         cols = c("aic", "bic"),
                         names_to = "names",
                         values_to = "values")
ic = ggplot(data = vaic_plot, aes(x = run_ic, y = values)) +
  geom_line() +
  scale_x_continuous("Num. Features",
                     breaks = c(1, 50, 100, 150, 200, 250, 300, 350, 400, 450, L)) +
  facet_wrap(~names, scales = "free")
AIC vs BIC Plot
Putting plots together
What do these plots tell us?
These plots are supposed to give us a set of candidate models that we can inspect further.
What they do not tell us is whether there is a model worth inspecting!
There is no model really worth building in this case!
This is why cross-validation is important, but it cannot be done in every case:
It may simply be impractical (for example, computationally).
CV might destroy the dependence structure in the data (e.g., spatial dependence). Workarounds do exist.
What are our “best” models?
The model that maximizes adjusted \(r^2\) corresponds to the following entry of the run_ar2 object:
run_ar2[which.max(ar2)]
[1] 499
The models that minimize AIC and BIC, on the other hand, correspond to the following entries:
# AIC
run_ic[which.min(aic)]
[1] 400
# BIC
run_ic[which.min(bic)]
[1] 1
Discussing Index Results
Under certain assumptions, AIC should lead to the best predictive model (asymptotic efficiency).
Under certain assumptions, BIC should converge to the true model (consistency).
Neither index is universally superior, and in the case of no relationship between target and features, neither is really appropriate.
Discussing Index Results
The AIC penalty does not depend on the sample size, whereas the per-parameter BIC penalty \(log(n)\) is larger than AIC's 2 as long as \(log(n)>2\); a short numerical check follows below.
If we ignore the case where the number of features equals the number of observations (I still have to work that case out; it probably has something to do with the likelihood not being a proper probabilistic function at that point), we can see that BIC was able to capture the true model by suggesting that the model with the fewest independent variables is the best.
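A quick numerical check of the penalty comparison above (illustrative arithmetic, not from the slides):
n = 500
log(n)      # about 6.21: the per-parameter BIC penalty log(n) exceeds AIC's 2
log(n) > 2  # TRUE whenever n > exp(2), i.e. n larger than about 7.39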
A correlated feature structure - Crux
We will insert a feature, correlated with the target, into the first column of the feature matrix; the loop below then includes it in every non-empty candidate model.
N = 500
K = N - 1
L = N - 250

# Target variable.
Y = rnorm(n = N, mean = 0, sd = 1)

# Features.
X = matrix(nrow = N, ncol = K, data = rnorm(N * K, 0, 1))
X[, 1] = Y / 0.7 + rnorm(N, 0, 1)

# Information arrays.
aic = array(dim = c(N))
bic = array(dim = c(N))
aicc = array(dim = c(N))
run = array(dim = c(N))

# We will not obtain information indices for this full fit.
object = glm(Y ~ X, family = gaussian)
sum_object = summary(object)
A correlated feature structure - Code
for (l in 1:(L - 1)) {
  object = glm(Y ~ X[, (1:l)], family = gaussian)
  sum_object = summary(object)
  aic[l + 1] = sum_object$aic
  aicc[l + 1] = AICc(object)
  bic[l + 1] = extractAIC(object, k = log(nrow(X)))[2]
  run[l + 1] = l + 1
}

object = glm(Y ~ 1, family = gaussian)
sum_object = summary(object)
aic[N] = sum_object$aic
aicc[N] = AICc(object)
bic[N] = extractAIC(object, k = log(nrow(X)))[2]
run[N] = 1

# Create a dataframe for plots.
aic_plot = as.data.frame(cbind(run, aic, aicc, bic))
# Name the columns.
names(aic_plot) = c("run", "aic", "aicc", "bic")
# For plotting, make the dataframe longer.
vaic_plot = pivot_longer(aic_plot,
                         cols = c("aic", "aicc", "bic"),
                         names_to = "names",
                         values_to = "values")
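The follow-up inspection for this correlated setting is not shown in the slides; mirroring the earlier pattern, one could check which stored run value each criterion favors (which.min skips the unfilled NA entries):
# Run value stored for the model favored by each criterion.
run[which.min(aic)]
run[which.min(aicc)]
run[which.min(bic)]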
A correlated feature structure - Code
Simulating AIC vs BIC cross validation
We will have to set certain parameters:
\(n=500\), \(K=N-1\), \(L=N-1\)
\(n^{Train}=333, n^{Test}=167\)
Run all models for each number of features.
Pick the models suggested by AIC and BIC.
Obtain \(MSE^{Test}\) using the coefficients estimated on the training set; a sketch of this procedure follows.
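A minimal sketch of this procedure, assuming lm fits on the first \(k\) columns of X, a single random split of the stated sizes, and the helper functions fit_k and test_mse; the cap of 100 candidate sizes is also an illustrative choice, not from the slides.
set.seed(3)
train_id = sample(N, 333)                      # n^Train = 333
test_id  = setdiff(seq_len(N), train_id)       # n^Test  = 167

# Fit a model with the first k features on the training rows only.
fit_k = function(k) {
  dat = data.frame(Y = Y, X[, 1:k, drop = FALSE])
  list(dat = dat, fit = lm(Y ~ ., data = dat[train_id, ]))
}

# Test MSE for one fitted model, using coefficients estimated on the training set.
test_mse = function(m) {
  preds = predict(m$fit, newdata = m$dat[test_id, ])
  mean((m$dat$Y[test_id] - preds)^2)
}

ks     = 1:100                                 # illustrative cap on candidate sizes
models = lapply(ks, fit_k)
best_a = which.min(sapply(models, function(m) AIC(m$fit)))   # model picked by AIC
best_b = which.min(sapply(models, function(m) BIC(m$fit)))   # model picked by BIC
c(k_AIC = ks[best_a], MSE_test_AIC = test_mse(models[[best_a]]),
  k_BIC = ks[best_b], MSE_test_BIC = test_mse(models[[best_b]]))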