Model selection - Information criteria, part II

Submitted by Leo on Sat, 05/03/2008 - 00:57

Now for the hardcore information criteria part :) The first one can be found here.

The goal is still the same - pick the model that maximizes the log-likelihood of the data, with the parameters integrated out. This is given by $\ln p(D) = \ln \int p(D \mid \theta)\, p(\theta)\, d\theta$, where $\theta$ is a $K$-dimensional parameter vector. We can approximate the integral with a Laplace approximation, which is similar in idea to the previous post - the probability mass is assumed to be concentrated around the mode of the distribution. We can fit a normal distribution with the mode as its mean, and the variance approximated from a Taylor expansion at the mode. The next 2 paragraphs can be skipped if you believe this :)

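For example, to approximate a function $f(x)$ that has a mode (and thus a local maximum) at $x_0$, we use the 2nd order Taylor expansion of its logarithm:

$\ln f(x) \approx \ln f(x_0) + \frac{1}{2} (x - x_0)^T \left[ \nabla\nabla \ln f(x_0) \right] (x - x_0)$

(the first order term is 0 because of the local maximum)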

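Taking $A = -\nabla\nabla \ln f(x_0)$ as the negative of the second derivative matrix, we get $f(x) \approx f(x_0) \exp\left(-\frac{1}{2} (x - x_0)^T A (x - x_0)\right)$. If we are looking for a probability distribution that is proportional to $f(x)$, we have $x_0$ as the mean, $A^{-1}$ as the covariance matrix, and $|A|^{1/2} / (2\pi)^{K/2}$ as the normalizing coefficient - voila!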
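
To make this concrete, here is a minimal 1-dimensional sketch in Python. The test function f(x) = x^a * exp(-x) is chosen here only for illustration (it does not appear in the derivation); its integral over positive x is known exactly as Gamma(a+1), so the quality of the Gaussian fit can be checked. The names log_f and laplace_integral are hypothetical, and the second derivative is taken by finite differences rather than by hand.

```python
import math

def log_f(x, a=10.0):
    # ln f(x) for the illustrative function f(x) = x**a * exp(-x), x > 0
    return a * math.log(x) - x

def laplace_integral(log_f, x0, h=1e-3):
    # A = negative second derivative of ln f at the mode (central difference)
    A = -(log_f(x0 + h) - 2.0 * log_f(x0) + log_f(x0 - h)) / h ** 2
    # In 1 dimension: integral of f ~= f(x0) * sqrt(2*pi / A)
    return math.exp(log_f(x0)) * math.sqrt(2.0 * math.pi / A), A

a = 10.0
x0 = a  # mode of f: d/dx (a*ln x - x) = a/x - 1 = 0
approx, A = laplace_integral(log_f, x0)
exact = math.gamma(a + 1.0)  # exact value of the integral over (0, inf)
rel_err = abs(approx - exact) / exact
print("A =", A, "Laplace =", approx, "exact =", exact, "relative error =", rel_err)
```

For a = 10 the approximation lands within about 1% of the exact value. For this particular function the Laplace approximation is the same thing as Stirling's formula for the factorial.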

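So we can fit a Gaussian to a function - back to information criteria. We'll fit a Gaussian to $p(D \mid \theta)\, p(\theta)$ at its mode $\hat\theta$ (the most likely parameter setting), with $A$ the negative second derivative matrix of $\ln p(D \mid \theta)\, p(\theta)$ at $\hat\theta$:

$\ln p(D) \approx \ln p(D \mid \hat\theta) + \ln p(\hat\theta) - \frac{1}{2} \ln |A| + \frac{K}{2} \ln 2\pi$
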
As before, the first term is the fit of the model to the data. The rest of the terms are the complexity penalty. The second term is small (if we assume a wide prior), and the last term scales only with $K$, not with the amount of data - the main penalty comes from $-\frac{1}{2} \ln |A|$.

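To evaluate the determinant of $A$ (the inverse of the covariance matrix of the fitted Gaussian), we assume that it has full rank, and that the data points are iid. This means that $A = \sum_{n=1}^{N} A_n$ is a sum of contributions from the $N$ data points, and since the data is iid, $A \approx N \hat{A}$, where $\hat{A}$ is the average contribution of a single data point. So $\ln |A| = \ln |N \hat{A}| = K \ln N + \ln |\hat{A}|$. Again, the last term is constant (it does not grow with $N$), so all in all we have $\ln p(D) \approx \ln p(D \mid \hat\theta) - \frac{K}{2} \ln N$.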
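
The determinant step is easy to check numerically. The sketch below uses a made-up 2 x 2 per-point matrix (A_hat and N are hypothetical values, not taken from any model) and confirms that scaling a K x K matrix by N adds K ln N to the log-determinant.

```python
import math

def det2(M):
    # Determinant of a 2x2 matrix given as nested tuples
    return M[0][0] * M[1][1] - M[0][1] * M[1][0]

K = 2                                # number of parameters
N = 50                               # number of data points (made up)
A_hat = ((1.5, 0.3), (0.3, 0.8))     # hypothetical per-point contribution
A = tuple(tuple(N * a for a in row) for row in A_hat)  # A = N * A_hat

lhs = math.log(det2(A))                           # ln |N * A_hat|
rhs = K * math.log(N) + math.log(det2(A_hat))     # K ln N + ln |A_hat|
print(lhs, rhs)
```

The factor K shows up because every one of the K rows of the matrix picks up a factor of N.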

To recap, we estimated the probability of the data under the model, using the Laplace approximation to fit a Gaussian to the likelihood times the prior, and used some simplifying assumptions to arrive at the final form.
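
As a sanity check on the whole chain, here is a small sketch on a toy model where the evidence is known in closed form: iid data from a normal distribution with unknown mean mu and unit variance, and a wide normal prior on mu with standard deviation s. The model, the data values and all function names are assumptions made for this illustration. Because the log of likelihood times prior is exactly quadratic in mu here, the Laplace approximation should reproduce the exact evidence, while the final simplified form drops the terms that do not grow with N.

```python
import math

def log_lik(data, mu):
    # ln p(D | mu) for iid N(mu, 1) observations
    n = len(data)
    return -0.5 * n * math.log(2 * math.pi) - 0.5 * sum((x - mu) ** 2 for x in data)

def log_prior(mu, s):
    # ln p(mu) for the prior mu ~ N(0, s^2)
    return -0.5 * math.log(2 * math.pi * s * s) - mu * mu / (2 * s * s)

def exact_log_evidence(data, s):
    # Closed form: D is jointly Gaussian with covariance I + s^2 * ones
    n, S, SS = len(data), sum(data), sum(x * x for x in data)
    return (-0.5 * n * math.log(2 * math.pi) - 0.5 * math.log(1 + n * s * s)
            - 0.5 * (SS - s * s * S * S / (1 + n * s * s)))

def laplace_log_evidence(data, s, K=1):
    A = len(data) + 1.0 / (s * s)   # negative 2nd derivative of ln[p(D|mu) p(mu)]
    mu_hat = sum(data) / A          # mode of p(D|mu) p(mu)
    return (log_lik(data, mu_hat) + log_prior(mu_hat, s)
            - 0.5 * math.log(A) + 0.5 * K * math.log(2 * math.pi))

def simplified_log_evidence(data, K=1):
    # Final form: fit term minus (K/2) ln N, evaluated at the maximum
    # likelihood estimate (close to the mode when the prior is wide)
    n = len(data)
    return log_lik(data, sum(data) / n) - 0.5 * K * math.log(n)

data = [0.8, 1.3, 0.4, 1.9, 1.1, 0.7, 1.5, 1.0]  # made-up observations
s = 10.0                                         # wide prior
exact = exact_log_evidence(data, s)
laplace = laplace_log_evidence(data, s)
simplified = simplified_log_evidence(data)
print(exact, laplace, simplified)
```

In this toy case the gap between the exact value and the simplified form comes mostly from the prior term, which is exactly the kind of term the derivation treats as not scaling with the data.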

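The end result is pretty much the Bayesian Information Criterion (BIC), which is conventionally written multiplied by -2, as $-2 \ln p(D \mid \hat\theta) + K \ln N$. Compared to AIC's penalty of $2K$, it penalizes model complexity more as soon as $\ln N > 2$, that is, from 8 data points up. Note that the constants in front are not arbitrary, since we never made any simplifications for them, and there's a 2:1 ratio between the constant on the log-likelihood and the one on the penalty. This one was for you Matti :)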
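
For comparison, here is a minimal sketch of the two criteria in their conventional -2 ln L scaling (lower is better). The function names and the example numbers are hypothetical; the formulas are the standard ones, AIC = -2 ln L + 2K and BIC = -2 ln L + K ln N.

```python
import math

def aic(log_lik, k):
    # Akaike information criterion: penalty of 2 per parameter
    return -2.0 * log_lik + 2.0 * k

def bic(log_lik, k, n):
    # Bayesian information criterion: penalty of ln(n) per parameter
    return -2.0 * log_lik + k * math.log(n)

log_lik, k = -120.0, 3   # made-up maximized log-likelihood and parameter count
for n in (5, 7, 8, 100):
    print(n, aic(log_lik, k), bic(log_lik, k, n))
```

With the same fit, the BIC value overtakes the AIC value between N = 7 and N = 8, since that is where ln N crosses 2.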

typo

I assume there are typos here: "The a wide prior probability for the parameters, the second term is small, and the last term scales with - the main penalty comes from..."?

This second part went a little bit too hard core for me... I don't understand from this description what covariance matrix we're even talking about :) Can you expand or explain the beginning a little bit more as a comment, please.

If I understand anything from this, it is that the BIC doesn't assume that the likelihood is a sharp peak at the MAP, but instead assumes a Gaussian shape for the likelihood function?

More on the Laplace approximation

Thanks for pointing out the typos and for the feedback - to a large extent I'm writing these to think it through and remember it myself, so it may not be coherent :)

Instead of writing even more in the main text of the post - it's already really long - I'll just reply here.

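For a Gaussian in 1 dimension, you have 2 numbers that describe it - mean and variance. In K dimensions, you have a vector of length K for the mean, and a K x K matrix as a covariance matrix. The general form of the Gaussian with mean $m$ and covariance matrix $\Sigma$ is then given by

$p(x) = \frac{1}{(2\pi)^{K/2} |\Sigma|^{1/2}} \exp\left(-\frac{1}{2} (x - m)^T \Sigma^{-1} (x - m)\right)$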

The cool part here is that if you know p is a probability distribution, and you know the form of the exponent, you can immediately deduce the rest - the covariance matrix must be the inverse of the middle matrix, the mean must be the subtracted vector, and the normalizing coefficient must come out to the right value. And if you need to integrate over all possible x for the function exp(-0.5*(x-m)...), you know that will come to the inverse of the normalizing constant.
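
If you'd rather see that last claim than believe it, here is a rough numerical check in 2 dimensions. The mean and covariance values are made up for the example, and the integral is done with a plain grid sum, which is good enough for a smooth, quickly decaying function like this one.

```python
import math

m = (1.0, -0.5)                       # hypothetical mean vector
Sigma = ((2.0, 0.6), (0.6, 1.0))      # hypothetical covariance matrix
det = Sigma[0][0] * Sigma[1][1] - Sigma[0][1] * Sigma[1][0]
# Inverse of the 2x2 covariance: this is the "middle matrix" of the exponent
P = ((Sigma[1][1] / det, -Sigma[0][1] / det),
     (-Sigma[1][0] / det, Sigma[0][0] / det))

def unnormalized(x, y):
    # exp(-0.5 * (x-m)^T Sigma^{-1} (x-m)) without the normalizing coefficient
    dx, dy = x - m[0], y - m[1]
    q = P[0][0] * dx * dx + 2.0 * P[0][1] * dx * dy + P[1][1] * dy * dy
    return math.exp(-0.5 * q)

h = 0.1                # grid spacing
half_width = 12.0      # integrate over a square of +/- 12 around the mean
n = int(round(2 * half_width / h))
numeric = 0.0
for i in range(n + 1):
    x = m[0] - half_width + i * h
    for j in range(n + 1):
        y = m[1] - half_width + j * h
        numeric += unnormalized(x, y)
numeric *= h * h

K = 2
analytic = (2 * math.pi) ** (K / 2) * math.sqrt(det)  # inverse of the coefficient
print(numeric, analytic)
```

The grid sum and the closed-form value (2*pi)^(K/2) * |Sigma|^(1/2) should agree to many digits, which is the whole trick behind replacing the hairy integral with a determinant.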

In the actual approximation, we take the same formula as in the example, and fit the function to it - the hairy integral over all the parameters comes down to just the inverse of the normalizing constant, and the rest is just assumptions and interpretation.

I'm happy to do it on paper instead :)

As for what BIC does - you can just think of it as approximating the model evidence. It does not assume a Gaussian distribution, but tries to approximate it by one. In cases where you do have one mode that is not very skewed, this should give good results - but of course it's not always the case.

If you drop some assumptions, you can get better approximations for the tougher cases. But if you can afford to do it, you should do model selection in a fully Bayesian way instead of using these hacks :)

nag 2

Some more nagging: can you add a link to the first entry in this series?