We covered a paper by Schadt and others last week that dealt with selecting the best model for the data. We want to select the simplest model that explains the most data, and there is a tradeoff between model fit and complexity.
Intuitively, there must be a tradeoff: models that are too simple can't explain all the data, and there are far too many complicated models to reliably pick the right one. In practice, people use various information criteria to formalize the tradeoff, usually of the form
$$\mathrm{IC}(M) = \log P(D \mid \hat{\theta}, M) - C(M)$$
where $\mathrm{IC}(M)$ is the information criterion, $P(D \mid \hat{\theta}, M)$ is the probability of the data given the model (at its best-fit parameters $\hat{\theta}$), and $C(M)$ is a penalty for model complexity (aka the Occam factor).
Matti asked me last week what the rationale behind choosing $C(M)$ is. I'll write about the two widely used forms (Akaike Information Criterion (AIC) this time, and Bayesian Information Criterion (BIC) the next), but I don't really know the answer to the weights of $\log P(D \mid \hat{\theta}, M)$ compared to $C(M)$ either :)
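To make the "fit minus penalty" form concrete, here is a minimal sketch in Python. It compares a fair-coin model (no free parameters) with a biased-coin model (one free parameter), and assumes the penalty is simply the number of parameters, which matches AIC up to the conventional factor of -2. The function names and the coin example are made up for illustration and are not from the Schadt paper.

```python
import math

def log_likelihood_bernoulli(heads, tails, p):
    """Log-probability of one particular heads/tails sequence under bias p."""
    ll = 0.0
    if heads:  # skip the term when the count is 0, so we never evaluate 0 * log(0)
        ll += heads * math.log(p)
    if tails:
        ll += tails * math.log(1.0 - p)
    return ll

def information_criterion(log_lik, n_params):
    """Fit minus penalty; the penalty here is assumed to be the parameter count."""
    return log_lik - n_params

def compare_coin_models(heads, tails):
    """Score a fair coin (0 parameters) against a biased coin (1 parameter)."""
    p_hat = heads / (heads + tails)  # maximum-likelihood estimate of the bias
    return {
        "fair": information_criterion(log_likelihood_bernoulli(heads, tails, 0.5), 0),
        "biased": information_criterion(log_likelihood_bernoulli(heads, tails, p_hat), 1),
    }

if __name__ == "__main__":
    print(compare_coin_models(52, 48))  # nearly balanced counts
    print(compare_coin_models(80, 20))  # strongly skewed counts
```

With nearly balanced counts the extra parameter does not buy enough fit to pay for its penalty, so the fair coin scores higher; with strongly skewed counts the biased coin wins. Higher is better in this sign convention.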
Here's the intuition for AIC (the idea is presented in both the MacKay and Bishop books): we have a set of models $\{M_i\}$, and want to select the one with the highest probability $P(M_i \mid D)$ after seeing the data $D$. This is given by
$$P(M_i \mid D) = \frac{P(D \mid M_i)\, P(M_i)}{P(D)}$$
If we take the prior over models $P(M_i)$ to be uniform, we just need to evaluate the evidence $P(D \mid M_i)$ for each model.
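As a small illustration of this step, the sketch below turns a list of log evidences into posterior probabilities over models. The function name `model_posterior` and the example numbers are hypothetical; the only assumption is that log evidences are already available, and that a missing prior means a uniform one.

```python
import math

def model_posterior(log_evidences, log_priors=None):
    """Posterior over models from log evidences via Bayes' rule."""
    if log_priors is None:
        # Uniform prior over models: any constant works, it cancels on normalizing
        log_priors = [0.0] * len(log_evidences)
    log_joint = [e + p for e, p in zip(log_evidences, log_priors)]
    # Log-sum-exp trick: subtract the maximum before exponentiating to avoid underflow
    m = max(log_joint)
    log_norm = m + math.log(sum(math.exp(v - m) for v in log_joint))
    return [math.exp(v - log_norm) for v in log_joint]

if __name__ == "__main__":
    # Made-up log evidences for three candidate models
    print(model_posterior([-10.0, -12.0, -11.0]))
```

Under the uniform prior the ranking of models is the same as the ranking of their evidences, which is why only the evidence needs to be computed.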
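Pick one model $M_i$, and say it has $K$ tunable parameters. Select one of them, $\theta$, and let's assume its prior distribution is flat with width $\sigma_{\text{prior}}$, so that $P(\theta \mid M_i) = 1/\sigma_{\text{prior}}$. We have
$$P(D \mid M_i) = \int P(D \mid \theta, M_i)\, P(\theta \mid M_i)\, d\theta$$
Suppose $P(D \mid \theta, M_i)$ is sharply peaked around $\hat{\theta}$, and the width of the peak is $\sigma_{\text{post}}$. Then the probability of $D$ will come almost entirely from that region, and the integral $\int P(D \mid \theta, M_i)\, d\theta$ is given by $P(D \mid \hat{\theta}, M_i)\, \sigma_{\text{post}}$, since the integrand drops to 0 outside the peak. Combining that with the prior, and taking the log we get
$$\log P(D \mid M_i) \approx \log P(D \mid \hat{\theta}, M_i) + \log \frac{\sigma_{\text{post}}}{\sigma_{\text{prior}}}$$
Now repeating the same argument over all $K$ parameters (taking the integral over each one, and assuming every parameter has the same ratio of widths), we get
$$\log P(D \mid M_i) \approx \log P(D \mid \hat{\theta}, M_i) + K \log \frac{\sigma_{\text{post}}}{\sigma_{\text{prior}}}$$
with $\hat{\theta}$ now denoting the best-fit values of all the parameters.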
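The first part of the expression is the fit of the model to the data, and the second part is a penalty that is linear in the number of parameters, scaled by the log of the fold-difference between the sizes of the prior and posterior parameter spaces. Since the posterior is narrower than the prior, this term is negative, so each extra parameter lowers the evidence.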
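The peak-width argument can be checked numerically on a toy case. The sketch below assumes a one-parameter model where the data are Gaussian with unknown mean and unit variance, with a flat prior on the mean centered at 0. It compares the evidence from brute-force integration with "best fit times posterior width over prior width". The data values, the function names, and the choice of $\sqrt{2\pi/n}$ as the effective width of the Gaussian-shaped likelihood peak are assumptions made for this illustration.

```python
import math

def log_lik(data, theta):
    """Log-likelihood of the data under a Gaussian with mean theta and unit variance."""
    return sum(-0.5 * math.log(2 * math.pi) - 0.5 * (x - theta) ** 2 for x in data)

def log_evidence_numeric(data, prior_width, n_grid=20001):
    """Log evidence under a flat prior on [-w/2, w/2], by the trapezoid rule."""
    lo = -prior_width / 2
    step = prior_width / (n_grid - 1)
    lls = [log_lik(data, lo + i * step) for i in range(n_grid)]
    m = max(lls)  # factor out the peak height so the exponentials do not underflow
    w = [math.exp(l - m) for l in lls]
    integral = step * (sum(w) - 0.5 * (w[0] + w[-1]))
    return m + math.log(integral) - math.log(prior_width)

def log_evidence_occam(data, prior_width):
    """Best-fit log-likelihood plus log(posterior width / prior width)."""
    n = len(data)
    theta_hat = sum(data) / n  # maximum-likelihood estimate of the mean
    sigma_post = math.sqrt(2 * math.pi / n)  # assumed effective width of the peak
    return log_lik(data, theta_hat) + math.log(sigma_post / prior_width)

if __name__ == "__main__":
    data = [0.8, 1.3, 0.9, 1.1, 1.4, 0.7, 1.0, 1.2]  # made-up observations
    for width in (20.0, 40.0):
        print(width, log_evidence_numeric(data, width), log_evidence_occam(data, width))
```

In this toy case the two numbers agree closely because the likelihood peak is exactly Gaussian and sits well inside the prior range. Doubling the prior width lowers the log evidence by $\log 2$ without changing the fit, which is the penalty term at work.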
Link to paper, please
Hey Leo, can you add a link to the Schadt et al paper, please?