Leo's blog

CCBI symposium

Today was the Cambridge Computational Biology Institute Annual Symposium. A day of talks by investigators across Cambridge, it gave a nice slice of quite different kinds of research being done here.

The highlight of the day for me was Duncan Odom's talk on conservation of transcriptional regulation. He talked about older work of conservation of regulation between human and mouse, about more recent work on a mouse with a human chromosome, as well as plans to extend the comparative approach to a range of mammals - cool stuff.

Also, Sarah Teichmann talked about evolution of homo- and heterodimers, and Máté Lengyel discussed computational models of learning (Bayesian model selection and Markov Decision Process, and showed experiments comparing them to human learning.

from the daily life....

blinding flash of the obvious - when doing permutation testing, it's enough to keep only the results that are more significant than the real ones, and only do permutations for as long as you are not sure the result is not significant.

Still need to do a lot of permutations for the significant findings, but for all else - let's just say that the users of the 'long' queue in the computing farm will be happy :)

Vesicle coats

On to next chapter in Alberts et al..

Adidas clearly stole their 'Fevernova' logo

from the clathrin triskelion

And football players (or Plato?) the truncated icosahedron ball shape

from the clathrin coat

Go Nature in beating humans in making pretty things :)

Model selection - Information criteria, part II

Now for the hardcore information criteria part :) The first one can be found here.

The goal is still the same - pick a model to maximize the log-likelihood of the data. This is given by , where is a -dimensional parameter vector. We can approximate the integral with a Laplace approximation, which is similar in idea to the previous post - the probability mass will be centered around the mode of the distribution. We can fit a normal distribution with the mode as mean, and variance approximated from Taylor expansion at the mode. Next 2 paragraphs can be skipped if you believe this :)

For example, to approximate a function that has a mode (and thus a local maximum) at , we use the 2nd order Taylor:

(the first order term is 0 because of the local maximum)

Taking as the negative of the second derivative matrix, we get If we are looking for a probability distribution that is proportional to , we have as the mean, as the covariance matrix, and as the normalizing coefficient - voila!

So we can fit a Gaussian to a function - back to information criteria. We'll fit a Gaussian to at the mode (with the most likely parameter setting) :


As before, the first term is the fit of the model to the data. The rest of the terms are the complexity penalty. The second term is small (if we assume a wide prior), and the last term scales with - the main penalty comes from

To evaluate the determinant of the covariance matrix, we assume that it has full rank, and is due to iid data points. This means that is the sum of variances due to the data points, and since the data is iid, . So . Again, last term is constant, so all in all we have

To recap, we estimated the probability of the data under the model, using the Laplace approximation to fit a Gaussian for the log-likelihood, and used some simplifying assumptions to arrive at the final form.

The end result is pretty much the Bayesian Information Criterion, and it penalizes model complexity more than AIC. Note that the constants in front are not arbitrary, since we never made any simplifications for them, and there's a 2:1 ratio. This one was for you Matti :)

Iterative parameter finding

I should be getting ready for my viva, but instead, I reread this cool bit on how to find the best parameters for a model.

You have a model with parameters , and you want to fit the parameters to the data . For the maximum log-likelihood, you would find the zero of the derivative . But suppose that is nontrivial. Then you can use the Newton-Rhapson method for iterative parameter finding.

Using the multivariate Taylor series, we can update the initial guess for the best parameters.

small terms

This gives

This iterative procedure is nice because the functions involved often have simple forms (e.g in logistic regression, is a combination of inputs that give classification errors), and it gives a solution in cases that are not analytically tractable.

Selecting a model for the data - information criteria, part I

We covered a paper by Schadt and others last week that dealt with selecting the best model for the data. We want to select the simplest model that explains the most data, and there is a tradeoff between model fit and complexity.

Intuitively it's obvious that there must be a tradeoff - too simple models can't explain all the data, and there are way too many complicated models to accurately choose the right one. In practice, people use various information criteria to formalize the tradeoff, usually of the form
where is the information criterion, is the probability of the data given the model, and is a penalty for model complexity, (aka the Occam factor).

Matti asked me last week what the rationale behind choosing is. I'll write about the 2 widely used forms (Akaike Information Criterion (AIC) this time, and Bayesian Information Criterion (BIC) the next), but don't really know the answer to the weights of compared to either :)

Here's the intuition for AIC (the idea is presented in both MacKay and Bishop books): we have a set of models , and want to select the one with highest probability after seeing the data . This is given by If we take prior over models to be uniform, we just need to evaluate the evidence for each model.

Pick one model , and say it has tunable parameters. Select one of them, , and let's assume it's prior distribution is flat with width

We have

Suppose is sharply peaked around , and the width of the peak is Then the probability of will be all from that region, and given by , since the integral drops to 0 outside the peak. Combining that with the prior, and taking the log we get

Now repeating the similar argument over all parameters (taking integrals), we get

The first part of the expression is the fit of the model of the data, and the second part is a linear penalty of the number of parameters, scaled by the log of the fold-difference between the size of the prior and posterior parameter space.

The Gaussian

I did not go deep enough in my math studies as an undergrad to get a glimpse of calculus of variations (link to notes from cam math dept, chapter 8). It looks fascinating, and can be used to find functions that are minimal w.r.t some constraints.

For example, calculus of variations can be used to show why the Gaussian is interesting. It's a limiting distribution of several families, sum of IID variables is approximately Gaussian - but it is also a distribution that conveys our ignorance of the data. Given some unknown distribution has mean and variance , normal distribution is the one that conveys the least extra information, i.e, it has the maximum entropy among all possible .

The way to show it involves calculus of variations and Lagrange multipliers. We encode the 3 conditions of the distribution function (integrates to 1, mean and variance given), and combine it with the entropy in the Lagrangian:

Now differentiating with respect to (differentiating w.r.t a function! :), and finding the maximum, we get