We blog the daily life of graduate students.

Making LSF job submitting nicer

I use bsub often enough to hate specifying the -R,-o,-n and -q switches and stdout and stderr capturing every time. I'm sharing here a little goodie with you that takes away some of the trouble: nsub (nice sub)!

#!/bin/bash
 
if [ $# -lt 3 ]; then
  echo "USAGE: nsub jobname cmd"
fi
 
bsub -o"${1}.bsub.out" \
-J"${1}" \
-qlong \
-R"$RES" \
-n4 \
"${2} ${@:3} > ${1}.log 2> ${1}.err"

You also need to specify default resource usage in an environment variable $RES (put this in your .profile if you're using bash):

RES="select[type==X86_64 && \
ncpus>=4 && mem>= 700] \
span[hosts=1] \
rusage[mem=700]"

Here's an example showing how to use it:

for f in `ls *.bzip.fasta`; do 
nsub nminfer "$f" -numMotifs 1 \
-minLength 3 -maxLength 15 \
-backgroundModel human.100bg.500seq.msg.xml \
-seqs "$f" \
-out "${f}.xms"; done

This outputs the following files without me having to worry about output or error log naming or the bsub output file naming:

ls -l *.err *.log *.bsub.out
-rw-r--r--  1 mp4 uucp   4483 May 16 14:35 human.100bp.500seq.0.05spike.bzip.fasta.bsub.out
-rw-r--r--  1 mp4 uucp      0 May 16 14:19 human.100bp.500seq.0.05spike.bzip.fasta.err
-rw-r--r--  1 mp4 uucp 140085 May 16 14:35 human.100bp.500seq.0.05spike.bzip.fasta.log
-rw-r--r--  1 mp4 uucp   4466 May 16 14:35 human.100bp.500seq.0.1spike.bzip.fasta.bsub.out
-rw-r--r--  1 mp4 uucp      0 May 16 14:19 human.100bp.500seq.0.1spike.bzip.fasta.err
-rw-r--r--  1 mp4 uucp 143839 May 16 14:35 human.100bp.500seq.0.1spike.bzip.fasta.log
-rw-r--r--  1 mp4 uucp   4466 May 16 14:35 human.100bp.500seq.0.2spike.bzip.fasta.bsub.out
-rw-r--r--  1 mp4 uucp      0 May 16 14:19 human.100bp.500seq.0.2spike.bzip.fasta.err
-rw-r--r--  1 mp4 uucp 141419 May 16 14:35 human.100bp.500seq.0.2spike.bzip.fasta.log
-rw-r--r--  1 mp4 uucp   4466 May 16 14:35 human.100bp.500seq.0.3spike.bzip.fasta.bsub.out
-rw-r--r--  1 mp4 uucp      0 May 16 14:19 human.100bp.500seq.0.3spike.bzip.fasta.err
-rw-r--r--  1 mp4 uucp 138307 May 16 14:35 human.100bp.500seq.0.3spike.bzip.fasta.log
-rw-r--r--  1 mp4 uucp   4466 May 16 14:35 human.100bp.500seq.0.4spike.bzip.fasta.bsub.out
-rw-r--r--  1 mp4 uucp      0 May 16 14:19 human.100bp.500seq.0.4spike.bzip.fasta.err
-rw-r--r--  1 mp4 uucp 141111 May 16 14:35 human.100bp.500seq.0.4spike.bzip.fasta.log
-rw-r--r--  1 mp4 uucp   4466 May 16 14:35 human.100bp.500seq.0.5spike.bzip.fasta.bsub.out
-rw-r--r--  1 mp4 uucp      0 May 16 14:19 human.100bp.500seq.0.5spike.bzip.fasta.err
-rw-r--r--  1 mp4 uucp 138433 May 16 14:35 human.100bp.500seq.0.5spike.bzip.fasta.log

Jetpacks, enzymes and jetpacks!

You spend years waiting for someone to invent a jetpack and then this comes along!

http://news.bbc.co.uk/1/hi/world/7402016.stm

In other news, this is an awesome paper in the current issue of Nature. An enzyme to catalyse a non-biological reaction designed computationally, synthesised and then fine tuned by directed evolution. I'm not actually sure if this is the first example, but it's the first I've heard of. These guys must have some big brains, and some serious computers.

Reactions in synthetic organic chemistry such as the one in this paper (removal of a proton from a carbon atom and its effective transfer to another part of the molecule) are generally pretty tricky - in my hands at least! This is reflected by the fact that a lot of reaction schemes are named after the guy who worked it out, presumably in recognition of their creativity; the one in the paper is called the Kemp elimination. Considering how to accelerate a reaction (or make it possible at all) involves thinking about the transition state(s) - a transient state (TS) somewhere between products and reactants. This will often have a different shape or charge distribution. Enzymes contain active sites which fit, bind and stabilise the TS, which favours progress of the reaction.

Unfortunately, nature is only really interested in making things like amino acids, nucleotides and lipids rather than oil, plastic, cars, money etc., so it hasn't designed (via evolution) any enzymes to do these things. It should be possible to design enzymes to do a wider range of reactions by considering the shape of the TS for the relevant reaction and the properties of an active site that would fit it, but this requires big computers to do the search, as well as a good understanding of the possible shapes proteins can take.

I guess we're getting there, which to me is astonishing. These guys designed their active site taking into account shape, charge, and even positioning a side chain to change the pKa of one of the other bases in the active site. Once they worked out protein backbone sequences (in 3D) that would support this site, they synthesised corresponding genes, whacked them into bacteria and got functional enzymes out. Then they randomly mutagenise their initial genes and look for increased activity to get an even better enzyme, with activity similar to conventional organic catalysis. Ideally enzymes should outperform organic catalysts and I'd hope to see that in future, but it's early days - very exciting. It even involves computers ;) But not jetpacks.

Paper: Röthlisberger et al (2008) Kemp elimination catalysis by computational enzyme design. Nature 453:190-5
http://www.nature.com/nature/journal/v453/n7192/full/nature06879.html

Papers

Speaking of useful tools for scientists, one that I really have learnt to like is Papers. Papers is what I would call the iTunes of bibliography management software; It's not the most feature rich of them all but is simple and very intuitive to use and when you have to deal with collecting papers without it you tend to feel a bit lost.

Two of the particularly great features about it are how it handles importing and organising the papers. Importing is either done through dragging & dropping a PDF or a citation file into the application (finds title, authors etc automatically from either) or from various search and import filters (e.g. PubMed, Connotea). It also knows where to download papers in case you import them from, say, PubMed.

When it comes to organising papers, you handle them like songs in iTunes by throwing them in collections and automatic collections (think playlists and automatic playlists in iTunes).

All in all this wonderful (and cheap) application makes reading papers just that little bit easier and more fun!

Biocompass: get your bioinformatics direction

When I was at the predoc course in Heidelberg I saw that a lot of my friends, real experimental biologists, where a little bit confused in which tool use to do a bioinformatic job and where to grab it.

On top of that there was another problem that was related to the big number of bioinformatics software available out there and how to make a choice between them.

So I come up with the idea of the biocompass that I will try to explain.

Biocompass will be a web-portal where scientists can ask question about specific bioinformatics topic and other user can try to give an answer, giving the idea where it's worth to look.

Let me try to explain it with some examples:

Sarah is an experimental biologist and she has a sequence of a DNA. She want to know if this sequence is a gene and, in that case, if it function is known. However she doesn't know how to do it.

She comes to biocompass web-portal and post the question over there. The systems checks if there are similar questions already asked, if they are the system is going to show the first 5 hits that are similar to the problem proposed by Sarah.

If Sarah is satisfied with it, she will accept the answer and the system will store original question into the database for further reference and to increase the precision of the system.

In the other case Sarah decide to post the question to the attention of the other people subscribed to the portal.

The system will ask her to categorize the question into one of the available categories, like for example genetic, molecular dynamic, philogenetic, simulation, ...

She will be guided through an easy but systematica process to end up with a really well organized entry.

The system will propose a set of tags already present in the database to atick them to the question, to have a more fine-grained search later on.

Now comes another important innovation about biocompass, the filtering. Nowadays there is too much information that cannot be processed in a reasonable amount of time. That's why we need system to filter and select only the information that we really want to know.

Andrea is a bioinformatician working on genetic alignment. She has subscribed to biocompass and she decided to follow the genetic category only. This means that she is going to receive update only when there is a new question in the genetic category, ignoring all the others; they are not is field and she cannot be helpful over there.

Andrea sees the new question coming. She is following the new questions with a RSS feed that she prefer over the mail system.

She knows the answer. A BLAST search can give an hint about the DNA sequence. She suggest to use BLAST as a starting point for the research, linking it to the BLAST page entry.

This is an internal page in biocompass, where there is a small description about the software, the link to the ufficial website, and the number of user that have found the tool useful or not to solve this problem.

The rating is given by other user that has found the software good to solve this kind of problem, giving an idea how good and useful is the software itself.

The system shows also a bunch of related softwares that can be useful to solve this kind of question (ClustulW, BLAT, ...)

Sarah gets the update from biocompass with the answer coming from Andrea. She now can start the investigation about her DNA sequence.

Here we are. Biocompass has act as a real compass, to give the direction where to start. This time the needle was pointing to BLAST direction, next time is going to point somewhere else.

Squeezing liposomes

Time for some biology! I just saw this paper which deals with a subject I enjoy reading about:

DOI (still in advance publication): 10.1126/science.1154520

The process of dividing a cell involves generation of various forces - the chromosomes have to be positioned correctly so that they segregate evenly, as well as be pulled towards the daughter cells later on. The membrane also has to contract and separate down the middle to complete the division.

Experiments that investigate these forces are some of the most elegant in molecular biology. A good example is the proof that chromosomes must be under tension via microtubules binding at the kinetochores before they separate - this was shown by taking cells arrested due to one unattached kinetochore, sticking a tiny needle in and 'tugging' the chromosome in the right direction - the cells then undergo mitosis! (http://www.nature.com/nature/journal/v373/n6515/abs/373630a0.html)

This paper isn't in the same league technically, but it's still nice. They add one of the proteins involved in the actual division of the cell to some artificial lipid membrane tubes, and show that it spontaneously forms rings around the tube which contract when given GTP. I wonder how far they can take this system - I think it would be cool if they can get all the way to fission by adding a few more components. As they note at the end of the paper, this is probably how the earliest forms of life looked.

SJP

from the daily life....

blinding flash of the obvious - when doing permutation testing, it's enough to keep only the results that are more significant than the real ones, and only do permutations for as long as you are not sure the result is not significant.

Still need to do a lot of permutations for the significant findings, but for all else - let's just say that the users of the 'long' queue in the computing farm will be happy :)

Graduate student are enterpreneurial animals

A completed PhD project makes one a true expert on some well or not so well defined area of research. Several years of hard, independent work is eventually tested by writing up a bulky thesis and defending it in a viva. This is demanding and fun hard core education. Graduate life however offers many more exciting things to try and skills to learn.

I just discovered one untapped source of fun and challenge by attending the Enterprisers 2008 programme organised by the Cambridge-MIT Institute in Cambridge at the Judge Business School.

In short, Enterprisers is an intense 4-day course that teaches enterpreneurial thinking, team working skills, and gives plenty of chances for networking (as in drinking beer) with other students and new enterpreneurs. This year it happened to be run in Cambridge, but is unlikely to return here in the near future.

By attending the course you learn among other things answers to questions like "What role do you take naturally when presented with a hard task in an unknown group of people?" or "How do you adapt your role when you find out more about the group?" You also learn techniques to force creative thinking out of a group, and of course meet a whole lot of fun and smart people working on anything from philosophy to engineering to particle physics. You might also get some grasp of how to start building your own hazy business idea (surely everybody has one or two?) into something a little bit more realistic.

For those of you studying in Cambridge, I honestly recommend attending events and courses that the Cambridge-MIT Institute and the Centre for Enterpreneurial Learning organise. Among the things they run regularly are the termly i-Teams projects (student-run market research projects on new high-tech business ideas) and the Enterprise Tuesdays (a weekly lecture + networking event).

Vesicle coats

On to next chapter in Alberts et al..

Adidas clearly stole their 'Fevernova' logo

from the clathrin triskelion

And football players (or Plato?) the truncated icosahedron ball shape

from the clathrin coat

Go Nature in beating humans in making pretty things :)

Model selection - Information criteria, part II

Now for the hardcore information criteria part :) The first one can be found here.

The goal is still the same - pick a model to maximize the log-likelihood of the data. This is given by , where is a -dimensional parameter vector. We can approximate the integral with a Laplace approximation, which is similar in idea to the previous post - the probability mass will be centered around the mode of the distribution. We can fit a normal distribution with the mode as mean, and variance approximated from Taylor expansion at the mode. Next 2 paragraphs can be skipped if you believe this :)

For example, to approximate a function that has a mode (and thus a local maximum) at , we use the 2nd order Taylor:

(the first order term is 0 because of the local maximum)

Taking as the negative of the second derivative matrix, we get If we are looking for a probability distribution that is proportional to , we have as the mean, as the covariance matrix, and as the normalizing coefficient - voila!

So we can fit a Gaussian to a function - back to information criteria. We'll fit a Gaussian to at the mode (with the most likely parameter setting) :


As before, the first term is the fit of the model to the data. The rest of the terms are the complexity penalty. The second term is small (if we assume a wide prior), and the last term scales with - the main penalty comes from

To evaluate the determinant of the covariance matrix, we assume that it has full rank, and is due to iid data points. This means that is the sum of variances due to the data points, and since the data is iid, . So . Again, last term is constant, so all in all we have

To recap, we estimated the probability of the data under the model, using the Laplace approximation to fit a Gaussian for the log-likelihood, and used some simplifying assumptions to arrive at the final form.

The end result is pretty much the Bayesian Information Criterion, and it penalizes model complexity more than AIC. Note that the constants in front are not arbitrary, since we never made any simplifications for them, and there's a 2:1 ratio. This one was for you Matti :)

Ruby for bioinformatics I

Ruby is a great programming language for developing web applications with the Ruby on Rails framework. What most people might not have realised yet is that this simple object-oriented scripting language also has a great potential in bioinformatics.

I came up with the the following points to summarise the strengths of Ruby for bioinformaticians:

  1. Ruby is simple, elegant and easy. The description at ruby-lang.org puts it well:[Ruby is] a dynamic, open source programming language with a focus on simplicity and productivity. It has an elegant syntax that is natural to read and easy to write.
  2. There are lots of useful libraries available in a public repository at RubyForge (2978 and counting). Bioinformatics geeks might for example like BioRuby (self-explanatory), Ensembl(the Ensembl Core API for Ruby), RSRuby (R-Ruby bridge) and of course ActiveRecord (a powerful database access and persistence API that's part of Ruby on Rails).
  3. It's dead-easy to install both the interpreter and extra libraries in both Windows, Linux and OS X. This is because Ruby has a nice package manager called RubyGems. Unlike the hair loss, heart burn and depression inducing CPAN in the Perl world, RubyGems actually Just Works and handles dependencies and uninstalling.
  4. Getting started is easy because there are excellent learning resources available online. I personally got started using the online tutorial Try Ruby!, then went on to read the online edition of Programming Ruby, after which googling has been enough to solve most of my Ruby issues.

Below is a quick taster of using the Ensembl API in Ruby. The example is grabbed from the Ensembl tutorial included with the Ruby Ensembl API docs.

require 'ensembl-api'
include Ensembl::Core
 
slice = Slice.fetch_by_region('chromosome','X',1000000,10000000)
puts slice.display_name
 
slice.genes.each do |gene|
  puts "\t" + gene.stable_id
 
  gene.transcripts.each do |transcript|
    puts "\t\t" + transcript.stable_id
 
    transcript.exons.each do |exon|
      puts "\t\t\t" + exon.id.to_s
  end
end