Matias's blog

The power of... poop?

This entry in the Science magazine news blog outlines a pretty strange motive for someone to go for a career in science: The Power of Poop.

[...] For marine scientist Ellen Prager (left), the moment came 25 years ago when she took a summer job helping researchers who spent a lot of time swimming behind parrotfish with bags in hand to collect poop as it plopped out of their bottoms. "If that was science, I thought to myself, I could do it too," Prager told the audience.

Bit matrix

I have a sequence analysis application which has to keep several thousand boolean matrices of dimensions up to ~ 200x2000 in memory. These matrices contain majority of the memory footprint of the application and essentially limits its use with very large sequence sets and with estimating more complicated models from those sequences.

Our current boolean matrix implementation is a boolean array accessed through a simple 2D matrix API. However, there's a large memory overhead in this because of booleans taking one byte in Java (according to the JVM specification) when in arrays (as attributes/variables they take 4 bytes). In other words almost 7/8 of memory stored with these matrices is wasted. How does one save it?

Well, as Leo helpfully pointed out to me yesterday, you can store the individual bits as part of integers and read them using bit shifting and mod 2. Brilliant! So I made an implementation of this idea yesterday evening and from a quick and dirty test it looks as though I even got it right on the first time!

Note that you can also do this with bytes or shorts but I decided to just use integers because then I don't have to worry about casting the indices in the get and set operations. The extra bits wasted in the end of the last 32 bits of the int array is not a major issue in my use case either (negligible memory waste when compared to wasting that 7/8 of all bytes spent).

Enjoy the source code:

public class BitMatrix2D implements Serializable {
    private static final long 
    serialVersionUID = 29187494607829404L;
 
    private final int rows;
    private final int columns;
    private int[] data;
 
    public BitMatrix2D(int rows, int columns) {
        this.rows = rows;
        this.columns = columns;
 
        /* x >> 5 = x / 32, only faster */
        this.data = new int[Math.max(1,(rows * columns + 1)>> 5)];
    }
 
    public int rows() {
        return rows;
    }
 
    public int columns() {
        return columns;
    }
 
    public boolean get(int row, int col) {
    	if (row < 0 || row >= rows)
            throw 
                new IndexOutOfBoundsException(
                "Row index out of bounds:" + row);
    	else if (col < 0 || col >= columns)
            throw 
                new IndexOutOfBoundsException(
                "Column index out of bounds:" + col);
 
    	int i = row * columns + col;
    	return ( (data[i >> 5] >> (i % 32)) & 1 ) != 0;
    }
 
    public void set(int row, int col, boolean v) {
    	if (row < 0 || row >= rows)
            throw 
                new IndexOutOfBoundsException(
                "Row index out of bounds:" + row);
    	else if (col < 0 || col >= columns)
            throw 
                new IndexOutOfBoundsException(
                "Column index out of bounds:" + col);
 
    	int i = row * columns + col;
    	int idiv32 = i >> 5;
    	int modBit = 1 << (i % 32);
    	data[idiv32] = v ? data[idiv32] | 
             modBit : data[idiv32] & ~modBit;
    }
}

Fragment database for protein backbone construction

Fragment based protein fold prediction has been the most successful strategy in the latest CASP7 protein fold assessment experiment. Methods relying on fragment libraries also show great promise in designing new protein folds as Steve pointed out with his "Jetpacks, Enzymes And Jetpacks!" post (the Nature paper with the computationally designed and then optimised fold). However, what's been missing is a good quality public fragment library for any protein fold predicting group to experiment with (the CASP winning lot are naturally not giving theirs out).

This week's PLoS Computational Biology describes exactly such a fragment library: "Reconstruction of Protein Backbones from the BriX Collection of Canonical Protein Fragments" by Baeten et al. This is exciting stuff and could well heat up the competition in CASP8!

Evolution Graffiti

Isn't this cool?

Making LSF job submitting nicer

I use bsub often enough to hate specifying the -R,-o,-n and -q switches and stdout and stderr capturing every time. I'm sharing here a little goodie with you that takes away some of the trouble: nsub (nice sub)!

#!/bin/bash
 
if [ $# -lt 3 ]; then
  echo "USAGE: nsub jobname cmd"
fi
 
bsub -o"${1}.bsub.out" \
-J"${1}" \
-qlong \
-R"$RES" \
-n4 \
"${2} ${@:3} > ${1}.log 2> ${1}.err"

You also need to specify default resource usage in an environment variable $RES (put this in your .profile if you're using bash):

RES="select[type==X86_64 && \
ncpus>=4 && mem>= 700] \
span[hosts=1] \
rusage[mem=700]"
export RES

Here's an example showing how to use it:

for f in `ls *.bzip.fasta`; do 
nsub nminfer "$f" -numMotifs 1 \
-minLength 3 -maxLength 15 \
-backgroundModel human.100bg.500seq.msg.xml \
-seqs "$f" \
-out "${f}.xms"; done

This outputs the following files without me having to worry about output or error log naming or the bsub output file naming:

ls -l *.err *.log *.bsub.out
-rw-r--r--  1 mp4 uucp   4483 May 16 14:35 human.100bp.500seq.0.05spike.bzip.fasta.bsub.out
-rw-r--r--  1 mp4 uucp      0 May 16 14:19 human.100bp.500seq.0.05spike.bzip.fasta.err
-rw-r--r--  1 mp4 uucp 140085 May 16 14:35 human.100bp.500seq.0.05spike.bzip.fasta.log
-rw-r--r--  1 mp4 uucp   4466 May 16 14:35 human.100bp.500seq.0.1spike.bzip.fasta.bsub.out
-rw-r--r--  1 mp4 uucp      0 May 16 14:19 human.100bp.500seq.0.1spike.bzip.fasta.err
-rw-r--r--  1 mp4 uucp 143839 May 16 14:35 human.100bp.500seq.0.1spike.bzip.fasta.log
-rw-r--r--  1 mp4 uucp   4466 May 16 14:35 human.100bp.500seq.0.2spike.bzip.fasta.bsub.out
-rw-r--r--  1 mp4 uucp      0 May 16 14:19 human.100bp.500seq.0.2spike.bzip.fasta.err
-rw-r--r--  1 mp4 uucp 141419 May 16 14:35 human.100bp.500seq.0.2spike.bzip.fasta.log
-rw-r--r--  1 mp4 uucp   4466 May 16 14:35 human.100bp.500seq.0.3spike.bzip.fasta.bsub.out
-rw-r--r--  1 mp4 uucp      0 May 16 14:19 human.100bp.500seq.0.3spike.bzip.fasta.err
-rw-r--r--  1 mp4 uucp 138307 May 16 14:35 human.100bp.500seq.0.3spike.bzip.fasta.log
-rw-r--r--  1 mp4 uucp   4466 May 16 14:35 human.100bp.500seq.0.4spike.bzip.fasta.bsub.out
-rw-r--r--  1 mp4 uucp      0 May 16 14:19 human.100bp.500seq.0.4spike.bzip.fasta.err
-rw-r--r--  1 mp4 uucp 141111 May 16 14:35 human.100bp.500seq.0.4spike.bzip.fasta.log
-rw-r--r--  1 mp4 uucp   4466 May 16 14:35 human.100bp.500seq.0.5spike.bzip.fasta.bsub.out
-rw-r--r--  1 mp4 uucp      0 May 16 14:19 human.100bp.500seq.0.5spike.bzip.fasta.err
-rw-r--r--  1 mp4 uucp 138433 May 16 14:35 human.100bp.500seq.0.5spike.bzip.fasta.log

Papers

Speaking of useful tools for scientists, one that I really have learnt to like is Papers. Papers is what I would call the iTunes of bibliography management software; It's not the most feature rich of them all but is simple and very intuitive to use and when you have to deal with collecting papers without it you tend to feel a bit lost.

Two of the particularly great features about it are how it handles importing and organising the papers. Importing is either done through dragging & dropping a PDF or a citation file into the application (finds title, authors etc automatically from either) or from various search and import filters (e.g. PubMed, Connotea). It also knows where to download papers in case you import them from, say, PubMed.

When it comes to organising papers, you handle them like songs in iTunes by throwing them in collections and automatic collections (think playlists and automatic playlists in iTunes).

All in all this wonderful (and cheap) application makes reading papers just that little bit easier and more fun!

Graduate student are enterpreneurial animals

A completed PhD project makes one a true expert on some well or not so well defined area of research. Several years of hard, independent work is eventually tested by writing up a bulky thesis and defending it in a viva. This is demanding and fun hard core education. Graduate life however offers many more exciting things to try and skills to learn.

I just discovered one untapped source of fun and challenge by attending the Enterprisers 2008 programme organised by the Cambridge-MIT Institute in Cambridge at the Judge Business School.

In short, Enterprisers is an intense 4-day course that teaches enterpreneurial thinking, team working skills, and gives plenty of chances for networking (as in drinking beer) with other students and new enterpreneurs. This year it happened to be run in Cambridge, but is unlikely to return here in the near future.

By attending the course you learn among other things answers to questions like "What role do you take naturally when presented with a hard task in an unknown group of people?" or "How do you adapt your role when you find out more about the group?" You also learn techniques to force creative thinking out of a group, and of course meet a whole lot of fun and smart people working on anything from philosophy to engineering to particle physics. You might also get some grasp of how to start building your own hazy business idea (surely everybody has one or two?) into something a little bit more realistic.

For those of you studying in Cambridge, I honestly recommend attending events and courses that the Cambridge-MIT Institute and the Centre for Enterpreneurial Learning organise. Among the things they run regularly are the termly i-Teams projects (student-run market research projects on new high-tech business ideas) and the Enterprise Tuesdays (a weekly lecture + networking event).

Ruby for bioinformatics I

Ruby is a great programming language for developing web applications with the Ruby on Rails framework. What most people might not have realised yet is that this simple object-oriented scripting language also has a great potential in bioinformatics.

I came up with the the following points to summarise the strengths of Ruby for bioinformaticians:

  1. Ruby is simple, elegant and easy. The description at ruby-lang.org puts it well:[Ruby is] a dynamic, open source programming language with a focus on simplicity and productivity. It has an elegant syntax that is natural to read and easy to write.
  2. There are lots of useful libraries available in a public repository at RubyForge (2978 and counting). Bioinformatics geeks might for example like BioRuby (self-explanatory), Ensembl(the Ensembl Core API for Ruby), RSRuby (R-Ruby bridge) and of course ActiveRecord (a powerful database access and persistence API that's part of Ruby on Rails).
  3. It's dead-easy to install both the interpreter and extra libraries in both Windows, Linux and OS X. This is because Ruby has a nice package manager called RubyGems. Unlike the hair loss, heart burn and depression inducing CPAN in the Perl world, RubyGems actually Just Works and handles dependencies and uninstalling.
  4. Getting started is easy because there are excellent learning resources available online. I personally got started using the online tutorial Try Ruby!, then went on to read the online edition of Programming Ruby, after which googling has been enough to solve most of my Ruby issues.

Below is a quick taster of using the Ensembl API in Ruby. The example is grabbed from the Ensembl tutorial included with the Ruby Ensembl API docs.

require 'ensembl-api'
include Ensembl::Core
 
slice = Slice.fetch_by_region('chromosome','X',1000000,10000000)
puts slice.display_name
 
slice.genes.each do |gene|
  puts "\t" + gene.stable_id
 
  gene.transcripts.each do |transcript|
    puts "\t\t" + transcript.stable_id
 
    transcript.exons.each do |exon|
      puts "\t\t\t" + exon.id.to_s
    end
  end
end

Bezier curves crack me up

I just recently came back from a pretty disappointing systems biology meeting. I had fun and had some interesting conversations and met great people, but the scientific content of the submitted talks was really very variable. So variable in fact that I'll blog later about the concerns that this meeting raised in my mind about systems biology. I want to start off with something a bit more outrageous though: a wtf?!-style slide that was shown during one of the talks there.

The explanation given for this graph went along the lines of "... The expression dynamics of the system are clearly very complex ..." What? Are you joking? This graph is obviously nothing more than an example of the 8th deadly sin of overfitting.

Question: Can comeone over there in the Interweb come up with a valid use case for Bezier curve fitting in biology? I bet you can't.