I have a sequence analysis application which has to keep several thousand boolean matrices of dimensions up to ~ 200x2000 in memory. These matrices contain majority of the memory footprint of the application and essentially limits its use with very large sequence sets and with estimating more complicated models from those sequences.
Our current boolean matrix implementation is a boolean array accessed through a simple 2D matrix API. However, there's a large memory overhead in this because of booleans taking one byte in Java (according to the JVM specification) when in arrays (as attributes/variables they take 4 bytes). In other words almost 7/8 of memory stored with these matrices is wasted. How does one save it?
Well, as Leo helpfully pointed out to me yesterday, you can store the individual bits as part of integers and read them using bit shifting and mod 2. Brilliant! So I made an implementation of this idea yesterday evening and from a quick and dirty test it looks as though I even got it right on the first time!
Note that you can also do this with bytes or shorts but I decided to just use integers because then I don't have to worry about casting the indices in the get and set operations. The extra bits wasted in the end of the last 32 bits of the int array is not a major issue in my use case either (negligible memory waste when compared to wasting that 7/8 of all bytes spent).
Enjoy the source code:
public class BitMatrix2D implements Serializable { private static final long serialVersionUID = 29187494607829404L; private final int rows; private final int columns; private int[] data; public BitMatrix2D(int rows, int columns) { this.rows = rows; this.columns = columns; /* x >> 5 = x / 32, only faster */ this.data = new int[Math.max(1,(rows * columns + 1)>> 5)]; } public int rows() { return rows; } public int columns() { return columns; } public boolean get(int row, int col) { if (row < 0 || row >= rows) throw new IndexOutOfBoundsException( "Row index out of bounds:" + row); else if (col < 0 || col >= columns) throw new IndexOutOfBoundsException( "Column index out of bounds:" + col); int i = row * columns + col; return ( (data[i >> 5] >> (i % 32)) & 1 ) != 0; } public void set(int row, int col, boolean v) { if (row < 0 || row >= rows) throw new IndexOutOfBoundsException( "Row index out of bounds:" + row); else if (col < 0 || col >= columns) throw new IndexOutOfBoundsException( "Column index out of bounds:" + col); int i = row * columns + col; int idiv32 = i >> 5; int modBit = 1 << (i % 32); data[idiv32] = v ? data[idiv32] | modBit : data[idiv32] & ~modBit; } }
sleepy brain dump
I don't understand the details of the code completely, but I always think it's cool when you can get round problems at a low level like this. Nothing beats a good understanding of how things work at the most basic level, no matter what you're doing.
Out of interest, why does it take a byte when stored in an array? Is it an actual technical reason or just a peculiarity of Java? Is there a reason you can't just store a 1 or 0 instead of whatever boolean thing? Sorry if I've got the wrong end of the stick.
I'm interested too that this was so much of a problem as to actually limit your application - think my programs are seriously inefficient so here's hoping I never run into problems like this or I'm going to be up a certain foul-smelling creek....
/should go to bed
Musings over for now
Steve
Why booleans take a byte
It has to do with how memory is accessed in most machines -- i.e. in words larger than one bit (word sizes are always multiples of 2). So basically, more than one bit is read at a time on most CPU architechtures, and this includes all major modern CPU platforms as far as I understand.
The Java language specification doesn't actually specify these 1 byte and 4 byte sizes for booleans when used in arrays and attributes, respectively. It is the Java Virtual Machine specification which defines these sizes (Java is compiled to a so-called byte code which gets run by the virtual machine and compiled to the platform-specific low-level machine readable code) during runtime. Effectively nothing stops someone from writing a JVM implementation with 1 bit long booleans, but it's going to have to use a similar bit shifting based strategy for accessing those booleans if it's run on, say, Intel x86 processors.
A question I'd like to know myself is why booleans take a whole 4 bytes when defined as attributes but 1 byte when in arrays??