[mkgmap-dev] splitter - performance and memory
From Chris Miller chris.miller at kbcfp.com on Sun Jul 19 19:11:31 BST 2009
First off, let me say a quick thanks to everyone involved in this project. I only discovered it recently, but I wish I had found it much earlier; it really is incredible to see what has been done so far.

I've downloaded the complete planet-090715.osm.bz2 file and have been looking at splitting it. I read the description and limitations of the splitter.jar tool but decided to give it a go anyway, since I have a 64-bit OS with 6GB of RAM. Unfortunately it still failed with a -Xmx5200m heap. I have a 16GB machine at work that might manage it, but instead I decided to take a look at the source code to see if there's any possibility of reducing the memory requirements.

I've only spent a short time looking at the code, but as far as I can tell the whole first step (computing the areas.list file) uses far more memory than it actually needs. The SplitIntMap (which is what takes up all the memory) isn't required here, for two reasons. First, the code never retrieves entries via .get(); it only uses an iterator, so a list/array would suffice. Second, the node IDs aren't used in this stage, so we don't even need to parse them, let alone hold on to them. If we replace the SplitIntMap with a wrapper around an int[] (or multiple int[]s, to mitigate the double-memory-on-copy problem), we'd be looking at memory savings of more than 50%. Does that make sense, or have I missed something? If it sounds sensible I'd be happy to have a go at implementing it; a rough sketch of the kind of wrapper I mean is at the end of this mail.

Also, given the nature of the algorithm, it wouldn't hurt performance much if the lat/lon values were written out to disk rather than held in memory, which would make splitting the whole dataset feasible even on a 32-bit machine (second sketch below).

I haven't yet looked at possibilities for tuning the second step, but I assume some sort of map/lookup is still required there. I figure there are a few options: perform multiple passes, processing a subset of the splits at a time (limited by the total number of nodes we can hold in memory); optimise the existing data structures further; page some of the data out to disk; etc.

I was also thinking a little about performance. Given the enormous size of the full .osm file, I'd suggest moving away from SAX to a pull parser (http://www.extreme.indiana.edu/xgws/xsoap/xpp/mxp1/index.html). It's even faster than SAX and uses very little memory. In my job we use it to parse many GB of XML daily, with very good results. The third sketch below shows roughly what the parsing loop would look like.

Another idea is to parallelise the code by running parts of the split on different threads, to take advantage of multi-core CPUs. Possibly the best gain would be when writing the output files, since gzip compression is fairly CPU intensive (last sketch below).

What do people think? I'm happy to work on the above, though I must confess up front that my spare time is quite limited, so please don't expect too much too soon!

Chris
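To illustrate the first idea, here's a rough, untested sketch of the sort of int[] wrapper I have in mind (the names are made up; this isn't code from the splitter). It grows by allocating fixed-size chunks on demand, so unlike a single array it never has to copy its contents (and briefly hold double the memory) when it grows. The first pass would keep two of these, one for lats and one for lons, each coordinate converted to a fixed-point int:

    import java.util.ArrayList;
    import java.util.List;

    public class ChunkedIntList {
        private static final int CHUNK_SIZE = 1 << 20; // 1M ints (4MB) per chunk
        private final List<int[]> chunks = new ArrayList<int[]>();
        private int size = 0;

        public void add(int value) {
            int chunk = size / CHUNK_SIZE;
            if (chunk == chunks.size()) {
                // grow by one chunk - no copying of existing data
                chunks.add(new int[CHUNK_SIZE]);
            }
            chunks.get(chunk)[size % CHUNK_SIZE] = value;
            size++;
        }

        public int get(int index) {
            return chunks.get(index / CHUNK_SIZE)[index % CHUNK_SIZE];
        }

        public int size() {
            return size;
        }
    }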
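For the disk-based variant, something along these lines should do (again purely illustrative; how it would hook into the existing code is an open question). The coordinates get streamed out as fixed-width ints during parsing and replayed sequentially whenever the area computation needs another pass over them:

    import java.io.*;

    public class CoordSpill {
        // append one node's coordinates to the spill file
        public static void write(DataOutputStream out, int lat, int lon)
                throws IOException {
            out.writeInt(lat);
            out.writeInt(lon);
        }

        // stream every coordinate pair back in file order
        public static void replay(File file, CoordSink sink) throws IOException {
            DataInputStream in = new DataInputStream(
                    new BufferedInputStream(new FileInputStream(file)));
            try {
                while (true) {
                    sink.coord(in.readInt(), in.readInt());
                }
            } catch (EOFException e) {
                // reached the end of the spill file - we're done
            } finally {
                in.close();
            }
        }

        public interface CoordSink {
            void coord(int lat, int lon);
        }
    }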
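The parsing loop with XPP would look roughly like the following. This uses the standard XmlPullParser interface (assuming the xpp3/mxp1 jar is on the classpath); note how in the first pass we'd only pull out the lat/lon attributes and skip everything else, including the node IDs:

    import java.io.IOException;
    import java.io.Reader;
    import org.xmlpull.v1.XmlPullParser;
    import org.xmlpull.v1.XmlPullParserException;
    import org.xmlpull.v1.XmlPullParserFactory;

    public class OsmPullReader {
        public void parse(Reader in) throws XmlPullParserException, IOException {
            XmlPullParser parser = XmlPullParserFactory.newInstance().newPullParser();
            parser.setInput(in);
            for (int event = parser.getEventType();
                     event != XmlPullParser.END_DOCUMENT;
                     event = parser.next()) {
                if (event == XmlPullParser.START_TAG
                        && "node".equals(parser.getName())) {
                    // first pass: the coordinates are all we need
                    double lat = Double.parseDouble(parser.getAttributeValue(null, "lat"));
                    double lon = Double.parseDouble(parser.getAttributeValue(null, "lon"));
                    // ... feed lat/lon into the area computation
                }
            }
        }
    }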
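Finally, a sketch of how the gzip writing could be farmed out to a thread pool so several areas compress in parallel (pool size, names, and the byte[] interface are all just for illustration):

    import java.io.*;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.TimeUnit;
    import java.util.zip.GZIPOutputStream;

    public class ParallelGzipWriter {
        private final ExecutorService pool =
                Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors());

        public void writeArea(final File file, final byte[] osmData) {
            pool.execute(new Runnable() {
                public void run() {
                    try {
                        OutputStream out = new GZIPOutputStream(
                                new BufferedOutputStream(new FileOutputStream(file)));
                        try {
                            // the CPU-heavy compression happens on this worker thread
                            out.write(osmData);
                        } finally {
                            out.close();
                        }
                    } catch (IOException e) {
                        throw new RuntimeException("failed writing " + file, e);
                    }
                }
            });
        }

        public void finish() throws InterruptedException {
            pool.shutdown();
            pool.awaitTermination(Long.MAX_VALUE, TimeUnit.SECONDS);
        }
    }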