[mkgmap-dev] Splitter details question

Thu Jan 14 10:36:54 GMT 2010

JG> Couldn't you ignore the overlapping region completely and insert
JG> instead a step 3a) After lookup the nodes of a way, add them to all
JG> touched tiles too.

Sort of... this is roughly the approach I'm considering, but it's not as 
straightforward as it first appears. The problem is that it's not possible 
to hold all the information required in the XML file in memory for each node. 
Currently that's not a problem; the first pass of the splitter figures out 
what the resultant split areas will be and the second pass scans through 
the source OSM file (or the disk cache), reads in each node and writes all 
of its info to the appropriate output file. But as soon as you need to write 
out additional nodes during processing of the ways, the following happens:

1) Nodes and ways get intermingled in the output OSM file. I don't think 
this is a problem at all since there still won't be any forward references 
from eg ways to nodes, but currently the splitter writes out <node/> tags 
first, then <way/> tags, then <rel/> tags so maybe this will break/annoy 
something/someone?
2) We have to somehow find out all the information about the additional node 
so we can write it out as XML. This is the killer. It means the cache becomes 
required rather than optional because it's impractical to rescan the source 
OSM file, and a disk-based index into the cache is required because even 
custom code mapping from node ID to cache index (long) takes up several GB 
of memory when splitting a large OSM file. There will be a lot of these lookups 
required, so a very fast index with minimum disk access is essential.

My intention is to experiment with a b-tree variant that uses node IDs as 
keys (the example you posted uses Strings for both keys and values so isn't 
very suitable), and additionally is able to take some advantage of the fact 
that sequential node IDs are strongly correlated to their coordinates (and 
hence the tile they'll end up in). Careful prefetch caching will make a huge 
difference here.

Another optimisation will be to cache either all the node data (or at least 
the IDs, coordinates, and cache index) in memory for nodes that fall into 
the overlap area of each tile since these will almost always be the nodes 
we're most interested in when processing the ways. Only if we can't find 
the node in that cache would we need to hit the disk. The downside is that 
this could add significant memory overhead to an already strained heap.

An unrelated improvement I'd also like to build is an R-tree (in memory) 
of the split tiles to further speed mapping nodes to areas.

JG> If the line crosses an corner of a tile, it will be clipped at the
JG> boundaries or the other tiles. So there will be a small gap exactly
JG> in the corner. I have never seen such a thing and think this is
JG> neglectable. (On the other hand, if this gap is a piece of a
JG> motorway, routing will became a little unexpected)

You don't see this happen currently because the overlap almost always catches 
both of the nodes that fall outside the tile. Without the overlap, for each 
line segment in a way that crossed a tile boundary we'd need to test whether 
it clipped another tile or not. As far as I can see, this test would be required 
even if I implement all the above but disable the current overlap fucntionality.

Hope that wasn't too much of a waffle and helps explain where I see this 
heading.

Chris