[mkgmap-dev] character repertoires

Mon Feb 25 20:03:40 GMT 2013

Hi

> It actually is the CP1252 superset of ISO 8859-1, I see the printable
> characters in the range 128–159 (which ISO 8859 reserves for a second
> set of control characters).

Your observations neatly illustrate the way that the code works.

This is the current algorithm:

1a. If ASCII (no code page): all characters > 0x7f are transliterated
     into ASCII characters.
1b. If code-page=1252: all characters > 0xff are transliterated into
     Latin-1 characters.
1c. All other code pages: no transliteration.

2. Create a character set name by prepending "cp" to the
    code-page (e.g. cp1252).

3. Use the standard Java character set conversion with that name
    to convert the result of step 1. Any character that cannot be
    converted is replaced with a '?' symbol. This may vary with
    Java version and platform.
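
As a rough illustration of the three steps, here is a minimal sketch. It
is not the actual mkgmap code: the class name, the use of 0 to mean "no
code page", the choice of US-ASCII for that case, and the two-entry
transliterate() table are all assumptions made for the example.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class LabelEncodeSketch {
    // Hypothetical stand-in for the real transliterator: characters above
    // 'limit' are looked up in a tiny table; anything else is left alone
    // and will be handled (or replaced by '?') in step 3.
    static String transliterate(String s, int limit) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c <= limit) sb.append(c);
            else if (c == '\u2021') sb.append("++"); // DOUBLE DAGGER
            else if (c == '\u20ac') sb.append("Eu"); // EURO SIGN
            else sb.append(c);
        }
        return sb.toString();
    }

    // codePage == 0 is this sketch's way of saying "ascii, no code page".
    static byte[] encode(String label, int codePage) {
        // Step 1: transliterate, but only for ascii and 1252.
        String text;
        if (codePage == 0) text = transliterate(label, 0x7f);
        else if (codePage == 1252) text = transliterate(label, 0xff);
        else text = label;

        // Step 2: build the character set name from the code page.
        Charset cs = (codePage == 0)
                ? StandardCharsets.US_ASCII
                : Charset.forName("cp" + codePage);

        // Step 3: standard conversion; String.getBytes(Charset) substitutes
        // the charset's replacement bytes for unmappable characters.
        return text.getBytes(cs);
    }

    public static void main(String[] args) {
        byte[] b1252 = encode("a\u2021", 1252);
        byte[] b1250 = encode("a\u2021", 1250);
        System.out.println(new String(b1252, StandardCharsets.ISO_8859_1));
        System.out.println(b1250.length + " bytes for cp1250");
    }
}
```

With 1252 the double dagger is replaced before the converter ever sees
it, whereas with 1250 it goes straight to the converter and is mapped to
the single byte that code page has for it.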

That explains most of the observations, I think. U+2021 is
transliterated to ++ for 1252, but not for any other 125x code page.
The same goes for the Euro symbol, which becomes Eu.

> The micro sign U+00B5 μ becomes a ? on most code page maps, except for
> the Greek one, even though it is at the same position in all code pages.

U+00B5 is upper-cased to U+039C GREEK CAPITAL LETTER MU, which is only
present in the Greek code page.
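
The effect can be reproduced with plain Java, assuming the upper-casing
follows the standard Unicode mapping that String.toUpperCase applies. The
class and method names below are made up for the example, and cp1253 is
used as the Greek code page.

```java
import java.nio.charset.Charset;
import java.util.Locale;

public class MicroSignSketch {
    // True if every character of s has a mapping in the named charset.
    static boolean canEncode(String s, String charsetName) {
        return Charset.forName(charsetName).newEncoder().canEncode(s);
    }

    // Locale.ROOT keeps the result independent of the default locale.
    static String upper(String s) {
        return s.toUpperCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        String micro = "\u00b5";          // MICRO SIGN
        String up = upper(micro);         // becomes GREEK CAPITAL LETTER MU

        System.out.printf("upper-cased: U+%04X%n", (int) up.charAt(0));
        System.out.println("micro in cp1252: " + canEncode(micro, "cp1252"));
        System.out.println("MU in cp1252:    " + canEncode(up, "cp1252"));
        System.out.println("MU in cp1253:    " + canEncode(up, "cp1253"));
    }
}
```

So the micro sign itself is encodable, but once it has been upper-cased
the result is only encodable in the Greek page.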

> And in the Arabic map’s upper half, the latin based characters show up
> as “?”.

That's because only lower case characters are included.

> Another peculiar thing: while the Garmin does its usual weird
> upper/lower casing, TWO LABELS ARE ALL CAPS, namely those containing
> the ª feminine and º masculine ordinal indicators.

I don't know about this. Possibly a device thing?

> Asian:
> A map with CP1258 shows up with totally unlabeled streets, not even
> anything from the ASCII range.

Strange. Are the labels correct in the file? If you run strings on the
img, do you see the ASCII labels? If so, then it is a device thing.

So currently ASCII and 1252 are better than the other code pages, since
just about every Unicode character can be represented (at least
approximately, via transliteration), whereas in the other code pages
you are limited to characters from that page. It looks possible to fix
this by removing the transliteration step from where it is and only
using it when a character that is un-mappable into the target code
page is encountered.
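
A minimal sketch of that idea, not an actual patch: ask the encoder
whether each character is mappable and transliterate only when it is
not. The class name and the small fallback() table are hypothetical
stand-ins for the real transliterator, and characters outside the BMP
are not treated specially.

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;

public class FallbackTransliterationSketch {
    // Hypothetical replacement table; characters not listed are returned
    // unchanged and end up as '?' in the final conversion.
    static String fallback(char c) {
        switch (c) {
            case '\u2021': return "++"; // DOUBLE DAGGER
            case '\u20ac': return "Eu"; // EURO SIGN
            case '\u039c': return "M";  // GREEK CAPITAL LETTER MU
            default:       return String.valueOf(c);
        }
    }

    static byte[] encode(String label, String charsetName) {
        Charset cs = Charset.forName(charsetName);
        CharsetEncoder enc = cs.newEncoder();
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < label.length(); i++) {
            char c = label.charAt(i);
            // Keep the character when the code page has it; otherwise
            // fall back to the transliteration.
            if (enc.canEncode(c)) sb.append(c);
            else sb.append(fallback(c));
        }
        return sb.toString().getBytes(cs);
    }

    public static void main(String[] args) {
        System.out.println(encode("\u2021", "cp1252").length); // kept as one byte
        System.out.println(encode("\u039c", "cp1252").length); // transliterated
        System.out.println(encode("\u039c", "cp1253").length); // kept as one byte
    }
}
```

This way 1252 keeps the characters it really has, such as the double
dagger, and the other code pages gain the transliteration fallback.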

..Steve