
[mkgmap-dev] TYP files and character encoding

From Ticker Berkin rwb-mkgmap at jagit.co.uk on Thu Jan 16 23:02:02 GMT 2020

Hi Gerd

I've just noticed that a change to a function signature stopped a test
from compiling, so here is a patch for that.

Ticker

On Tue, 2020-01-14 at 09:43 +0000, Ticker Berkin wrote:
> Hi Gerd
> 
> Here is an updated patch that closes the file, although I find many files
> in mkgmap that don't have an explicit close(); I presume .finalize()
> will close them eventually.
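
On the close() point: cleanup at garbage-collection time is not guaranteed to run promptly, and Object.finalize() has been deprecated since Java 9, so an explicit close is the safer habit. A minimal sketch using try-with-resources follows. The names FirstLineReader and firstLine are made up for illustration and are not mkgmap code.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper for illustration; not part of mkgmap.
final class FirstLineReader {
    // Returns the first line of the file, or null if the file is empty.
    static String firstLine(Path file) throws IOException {
        // The reader is closed when the try block exits, normally or by exception.
        try (BufferedReader br = Files.newBufferedReader(file, StandardCharsets.UTF_8)) {
            return br.readLine();
        }
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("typ-sketch", ".txt");
        Files.write(tmp, "[_id]\nProductCode=1\n".getBytes(StandardCharsets.UTF_8));
        System.out.println(firstLine(tmp));
        Files.delete(tmp);
    }
}
```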
> 
> I'll do another patch for other text file handling, using
> StandardCharsets where possible, fixing the TokenScanner message for bad
> characters when the charset is not utf-8 and, if reasonable, allowing a
> BOM even if the file is opened as utf-8 anyway.
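
For context on the BOM remark: Java's UTF-8 decoder does not strip a leading byte order mark, so it reaches the caller as the character U+FEFF. One way to tolerate it is sketched below, assuming the input is already known to be UTF-8. BomSkippingReader, openUtf8 and readAll are hypothetical names, not anything in mkgmap.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.PushbackReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration; not part of mkgmap.
final class BomSkippingReader {
    // Opens the stream as UTF-8 and drops a leading U+FEFF if there is one.
    static Reader openUtf8(InputStream in) throws IOException {
        PushbackReader reader =
            new PushbackReader(new InputStreamReader(in, StandardCharsets.UTF_8), 1);
        int first = reader.read();
        if (first != -1 && first != 0xFEFF) {
            reader.unread(first); // not a BOM, so put the character back
        }
        return reader;
    }

    // Reads everything that is left in the reader.
    static String readAll(Reader reader) throws IOException {
        StringBuilder sb = new StringBuilder();
        for (int c = reader.read(); c != -1; c = reader.read()) {
            sb.append((char) c);
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        byte[] withBom = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'a', 'b', 'c'};
        System.out.println(readAll(openUtf8(new ByteArrayInputStream(withBom))));
    }
}
```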
> 
> Ticker
> 
> On Tue, 2020-01-14 at 08:21 +0000, Gerd Petermann wrote:
> > Hi Ticker,
> > 
> > thanks for the patch.
> > 
> > Please review TypCompiler.CharsetProbe.  BufferedReader br is not
> > closed. Is that intended?
> > 
> > I see that we have a mix of "utf-8" and "UTF-8" in the mkgmap
> > sources. I think it would be good to use StandardCharsets.UTF_8
> > where possible and unify the rest.
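
As a small illustration of why the constant is attractive: charset names are looked up case-insensitively, so "utf-8" and "UTF-8" resolve to the same charset, and the Charset-based overloads avoid the checked UnsupportedEncodingException that the String-name overloads declare. The class name CharsetNames and its methods are invented for this sketch.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; not part of mkgmap.
final class CharsetNames {
    // True if both names resolve to the same charset.
    static boolean sameCharset(String a, String b) {
        return Charset.forName(a).equals(Charset.forName(b));
    }

    // No try/catch needed, unlike new String(bytes, "UTF-8").
    static String decodeUtf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        System.out.println(sameCharset("utf-8", "UTF-8"));
        System.out.println(Charset.forName("utf-8").equals(StandardCharsets.UTF_8));
    }
}
```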
> > 
> > Gerd
> > 
> > ________________________________________
> > > From: mkgmap-dev <mkgmap-dev-bounces at lists.mkgmap.org.uk> on behalf
> > > of Ticker Berkin <rwb-mkgmap at jagit.co.uk>
> > > Sent: Monday, 13 January 2020 11:34
> > > To: Development list for mkgmap
> > > Subject: Re: [mkgmap-dev] TYP files and character encoding
> > 
> > Hi Gerd
> > 
> > I've updated this patch with changes to TypCompiler CharsetProbe:
> > 
> > > 1/ looks for a Unicode BOM in various encodings near the start of the
> > > file.
> > > 2/ looks for a line containing "-*- coding: charset -*-" near the start
> > > of the file.
> > > 3/ retains the check for "CodePage=" coding for compatibility.
> > > 4/ in the absence of the above, sets the reading charset to utf-8 if
> > > the file is valid utf-8, otherwise to Cp1252.
> > > 5/ fixes the bad character message from the scanner to say what the
> > > charset really is rather than saying "utf-8" regardless.
> > > 6/ removes the logic that checks if String... lines, read in the
> > > charset it is currently trying, can be encoded in the presumed output
> > > CodePage.
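
Steps 1/ to 4/ above can be sketched roughly as follows. This is an illustration of the probe order only, not the patch's code: the class CharsetProbeSketch, its regular expressions, the 1024-byte head window and the "cp" + number mapping are all assumptions, and only UTF-8 and UTF-16 BOMs at the very start of the data are handled.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-in for the probe order described above.
// This is NOT the code of mkgmap's TypCompiler.CharsetProbe.
final class CharsetProbeSketch {
    // Editor-style declaration, e.g. "; -*- coding: UTF-8 -*-"
    private static final Pattern CODING =
        Pattern.compile("-\\*-\\s*coding:\\s*([A-Za-z0-9][\\w.:+-]*)\\s*-\\*-");
    // Legacy "CodePage=1252" statement at the start of a line
    private static final Pattern CODE_PAGE =
        Pattern.compile("(?im)^\\s*CodePage\\s*=\\s*(\\d+)");

    static Charset probe(byte[] data) {
        // 1/ byte order mark (this sketch only knows UTF-8 and UTF-16)
        if (startsWith(data, 0xEF, 0xBB, 0xBF)) return StandardCharsets.UTF_8;
        if (startsWith(data, 0xFE, 0xFF)) return StandardCharsets.UTF_16BE;
        if (startsWith(data, 0xFF, 0xFE)) return StandardCharsets.UTF_16LE;

        // ISO-8859-1 maps every byte to a char, which is enough to find ASCII markers.
        String head = new String(data, 0, Math.min(data.length, 1024),
                                 StandardCharsets.ISO_8859_1);

        // 2/ "-*- coding: charset -*-" line
        Matcher m = CODING.matcher(head);
        if (m.find() && Charset.isSupported(m.group(1))) {
            return Charset.forName(m.group(1));
        }

        // 3/ "CodePage=" kept for compatibility; 65001 is the code page number for UTF-8
        m = CODE_PAGE.matcher(head);
        if (m.find()) {
            if (m.group(1).equals("65001")) return StandardCharsets.UTF_8;
            String name = "cp" + m.group(1);
            if (Charset.isSupported(name)) return Charset.forName(name);
        }

        // 4/ fall back to utf-8 if the bytes are valid utf-8, otherwise Cp1252
        return isValidUtf8(data) ? StandardCharsets.UTF_8 : Charset.forName("windows-1252");
    }

    // A fresh decoder reports malformed input by throwing, which is what we want here.
    static boolean isValidUtf8(byte[] data) {
        try {
            StandardCharsets.UTF_8.newDecoder().decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    private static boolean startsWith(byte[] data, int... prefix) {
        if (data.length < prefix.length) return false;
        for (int i = 0; i < prefix.length; i++) {
            if ((data[i] & 0xFF) != prefix[i]) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] latin = {'c', 'a', 'f', (byte) 0xE9, 's'}; // 0xE9 is not valid utf-8 here
        System.out.println(probe(latin));
        System.out.println(probe("caf\u00e9s".getBytes(StandardCharsets.UTF_8)));
    }
}
```

Checking the explicit markers before the validity test means a file that declares its encoding is never second-guessed; the utf-8 validity check is only the last resort.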
> > 
> > The final result of this patch should be that:
> > 
> > a/ No existing usage is broken
> > b/ 2 methods to indicate the charset/encoding of the file that are
> > commonly used by text editors can be used and are taken notice of.
> > Previously, just the UTF-8 BOM was detected.
> > c/ Typ files can, and should from now on, be written in utf-8
> > > d/ labels for languages not supported in the --code-page of the output
> > > img just generate a warning in mkgmap.log.x
> > 
> > Ticker
> > 
> > 
> > On Sat, 2019-12-21 at 16:11 +0000, Ticker Berkin wrote:
> > > Hi Gerd
> > > 
> > > Attached is a patch that:
> > > 
> > > > Doesn't use the 'CodePage=' command in the typ-file to determine the
> > > > output character encoding of the typ-file, rather it uses the main map
> > > > encoding from the --code-page argument.
> > > 
> > > > log.warn's any typ labels that can't be encoded in the --code-page,
> > > > rather than just giving up with a message like:
> > > > TYP file cannot be written in code page 1252
> > > 
> > > The message:
> > > > WARNING: SortCode in TYP txt file different from command line
> > > > setting
> > > > that was written direct to System.out is changed to a log.warn, and
> > > > it shouldn't happen anyway now.
> > > 
> > > > For the moment, the 'CodePage=' command in the typ-file is, under some
> > > > circumstances, used to determine the encoding of the typ-file itself
> > > > and I've left this alone for compatibility with existing usage.
> > > > Sometime in January I'll provide a better method for this.
> > > 
> > > Ticker
> > > 
> > > 
> > > On Wed, 2019-12-18 at 19:54 +0000, Ticker Berkin wrote:
> > > > Hi Gerd
> > > > 
> > > > > I think it is best to continue with the ideas for typ-files that:
> > > > > 
> > > > > 1/ they can be in any character set and we just need a better way of
> > > > > working out the correct one - see my posting earlier today.
> > > > 
> > > > > 2/ it can include as many languages as anyone can be bothered to add,
> > > > > and so has to be in a character set that allows the languages to be
> > > > > added, implying Unicode for a common one (more particularly, UTF-8)
> > > > 
> > > > > 3/ the codepage= statement should be redundant and ignored for
> > > > > controlling the output character set, which should be taken from the
> > > > > map, but its use for determining the input coding might need to be
> > > > > kept for a while for compatibility.
> > > > 
> > > > > 4/ the messages my hack generates should be turned into 1 warning or
> > > > > information message per language or maybe suppressed altogether. If
> > > > > someone is generating a map with a character set that doesn't support
> > > > > a particular language, they really won't care that the data for other
> > > > > languages that have an incompatible representation with their
> > > > > language won't be there.
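
Point 4/ could look something like the sketch below: count the labels that cannot be encoded, grouped by language code, and emit one line per language. LabelWarningSummary, its Label holder and the message wording are all hypothetical. In mkgmap the lines would presumably go through the logger rather than being returned as strings.

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch; neither the class nor the message text comes from mkgmap.
final class LabelWarningSummary {
    // Minimal stand-in for a typ label: a language code plus its text.
    static final class Label {
        final int lang;
        final String text;

        Label(int lang, String text) {
            this.lang = lang;
            this.text = text;
        }
    }

    // Builds one warning line per language that has at least one unencodable label.
    static List<String> warnings(List<Label> labels, Charset target) {
        CharsetEncoder encoder = target.newEncoder();
        Map<Integer, Integer> skipped = new TreeMap<>();
        for (Label label : labels) {
            if (!encoder.canEncode(label.text)) {
                skipped.merge(label.lang, 1, Integer::sum);
            }
        }
        List<String> lines = new ArrayList<>();
        for (Map.Entry<Integer, Integer> e : skipped.entrySet()) {
            lines.add(String.format("Language 0x%02x: %d label(s) cannot be encoded in %s",
                                    e.getKey(), e.getValue(), target.name()));
        }
        return lines;
    }

    public static void main(String[] args) {
        List<Label> labels = Arrays.asList(
            new Label(0x15, "Ziele\u0144"),          // Polish, contains U+0144
            new Label(0x15, "\u0179r\u00f3d\u0142o"), // Polish, contains U+0179 and U+0142
            new Label(0x01, "For\u00eat"));          // French text, fits in windows-1252
        for (String line : warnings(labels, Charset.forName("windows-1252"))) {
            System.out.println(line);
        }
    }
}
```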
> > > > 
> > > > Ticker
> > > > 
> > > > On Wed, 2019-12-18 at 19:08 +0000, Gerd Petermann wrote:
> > > > > Hi Ticker,
> > > > > 
> > > > > > I think I understand now why we didn't have a default typ file ;)
> > > > > > If I got that right I should revert the changes in r4395 and mkgmap
> > > > > > should either disallow, or warn loudly, when a typ file with a
> > > > > > different codepage is merged?
> > > > > > Or should we force the usage of unicode codepage?
> > > > > > Or is it possible to compile mapnik.txt with cp 1252 (or any other)
> > > > > > in a way that only those lines which contain non-matching characters
> > > > > > are ignored?
> > > > > 
> > > > > Gerd
> > > > > 
> > > > > 
> > > > > ________________________________________
> > > > > > From: mkgmap-dev <mkgmap-dev-bounces at lists.mkgmap.org.uk> on
> > > > > > behalf of Ticker Berkin <rwb-mkgmap at jagit.co.uk>
> > > > > > Sent: Wednesday, 18 December 2019 19:46
> > > > > > To: mkgmap development
> > > > > > Subject: [mkgmap-dev] TYP files and character encoding
> > > > > 
> > > > > Hi
> > > > > 
> > > > > A couple of problems with typ-files and unicode.
> > > > > 
> > > > > > With 'Codepage=65001' the final contents of the labels in mapnik.typ
> > > > > > that is included with the composite map is unicode, but if the map is
> > > > > > codepage 1252, the unicode characters with the top bit set are simply
> > > > > > displayed as if in 1252.
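
The effect described here can be reproduced in a few lines: encode a string as UTF-8, then decode the same bytes as windows-1252. MojibakeDemo and misread are names made up for this sketch.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; not part of mkgmap.
final class MojibakeDemo {
    // Encodes text in one charset and decodes the same bytes in another.
    static String misread(String text, Charset actual, Charset assumed) {
        return new String(text.getBytes(actual), assumed);
    }

    public static void main(String[] args) {
        Charset cp1252 = Charset.forName("windows-1252");
        // U+00E9 (e acute) is the two bytes C3 A9 in UTF-8;
        // read as windows-1252 they become U+00C3 followed by U+00A9.
        String shown = misread("\u00e9", StandardCharsets.UTF_8, cp1252);
        System.out.println(shown.equals("\u00c3\u00a9"));
    }
}
```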
> > > > > 
> > > > > > Removing the codepage statement from mapnik.txt and making fixes
> > > > > > elsewhere to ensure that the file is read correctly as utf-8 and then
> > > > > > generating a map with --code-page=1252, it gives the error:
> > > > > 
> > > > > SEVE: uk.me.parabola.imgfmt.MapFailedException
> > > > >  ../svn/trunk/resources/typ-files/mapnik.txt:
> > > > >  (thrown in TypCompiler.makeMap())
> > > > >  TYP file cannot be written in code page 1252
> > > > > 
> > > > > > Changing the exception handling in imgfmt/app/typ/TypElement.java,
> > > > > > so that makeLabelBlock() reads as
> > > > > ...
> > > > >     CharBuffer cb = CharBuffer.wrap(tl.getText());
> > > > >     try {
> > > > >         ByteBuffer buffer = encoder.encode(cb);
> > > > >         out.put((byte) tl.getLang());
> > > > >         out.put(buffer);
> > > > >         out.put((byte) 0);
> > > > > >     } catch (CharacterCodingException ignore) {
> > > > > >         // ignore.printStackTrace();
> > > > > >         String name = encoder.charset().name();
> > > > > >         System.out.println("Cannot represent String=" +
> > > > > >             tl.getLang() + "," + tl.getText() +
> > > > > >             " in CodePage=" + name);
> > > > > >         // throw new TypLabelException(name);
> > > > > >     }
> > > > > ...
> > > > > 
> > > > > It gives output like:
> > > > > > Cannot represent String=21,Gara|e in CodePage=windows-1252
> > > > > > Cannot represent String=21,Obszar przemysBowy in CodePage=windows-1252
> > > > > > Cannot represent String=21,ZieleD in CodePage=windows-1252
> > > > > > Cannot represent String=21,Zaro[la in CodePage=windows-1252
> > > > > > Cannot represent String=21,MokradBa in CodePage=windows-1252
> > > > > > Cannot represent String=21,Droga wojew\363dzka (B^Ecznik) in CodePage=windows-1252
> > > > > > Cannot represent String=21,Droga szybkiego ruchu  (B^Ecznik) in CodePage=windows-1252
> > > > > > Cannot represent String=21,Droga szybkiego ruchu  (B^Ecznik) in CodePage=windows-1252
> > > > > > Cannot represent String=21,Zcie|ka rowerowa in CodePage=windows-1252
> > > > > > Cannot represent String=21,Wybrze|e in CodePage=windows-1252
> > > > > > Cannot represent String=21,Zcie|ka in CodePage=windows-1252
> > > > > > Cannot represent String=21,StrumieD in CodePage=windows-1252
> > > > > > Cannot represent String=21,Granica paDstwa in CodePage=windows-1252
> > > > > > Cannot represent String=21,Rzeka, KanaB in CodePage=windows-1252
> > > > > > Cannot represent String=21,StrumieD in CodePage=windows-1252
> > > > > > Cannot represent String=21,Ruroci^Eg in CodePage=windows-1252
> > > > > > Cannot represent String=21,Kabel wysokiego napi^Ycia in CodePage=windows-1252
> > > > > > Cannot represent String=21,Tor wy[cigowy in CodePage=windows-1252
> > > > > > Cannot represent String=21,Droga szybkiego ruchu  (B^Ecznik) in CodePage=windows-1252
> > > > > > Cannot represent String=21,Droga krajowa (B^Ecznik) in CodePage=windows-1252
> > > > > > Cannot represent String=21,Droga wojew\363dzka (B^Ecznik) in CodePage=windows-1252
> > > > > > Cannot represent String=21,Wie[ (>5 tys.) in CodePage=windows-1252
> > > > > > Cannot represent String=21,Wie[ (>5 tys.) in CodePage=windows-1252
> > > > > > Cannot represent String=21,Restauracja (AmerykaDska) in CodePage=windows-1252
> > > > > > Cannot represent String=21,Restauracja (ChiDska) in CodePage=windows-1252
> > > > > > Cannot represent String=21,Restauracja (Mi^Ydzynarodowa) in CodePage=windows-1252
> > > > > > Cannot represent String=21,Restauracja (WBoska) in CodePage=windows-1252
> > > > > > Cannot represent String=21,Restauracja (MeksykaDska) in CodePage=windows-1252
> > > > > > Cannot represent String=21,Restauracja (P^Eczki) in CodePage=windows-1252
> > > > > > Cannot represent String=21,Restauracja (WegetariaDska) in CodePage=windows-1252
> > > > > > Cannot represent String=21,Kr^Ygle in CodePage=windows-1252
> > > > > > Cannot represent String=21,Sklep odzie|owy in CodePage=windows-1252
> > > > > > Cannot represent String=21,Wypo|yczalnia samochod\363w in CodePage=windows-1252
> > > > > > Cannot represent String=21,Gara| in CodePage=windows-1252
> > > > > > Cannot represent String=21,Sprzeda| samochod\363w in CodePage=windows-1252
> > > > > > Cannot represent String=21,Sklep |eglarski in CodePage=windows-1252
> > > > > > Cannot represent String=21,S^Ed in CodePage=windows-1252
> > > > > > Cannot represent String=21,O[rodek kultury in CodePage=windows-1252
> > > > > > Cannot represent String=21,Wi^Yzienie in CodePage=windows-1252
> > > > > > Cannot represent String=21,Stra| po|arna in CodePage=windows-1252
> > > > > > Cannot represent String=21,SBupek in CodePage=windows-1252
> > > > > > Cannot represent String=21,PrzystaD in CodePage=windows-1252
> > > > > > Cannot represent String=21,L^Edowisko helikopterowe in CodePage=windows-1252
> > > > > > Cannot represent String=21,Wie|a in CodePage=windows-1252
> > > > > > Cannot represent String=21,yr\363dBo in CodePage=windows-1252
> > > > > > Cannot represent String=21,Pla|a in CodePage=windows-1252
> > > > > > Cannot represent String=21,Przyl^Edek in CodePage=windows-1252
> > > > > > Cannot represent String=21,SkaBa in CodePage=windows-1252
> > > > > 
> > > > > > Which makes sense if codepage 1252 doesn't handle Polish (hex 0x15,
> > > > > > decimal 21).
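
That windows-1252 lacks some Polish letters is easy to confirm with CharsetEncoder.canEncode. The sketch below lists the characters of a string that the target charset cannot represent. UnencodableChars is a made-up name and the sample words are just examples, not taken from the log above.

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;

// Hypothetical demo class; not part of mkgmap.
final class UnencodableChars {
    // Returns the characters of text that the target charset cannot encode, in order.
    static String unencodable(String text, Charset target) {
        CharsetEncoder encoder = target.newEncoder();
        StringBuilder missing = new StringBuilder();
        text.codePoints().forEach(cp -> {
            String ch = new String(Character.toChars(cp));
            if (!encoder.canEncode(ch)) {
                missing.append(ch);
            }
        });
        return missing.toString();
    }

    public static void main(String[] args) {
        Charset cp1252 = Charset.forName("windows-1252");
        // A Polish word: U+0179 and U+0142 are not in windows-1252, U+00F3 is.
        System.out.println(unencodable("\u0179r\u00f3d\u0142o", cp1252).length());
        // A French word: U+00EA is in windows-1252, so nothing is missing.
        System.out.println(unencodable("For\u00eat", cp1252).isEmpty());
    }
}
```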
> > > > > 
> > > > > > NB the non-ASCII characters in the above are messed up by my cutting
> > > > > > and pasting.
> > > > > 
> > > > > > Checking the French, on my Garmin device, the type descriptions now
> > > > > > display accents correctly.
> > > > > 
> > > > > Ticker
> > > > > 
> > > > > _______________________________________________
> > > > > mkgmap-dev mailing list
> > > > > mkgmap-dev at lists.mkgmap.org.uk
> > > > > http://www.mkgmap.org.uk/mailman/listinfo/mkgmap-dev
-------------- next part --------------
A non-text attachment was scrubbed...
Name: typCodePage-test.patch
Type: text/x-patch
Size: 1159 bytes
Desc: not available
URL: <http://www.mkgmap.org.uk/pipermail/mkgmap-dev/attachments/20200116/d90a74f8/attachment-0001.bin>
