[mkgmap-dev] TYP files and character encoding
From Ticker Berkin rwb-mkgmap at jagit.co.uk on Thu Jan 16 23:02:02 GMT 2020
Hi Gerd

I've just noticed that a change to a function profile stopped a test
from compiling, so here is a patch for that.

Ticker

On Tue, 2020-01-14 at 09:43 +0000, Ticker Berkin wrote:
> Hi Gerd
>
> Here is an updated patch that closes the file, although I find many
> files in mkgmap that don't have an explicit close(), but I presume
> .finalize() will close them eventually.
>
> I'll do another patch for other text file handling, using
> StandardCharsets where possible and fixing the TokenScanner message
> for bad characters if not utf-8 and, if reasonable, allowing a BOM
> even if the file is opened as utf-8 anyway.
>
> Ticker
>
> On Tue, 2020-01-14 at 08:21 +0000, Gerd Petermann wrote:
> > Hi Ticker,
> >
> > thanks for the patch.
> >
> > Please review TypCompiler.CharsetProbe. BufferedReader br is not
> > closed. Is that intended?
> >
> > I see that we have a mix of "utf-8" and "UTF-8" in the mkgmap
> > sources. I think it would be good to use StandardCharsets.UTF_8
> > where possible and unify the rest.
> >
> > Gerd
> >
> > ________________________________________
> > From: mkgmap-dev <mkgmap-dev-bounces at lists.mkgmap.org.uk> on
> > behalf of Ticker Berkin <rwb-mkgmap at jagit.co.uk>
> > Sent: Monday, 13 January 2020 11:34
> > To: Development list for mkgmap
> > Subject: Re: [mkgmap-dev] TYP files and character encoding
> >
> > Hi Gerd
> >
> > I've updated this patch with changes to TypCompiler CharsetProbe:
> >
> > 1/ looks for unicode BOM in various encodings near start of file.
> > 2/ looks for line containing "-*- coding: charset -*-" near start
> > of the file.
> > 3/ retains the check for "CodePage=" coding for compatibility.
> > 4/ in the absence of the above, sets the reading charset to utf-8
> > if the file is valid utf-8, otherwise to Cp1252.
> > 5/ fixes the bad character message from the scanner to say what
> > the charset really is rather than saying "utf-8" regardless.
> > 6/ removes the logic that checks if String... lines, read in the
> > charset it is currently trying, can be encoded in the presumed
> > output CodePage.
> >
> > The final result of this patch should be that:
> >
> > a/ No existing usage is broken
> > b/ 2 methods to indicate the charset/encoding of the file that are
> > commonly used by text editors can be used and are taken notice of.
> > Previously, just the UTF-8 BOM was detected.
> > c/ Typ files can, and should from now on, be written in utf-8
> > d/ labels for languages not supported in the --code-page of the
> > output img just generate a warning in mkgmap.log.x
> >
> > Ticker
> >
> >
> > On Sat, 2019-12-21 at 16:11 +0000, Ticker Berkin wrote:
> > > Hi Gerd
> > >
> > > Attached is a patch that:
> > >
> > > Doesn't use the 'CodePage=' command in the typ-file to determine
> > > the output character encoding of the typ-file; rather it uses
> > > the main map encoding from the --code-page argument.
> > >
> > > log.warn's any typ labels that can't be encoded in the
> > > --code-page, rather than just giving up with a message like:
> > > > TYP file cannot be written in code page 1252
> > >
> > > The message:
> > > > WARNING: SortCode in TYP txt file different from command line setting
> > > that was written direct to System.out is changed to a log.warn
> > > and it shouldn't happen anyway now
> > >
> > > For the moment, the 'CodePage=' command in the typ-file is, under
> > > some circumstances, used to determine the encoding of the
> > > typ-file itself and I've left this alone for compatibility with
> > > existing usage.
> > > Sometime in January I'll provide a better method for this.
> > >
> > > Ticker
> > >
> > >
> > > On Wed, 2019-12-18 at 19:54 +0000, Ticker Berkin wrote:
> > > > Hi Gerd
> > > >
> > > > I think it is best to continue with the ideas for typ-files
> > > > that:
> > > >
> > > > 1/ they can be in any character set and we just need a better
> > > > way of working out the correct one - see my posting earlier
> > > > today.
> > > >
> > > > 2/ it can include as many languages as anyone can be bothered
> > > > to add, and so has to be in a character set that allows the
> > > > languages to be added, implying unicode for a common one (more
> > > > particularly, UTF-8)
> > > >
> > > > 3/ the codepage= statement should be redundant and ignored for
> > > > controlling the output character set, which should be taken
> > > > from the map, but its use for determining the input coding
> > > > might need to be kept for a while for compatibility.
> > > >
> > > > 4/ the messages my hack generates should be turned into 1
> > > > warning or information message per language or maybe
> > > > suppressed altogether. If someone is generating a map with a
> > > > character set that doesn't support a particular language, they
> > > > really won't care that data for other languages that have an
> > > > incompatible representation with their language won't be
> > > > there.
> > > >
> > > > Ticker
> > > >
> > > > On Wed, 2019-12-18 at 19:08 +0000, Gerd Petermann wrote:
> > > > > Hi Ticker,
> > > > >
> > > > > I think I understand now why we didn't have a default typ
> > > > > file ;)
> > > > > If I got that right I should revert the changes in r4395 and
> > > > > mkgmap should not allow, or should warn loudly, when a typ
> > > > > file with a different codepage is merged?
> > > > > Or should we force the usage of unicode codepage?
> > > > > Or is it possible to compile mapnik.txt with cp 1252 (or any
> > > > > other) in a way that only those lines which contain
> > > > > non-matching characters are ignored?
> > > > >
> > > > > Gerd
> > > > >
> > > > >
> > > > > ________________________________________
> > > > > From: mkgmap-dev <mkgmap-dev-bounces at lists.mkgmap.org.uk>
> > > > > on behalf of Ticker Berkin <rwb-mkgmap at jagit.co.uk>
> > > > > Sent: Wednesday, 18 December 2019 19:46
> > > > > To: mkgmap development
> > > > > Subject: [mkgmap-dev] TYP files and character encoding
> > > > >
> > > > > Hi
> > > > >
> > > > > A couple of problems with typ-files and unicode.
> > > > >
> > > > > With 'Codepage=65001' the final contents of the labels in
> > > > > mapnik.typ that is included with the composite map is
> > > > > unicode, but if the map is codepage 1252, the unicode
> > > > > characters with the top bit set are simply displayed as if
> > > > > in 1252.
> > > > >
> > > > > Removing the codepage statement from mapnik.txt and making
> > > > > fixes elsewhere to ensure that the file is read correctly as
> > > > > utf-8 and then generating a map with --code-page=1252, it
> > > > > gives the error:
> > > > >
> > > > > SEVE: uk.me.parabola.imgfmt.MapFailedException
> > > > > ../svn/trunk/resources/typ-files/mapnik.txt:
> > > > > (thrown in TypCompiler.makeMap())
> > > > > TYP file cannot be written in code page 1252
> > > > >
> > > > > Changing the exception handling in
> > > > > imgfmt/app/typ/TypElement.java, so that makeLabelBlock()
> > > > > reads as
> > > > > ...
> > > > >     CharBuffer cb = CharBuffer.wrap(tl.getText());
> > > > >     try {
> > > > >         ByteBuffer buffer = encoder.encode(cb);
> > > > >         out.put((byte) tl.getLang());
> > > > >         out.put(buffer);
> > > > >         out.put((byte) 0);
> > > > >     } catch (CharacterCodingException ignore) {
> > > > >         // ignore.printStackTrace();
> > > > >         String name = encoder.charset().name();
> > > > >         System.out.println("Cannot represent String=" +
> > > > >                 tl.getLang() + "," + tl.getText() +
> > > > >                 " in CodePage=" + name);
> > > > >         // throw new TypLabelException(name);
> > > > >     }
> > > > > ...
> > > > >
> > > > > It gives output like:
> > > > > Cannot represent String=21,Gara|e in CodePage=windows-1252
> > > > > Cannot represent String=21,Obszar przemysBowy in CodePage=windows-1252
> > > > > Cannot represent String=21,ZieleD in CodePage=windows-1252
> > > > > Cannot represent String=21,Zaro[la in CodePage=windows-1252
> > > > > Cannot represent String=21,MokradBa in CodePage=windows-1252
> > > > > Cannot represent String=21,Droga wojew\363dzka (B^Ecznik) in CodePage=windows-1252
> > > > > Cannot represent String=21,Droga szybkiego ruchu (B^Ecznik) in CodePage=windows-1252
> > > > > Cannot represent String=21,Droga szybkiego ruchu (B^Ecznik) in CodePage=windows-1252
> > > > > Cannot represent String=21,Zcie|ka rowerowa in CodePage=windows-1252
> > > > > Cannot represent String=21,Wybrze|e in CodePage=windows-1252
> > > > > Cannot represent String=21,Zcie|ka in CodePage=windows-1252
> > > > > Cannot represent String=21,StrumieD in CodePage=windows-1252
> > > > > Cannot represent String=21,Granica paDstwa in CodePage=windows-1252
> > > > > Cannot represent String=21,Rzeka, KanaB in CodePage=windows-1252
> > > > > Cannot represent String=21,StrumieD in CodePage=windows-1252
> > > > > Cannot represent String=21,Ruroci^Eg in CodePage=windows-1252
> > > > > Cannot represent String=21,Kabel wysokiego napi^Ycia in CodePage=windows-1252
> > > > > Cannot represent String=21,Tor wy[cigowy in CodePage=windows-1252
> > > > > Cannot represent String=21,Droga szybkiego ruchu (B^Ecznik) in CodePage=windows-1252
> > > > > Cannot represent String=21,Droga krajowa (B^Ecznik) in CodePage=windows-1252
> > > > > Cannot represent String=21,Droga wojew\363dzka (B^Ecznik) in CodePage=windows-1252
> > > > > Cannot represent String=21,Wie[ (>5 tys.) in CodePage=windows-1252
> > > > > Cannot represent String=21,Wie[ (>5 tys.) in CodePage=windows-1252
> > > > > Cannot represent String=21,Restauracja (AmerykaDska) in CodePage=windows-1252
> > > > > Cannot represent String=21,Restauracja (ChiDska) in CodePage=windows-1252
> > > > > Cannot represent String=21,Restauracja (Mi^Ydzynarodowa) in CodePage=windows-1252
> > > > > Cannot represent String=21,Restauracja (WBoska) in CodePage=windows-1252
> > > > > Cannot represent String=21,Restauracja (MeksykaDska) in CodePage=windows-1252
> > > > > Cannot represent String=21,Restauracja (P^Eczki) in CodePage=windows-1252
> > > > > Cannot represent String=21,Restauracja (WegetariaDska) in CodePage=windows-1252
> > > > > Cannot represent String=21,Kr^Ygle in CodePage=windows-1252
> > > > > Cannot represent String=21,Sklep odzie|owy in CodePage=windows-1252
> > > > > Cannot represent String=21,Wypo|yczalnia samochod\363w in CodePage=windows-1252
> > > > > Cannot represent String=21,Gara| in CodePage=windows-1252
> > > > > Cannot represent String=21,Sprzeda| samochod\363w in CodePage=windows-1252
> > > > > Cannot represent String=21,Sklep |eglarski in CodePage=windows-1252
> > > > > Cannot represent String=21,S^Ed in CodePage=windows-1252
> > > > > Cannot represent String=21,O[rodek kultury in CodePage=windows-1252
> > > > > Cannot represent String=21,Wi^Yzienie in CodePage=windows-1252
> > > > > Cannot represent String=21,Stra| po|arna in CodePage=windows-1252
> > > > > Cannot represent String=21,SBupek in CodePage=windows-1252
> > > > > Cannot represent String=21,PrzystaD in CodePage=windows-1252
> > > > > Cannot represent String=21,L^Edowisko helikopterowe in CodePage=windows-1252
> > > > > Cannot represent String=21,Wie|a in CodePage=windows-1252
> > > > > Cannot represent String=21,yr\363dBo in CodePage=windows-1252
> > > > > Cannot represent String=21,Pla|a in CodePage=windows-1252
> > > > > Cannot represent String=21,Przyl^Edek in CodePage=windows-1252
> > > > > Cannot represent String=21,SkaBa in CodePage=windows-1252
> > > > >
> > > > > Which makes sense if codepage 1252 doesn't handle Polish
> > > > > (hex 0x15, decimal 21).
> > > > >
> > > > > NB the non-ascii characters in the above are messed up by my
> > > > > cutting and pasting.
> > > > >
> > > > > Checking the French, on my Garmin device, the type
> > > > > descriptions now display accents correctly.
> > > > >
> > > > > Ticker
> > > > >
> > > > > _______________________________________________
> > > > > mkgmap-dev mailing list
> > > > > mkgmap-dev at lists.mkgmap.org.uk
> > > > > http://www.mkgmap.org.uk/mailman/listinfo/mkgmap-dev
-------------- next part --------------
A non-text attachment was scrubbed...
Name: typCodePage-test.patch
Type: text/x-patch
Size: 1159 bytes
Desc: not available
URL: <http://www.mkgmap.org.uk/pipermail/mkgmap-dev/attachments/20200116/d90a74f8/attachment-0001.bin>
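
For readers following the thread, the detection order listed in the 13 January message (BOM, then a "-*- coding: charset -*-" line, then "CodePage=", then a utf-8 validity check with a Cp1252 fallback) could be sketched roughly as below. This is not the TypCompiler.CharsetProbe code from the patch, which is not shown in the thread. The class name CharsetProbeSketch, the probe() method, the 1024-byte search window and the regular expressions are all assumptions made for illustration, and the BOM is only checked at the very start of the data.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch; not the mkgmap TypCompiler.CharsetProbe implementation.
final class CharsetProbeSketch {
    // Assumed limit: only this many leading bytes are searched for declarations.
    private static final int HEAD_BYTES = 1024;
    private static final Pattern CODING =
            Pattern.compile("-\\*-\\s*coding:\\s*([\\w.-]+)\\s*-\\*-");
    private static final Pattern CODE_PAGE =
            Pattern.compile("(?im)^\\s*CodePage\\s*=\\s*(\\d+)");

    static Charset probe(byte[] data) {
        // 1. Byte order mark at the very start of the data.
        if (startsWith(data, 0xEF, 0xBB, 0xBF)) return StandardCharsets.UTF_8;
        if (startsWith(data, 0xFE, 0xFF)) return StandardCharsets.UTF_16BE;
        if (startsWith(data, 0xFF, 0xFE)) return StandardCharsets.UTF_16LE;

        // Decode the head as ISO-8859-1 so that every byte maps to one char and
        // ASCII declarations can be matched whatever the real charset is.
        String head = new String(data, 0, Math.min(data.length, HEAD_BYTES),
                StandardCharsets.ISO_8859_1);

        // 2. Editor-style "-*- coding: charset -*-" declaration.
        Matcher m = CODING.matcher(head);
        if (m.find()) {
            Charset cs = lookup(m.group(1));
            if (cs != null) return cs;
        }

        // 3. Legacy "CodePage=nnnn" line; 65001 is the utf-8 code page.
        m = CODE_PAGE.matcher(head);
        if (m.find()) {
            String cp = m.group(1);
            Charset cs = cp.equals("65001") ? StandardCharsets.UTF_8 : lookup("cp" + cp);
            if (cs != null) return cs;
        }

        // 4. No declaration: utf-8 if the bytes decode cleanly, otherwise Cp1252.
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(data));
            return StandardCharsets.UTF_8;
        } catch (CharacterCodingException e) {
            return Charset.forName("windows-1252");
        }
    }

    private static boolean startsWith(byte[] data, int... prefix) {
        if (data.length < prefix.length) return false;
        for (int i = 0; i < prefix.length; i++) {
            if ((data[i] & 0xFF) != prefix[i]) return false;
        }
        return true;
    }

    // Returns null when the name is illegal or the charset is not installed.
    private static Charset lookup(String name) {
        try {
            return Charset.forName(name);
        } catch (IllegalArgumentException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        byte[] utf8 = "String=0x15,Ziele\u0144".getBytes(StandardCharsets.UTF_8);
        byte[] latin = "String=0x01,Caf\u00e9 bar".getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(probe(utf8));
        System.out.println(probe(latin));
    }
}
```

Working on a byte array rather than a reader sidesteps the unclosed BufferedReader question raised earlier in the thread, at the cost of reading the whole file into memory, which seems acceptable for typ-files.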
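
The behaviour discussed in the thread, where labels that cannot be encoded in the output code page are skipped with a warning instead of aborting the whole typ-file, and where one message per language is preferred over one per label, might look something like the sketch below. It is only an illustration: Label, encodeLabels and LabelEncodeSketch are hypothetical names, not mkgmap classes, and the output layout (language byte, encoded text, zero terminator) simply mirrors the makeLabelBlock() excerpt quoted in the thread.

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch; Label and encodeLabels are illustrative names,
// not mkgmap classes or methods.
final class LabelEncodeSketch {
    static final class Label {
        final int lang;
        final String text;

        Label(int lang, String text) {
            this.lang = lang;
            this.text = text;
        }
    }

    /**
     * Writes each label as: language byte, encoded text, zero terminator.
     * Labels the charset cannot represent are skipped and their language
     * code is recorded in skippedLangs, so the caller can warn once per language.
     */
    static byte[] encodeLabels(List<Label> labels, Charset charset, Set<Integer> skippedLangs) {
        // A new encoder reports unmappable characters by throwing.
        CharsetEncoder encoder = charset.newEncoder();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (Label label : labels) {
            try {
                // Encode first, so nothing is written for a label that fails.
                ByteBuffer buf = encoder.encode(CharBuffer.wrap(label.text));
                byte[] bytes = new byte[buf.remaining()];
                buf.get(bytes);
                out.write(label.lang);
                out.write(bytes, 0, bytes.length);
                out.write(0);
            } catch (CharacterCodingException e) {
                skippedLangs.add(label.lang);
            }
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        // 0x15 is Polish according to the thread; 0x01 is just an example code.
        List<Label> labels = Arrays.asList(
                new Label(0x01, "Caf\u00e9"),
                new Label(0x15, "Ziele\u0144"),
                new Label(0x15, "Pla\u017ca"));
        Set<Integer> skipped = new TreeSet<>();
        Charset cp1252 = Charset.forName("windows-1252");
        byte[] block = encodeLabels(labels, cp1252, skipped);
        for (int lang : skipped) {
            System.err.printf("Warning: labels for language 0x%02x cannot be encoded in %s%n",
                    lang, cp1252.name());
        }
        System.out.println(block.length + " bytes written");
    }
}
```

Collecting the language codes in a set is one way to get the single warning per language suggested on 18 December; a real implementation would presumably use the mkgmap logger rather than System.err.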