[mkgmap-dev] TYP files and character encoding
From Gerd Petermann gpetermann_muenchen at hotmail.com on Tue Jan 14 09:55:39 GMT 2020
Hi Ticker,

yes, and every missing close() is a brain teaser ;)
We have a few places where files are opened and closed in different
methods. This is likely to cause trouble in unit tests, esp. on Windows.
Wherever possible we should use try-with-resources instead of
Utils.closeFile(), and add a comment like the one in SeaGenerator, in the line
    zipFile = new ZipFile(precompSeaDir); // don't close here!
when a file is intentionally kept open.

Gerd

________________________________________
From: mkgmap-dev <mkgmap-dev-bounces at lists.mkgmap.org.uk> on behalf of Ticker Berkin <rwb-mkgmap at jagit.co.uk>
Sent: Tuesday, 14 January 2020 10:43
To: Development list for mkgmap
Subject: Re: [mkgmap-dev] TYP files and character encoding

Hi Gerd

Here is an updated patch that closes the file. I find many files in
mkgmap that don't have an explicit close(), but I presume .finalize()
will close them eventually.

I'll do another patch for other text file handling, using
StandardCharsets where possible, fixing the TokenScanner message for bad
characters when the file is not utf-8 and, if reasonable, allowing a BOM
even if the file is opened as utf-8 anyway.

Ticker

On Tue, 2020-01-14 at 08:21 +0000, Gerd Petermann wrote:
> Hi Ticker,
>
> thanks for the patch.
>
> Please review TypCompiler.CharsetProbe. BufferedReader br is not
> closed. Is that intended?
>
> I see that we have a mix of "utf-8" and "UTF-8" in the mkgmap
> sources. I think it would be good to use StandardCharsets.UTF_8 where
> possible and unify the rest.
>
> Gerd
>
> ________________________________________
> From: mkgmap-dev <mkgmap-dev-bounces at lists.mkgmap.org.uk> on behalf of Ticker Berkin <rwb-mkgmap at jagit.co.uk>
> Sent: Monday, 13 January 2020 11:34
> To: Development list for mkgmap
> Subject: Re: [mkgmap-dev] TYP files and character encoding
>
> Hi Gerd
>
> I've updated this patch with changes to TypCompiler CharsetProbe:
>
> 1/ looks for a unicode BOM in various encodings near the start of the
> file.
> 2/ looks for a line containing "-*- coding: charset -*-" near the
> start of the file.
> 3/ retains the check for "CodePage=" coding for compatibility.
> 4/ in the absence of the above, sets the reading charset to utf-8 if
> the file is valid utf-8, otherwise to Cp1252.
> 5/ fixes the bad character message from the scanner to say what the
> charset really is rather than saying "utf-8" regardless.
> 6/ removes the logic that checks whether String... lines, read in the
> charset it is currently trying, can be encoded in the presumed output
> CodePage.
>
> The final result of this patch should be that:
>
> a/ No existing usage is broken.
> b/ Two methods of indicating the charset/encoding of the file that are
> commonly used by text editors can be used and are taken notice of.
> Previously, just the UTF-8 BOM was detected.
> c/ Typ files can, and should from now on, be written in utf-8.
> d/ Labels for languages not supported in the --code-page of the output
> img just generate a warning in mkgmap.log.x
>
> Ticker
>
>
> On Sat, 2019-12-21 at 16:11 +0000, Ticker Berkin wrote:
> > Hi Gerd
> >
> > Attached is a patch that:
> >
> > Doesn't use the 'CodePage=' command in the typ-file to determine the
> > output character encoding of the typ-file; rather, it uses the main
> > map encoding from the --code-page argument.
> >
> > log.warn's any typ labels that can't be encoded in the --code-page,
> > rather than just giving up with a message like:
> > > TYP file cannot be written in code page 1252
> >
> > The message:
> > > WARNING: SortCode in TYP txt file different from command line
> > > setting
> > that was written direct to system.out is changed to a log.warn, and
> > it shouldn't happen anyway now.
> >
> > For the moment, the 'CodePage=' command in the typ-file is, under
> > some circumstances, used to determine the encoding of the typ-file
> > itself and I've left this alone for compatibility with existing
> > usage. Sometime in January I'll provide a better method for this.
> >
> > Ticker
> >
> >
> > On Wed, 2019-12-18 at 19:54 +0000, Ticker Berkin wrote:
> > > Hi Gerd
> > >
> > > I think it is best to continue with the ideas for typ-files that:
> > >
> > > 1/ they can be in any character set and we just need a better way
> > > of working out the correct one - see my posting earlier today.
> > >
> > > 2/ a typ-file can include as many languages as anyone can be
> > > bothered to add, and so has to be in a character set that allows
> > > the languages to be added, implying unicode for a common one (more
> > > particularly, UTF-8).
> > >
> > > 3/ the codepage= statement should be redundant and ignored for
> > > controlling the output character set, which should be taken from
> > > the map, but its use for determining the input coding might need
> > > to be kept for a while for compatibility.
> > >
> > > 4/ the messages my hack generates should be turned into 1 warning
> > > or information message per language, or maybe suppressed
> > > altogether. If someone is generating a map with a character set
> > > that doesn't support a particular language, they really won't care
> > > that the data for other languages that have an incompatible
> > > representation with their language won't be there.
> > >
> > > Ticker
> > >
> > > On Wed, 2019-12-18 at 19:08 +0000, Gerd Petermann wrote:
> > > > Hi Ticker,
> > > >
> > > > I think I understand now why we didn't have a default typ file ;)
> > > > If I got that right, I should revert the changes in r4395, and
> > > > mkgmap should either not allow it or warn loudly when a typ file
> > > > with a different codepage is merged?
> > > > Or should we force the usage of the unicode codepage?
> > > > Or is it possible to compile mapnik.txt with cp 1252 (or any
> > > > other) in a way that only those lines which contain non-matching
> > > > characters are ignored?
> > > >
> > > > Gerd
> > > >
> > > >
> > > > ________________________________________
> > > > From: mkgmap-dev <mkgmap-dev-bounces at lists.mkgmap.org.uk> on behalf of Ticker Berkin <rwb-mkgmap at jagit.co.uk>
> > > > Sent: Wednesday, 18 December 2019 19:46
> > > > To: mkgmap development
> > > > Subject: [mkgmap-dev] TYP files and character encoding
> > > >
> > > > Hi
> > > >
> > > > A couple of problems with typ-files and unicode.
> > > >
> > > > With 'Codepage=65001' the final contents of the labels in
> > > > mapnik.typ that is included with the composite map is unicode,
> > > > but if the map is codepage 1252, the unicode characters with the
> > > > top bit set are simply displayed as if in 1252.
> > > >
> > > > Removing the codepage statement from mapnik.txt and making fixes
> > > > elsewhere to ensure that the file is read correctly as utf-8 and
> > > > then generating a map with --code-page=1252, it gives the error:
> > > >
> > > > SEVE: uk.me.parabola.imgfmt.MapFailedException
> > > > ../svn/trunk/resources/typ-files/mapnik.txt:
> > > > (thrown in TypCompiler.makeMap())
> > > > TYP file cannot be written in code page 1252
> > > >
> > > > Changing the exception handling in
> > > > imgfmt/app/typ/TypElement.java, so that makeLabelBlock() reads as
> > > > ...
> > > >         CharBuffer cb = CharBuffer.wrap(tl.getText());
> > > >         try {
> > > >             ByteBuffer buffer = encoder.encode(cb);
> > > >             out.put((byte) tl.getLang());
> > > >             out.put(buffer);
> > > >             out.put((byte) 0);
> > > >         } catch (CharacterCodingException ignore) {
> > > >             // ignore.printStackTrace();
> > > >             String name = encoder.charset().name();
> > > >             System.out.println("Cannot represent String=" +
> > > >                 tl.getLang() + "," + tl.getText() +
> > > >                 " in CodePage=" + name);
> > > >             // throw new TypLabelException(name);
> > > >         }
> > > > ...
> > > >
> > > > It gives output like:
> > > > Cannot represent String=21,Gara|e in CodePage=windows-1252
> > > > Cannot represent String=21,Obszar przemysBowy in CodePage=windows-1252
> > > > Cannot represent String=21,ZieleD in CodePage=windows-1252
> > > > Cannot represent String=21,Zaro[la in CodePage=windows-1252
> > > > Cannot represent String=21,MokradBa in CodePage=windows-1252
> > > > Cannot represent String=21,Droga wojew\363dzka (B^Ecznik) in CodePage=windows-1252
> > > > Cannot represent String=21,Droga szybkiego ruchu (B^Ecznik) in CodePage=windows-1252
> > > > Cannot represent String=21,Droga szybkiego ruchu (B^Ecznik) in CodePage=windows-1252
> > > > Cannot represent String=21,Zcie|ka rowerowa in CodePage=windows-1252
> > > > Cannot represent String=21,Wybrze|e in CodePage=windows-1252
> > > > Cannot represent String=21,Zcie|ka in CodePage=windows-1252
> > > > Cannot represent String=21,StrumieD in CodePage=windows-1252
> > > > Cannot represent String=21,Granica paDstwa in CodePage=windows-1252
> > > > Cannot represent String=21,Rzeka, KanaB in CodePage=windows-1252
> > > > Cannot represent String=21,StrumieD in CodePage=windows-1252
> > > > Cannot represent String=21,Ruroci^Eg in CodePage=windows-1252
> > > > Cannot represent String=21,Kabel wysokiego napi^Ycia in CodePage=windows-1252
> > > > Cannot represent String=21,Tor wy[cigowy in CodePage=windows-1252
> > > > Cannot represent String=21,Droga szybkiego ruchu (B^Ecznik) in CodePage=windows-1252
> > > > Cannot represent String=21,Droga krajowa (B^Ecznik) in CodePage=windows-1252
> > > > Cannot represent String=21,Droga wojew\363dzka (B^Ecznik) in CodePage=windows-1252
> > > > Cannot represent String=21,Wie[ (>5 tys.) in CodePage=windows-1252
> > > > Cannot represent String=21,Wie[ (>5 tys.) in CodePage=windows-1252
> > > > Cannot represent String=21,Restauracja (AmerykaDska) in CodePage=windows-1252
> > > > Cannot represent String=21,Restauracja (ChiDska) in CodePage=windows-1252
> > > > Cannot represent String=21,Restauracja (Mi^Ydzynarodowa) in CodePage=windows-1252
> > > > Cannot represent String=21,Restauracja (WBoska) in CodePage=windows-1252
> > > > Cannot represent String=21,Restauracja (MeksykaDska) in CodePage=windows-1252
> > > > Cannot represent String=21,Restauracja (P^Eczki) in CodePage=windows-1252
> > > > Cannot represent String=21,Restauracja (WegetariaDska) in CodePage=windows-1252
> > > > Cannot represent String=21,Kr^Ygle in CodePage=windows-1252
> > > > Cannot represent String=21,Sklep odzie|owy in CodePage=windows-1252
> > > > Cannot represent String=21,Wypo|yczalnia samochod\363w in CodePage=windows-1252
> > > > Cannot represent String=21,Gara| in CodePage=windows-1252
> > > > Cannot represent String=21,Sprzeda| samochod\363w in CodePage=windows-1252
> > > > Cannot represent String=21,Sklep |eglarski in CodePage=windows-1252
> > > > Cannot represent String=21,S^Ed in CodePage=windows-1252
> > > > Cannot represent String=21,O[rodek kultury in CodePage=windows-1252
> > > > Cannot represent String=21,Wi^Yzienie in CodePage=windows-1252
> > > > Cannot represent String=21,Stra| po|arna in CodePage=windows-1252
> > > > Cannot represent String=21,SBupek in CodePage=windows-1252
> > > > Cannot represent String=21,PrzystaD in CodePage=windows-1252
> > > > Cannot represent String=21,L^Edowisko helikopterowe in CodePage=windows-1252
> > > > Cannot represent String=21,Wie|a in CodePage=windows-1252
> > > > Cannot represent String=21,yr\363dBo in CodePage=windows-1252
> > > > Cannot represent String=21,Pla|a in CodePage=windows-1252
> > > > Cannot represent String=21,Przyl^Edek in CodePage=windows-1252
> > > > Cannot represent String=21,SkaBa in CodePage=windows-1252
> > > >
> > > > Which makes sense if codepage 1252 doesn't handle Polish (hex
> > > > 0x15, decimal 21).
> > > >
> > > > NB the non-ASCII characters in the above are messed up by my
> > > > cutting and pasting.
> > > >
> > > > Checking the French, on my Garmin device, the type descriptions
> > > > now display accents correctly.
> > > >
> > > > Ticker
> > > >
> > > > _______________________________________________
> > > > mkgmap-dev mailing list
> > > > mkgmap-dev at lists.mkgmap.org.uk
> > > > http://www.mkgmap.org.uk/mailman/listinfo/mkgmap-dev
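
For illustration, the probe order listed in the 13 January message (BOM, then an editor-style "-*- coding: charset -*-" line, then "CodePage=", then a UTF-8 validity check with a Cp1252 fallback) could be sketched roughly as below. This is not the code from the patch: the class name CharsetProbeSketch, the 20-line limit standing in for "near start of file", the regular expressions and the rule that the first declaration found wins are all assumptions. The reader is opened with try-with-resources and the charsets come from StandardCharsets where one exists, as suggested in the top message.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-in for TypCompiler.CharsetProbe; not the patch's code.
final class CharsetProbeSketch {
    // Assumed meaning of "near start of file"; the thread gives no number.
    private static final int MAX_PROBE_LINES = 20;
    private static final Pattern CODING =
            Pattern.compile("-\\*-\\s*coding:\\s*([A-Za-z0-9][\\w.:+-]*)\\s*-\\*-");
    private static final Pattern CODE_PAGE =
            Pattern.compile("^\\s*CodePage\\s*=\\s*(\\d+)", Pattern.CASE_INSENSITIVE);

    static Charset probe(byte[] data) {
        // 1/ Byte order marks (UTF-32 is left out to keep the sketch short).
        if (startsWith(data, 0xEF, 0xBB, 0xBF)) return StandardCharsets.UTF_8;
        if (startsWith(data, 0xFE, 0xFF)) return StandardCharsets.UTF_16BE;
        if (startsWith(data, 0xFF, 0xFE)) return StandardCharsets.UTF_16LE;

        // 2/ and 3/ Scan the first lines as ISO-8859-1, which maps every byte
        // to a char, so ASCII declarations are readable in any 8-bit encoding.
        // try-with-resources closes the reader on every return path.
        try (BufferedReader br = new BufferedReader(new InputStreamReader(
                new ByteArrayInputStream(data), StandardCharsets.ISO_8859_1))) {
            String line;
            for (int i = 0; i < MAX_PROBE_LINES && (line = br.readLine()) != null; i++) {
                Matcher m = CODING.matcher(line);
                if (m.find() && Charset.isSupported(m.group(1))) {
                    return Charset.forName(m.group(1));
                }
                m = CODE_PAGE.matcher(line);
                if (m.find()) {
                    if (m.group(1).equals("65001")) return StandardCharsets.UTF_8;
                    String name = "cp" + m.group(1); // e.g. cp1252
                    if (Charset.isSupported(name)) return Charset.forName(name);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }

        // 4/ No declaration: utf-8 if the bytes decode cleanly, else Cp1252.
        return isValidUtf8(data) ? StandardCharsets.UTF_8 : Charset.forName("windows-1252");
    }

    private static boolean startsWith(byte[] data, int... prefix) {
        if (data.length < prefix.length) return false;
        for (int i = 0; i < prefix.length; i++) {
            if ((data[i] & 0xFF) != prefix[i]) return false;
        }
        return true;
    }

    private static boolean isValidUtf8(byte[] data) {
        try {
            // A fresh decoder reports malformed input by throwing.
            StandardCharsets.UTF_8.newDecoder().decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] sample = "; -*- coding: utf-8 -*-\n[_id]\n".getBytes(StandardCharsets.US_ASCII);
        System.out.println(probe(sample));
    }
}
```

Returning a Charset rather than a name string means callers never have to choose between "utf-8" and "UTF-8" spellings.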
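
The behaviour discussed for makeLabelBlock(), where a label that cannot be encoded in the output code page is skipped with a warning instead of aborting the whole TYP file, might look something like this in isolation. It is only a sketch under assumptions: LabelEncodeSketch and its Label class are hypothetical stand-ins for mkgmap's own types, warnings are collected in a list rather than sent to mkgmap's logger, and 0x01 is a placeholder language code (only 0x15 for Polish comes from the thread).

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical, self-contained illustration; not mkgmap's TypElement.
final class LabelEncodeSketch {
    static final class Label {
        final int lang;
        final String text;

        Label(int lang, String text) {
            this.lang = lang;
            this.text = text;
        }
    }

    // Encodes each label as <lang byte><text bytes><0>. Labels that the
    // target charset cannot represent are left out and reported in warnings.
    static byte[] makeLabelBlock(List<Label> labels, Charset target, List<String> warnings) {
        // A new encoder reports unmappable characters by throwing.
        CharsetEncoder encoder = target.newEncoder();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (Label tl : labels) {
            try {
                // Encode first, so a failing label writes nothing at all.
                ByteBuffer buffer = encoder.encode(CharBuffer.wrap(tl.text));
                byte[] bytes = new byte[buffer.remaining()];
                buffer.get(bytes);
                out.write(tl.lang);
                out.write(bytes, 0, bytes.length);
                out.write(0);
            } catch (CharacterCodingException e) {
                warnings.add("Cannot represent String=" + tl.lang + "," + tl.text
                        + " in CodePage=" + encoder.charset().name());
            }
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        List<Label> labels = Arrays.asList(
                new Label(0x01, "Caf\u00e9"),     // placeholder language code
                new Label(0x15, "Gara\u017ce"));  // Polish, as in the thread
        List<String> warnings = new ArrayList<>();
        byte[] block = makeLabelBlock(labels, Charset.forName("windows-1252"), warnings);
        System.out.println(block.length + " bytes written");
        warnings.forEach(System.out::println);
    }
}
```

Collecting the skipped labels instead of printing each one makes it easy to collapse them into one message per language, as suggested in point 4/ of the 18 December message.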