[mkgmap-dev] New branch for default typ file
From Ticker Berkin rwb-mkgmap at jagit.co.uk on Wed Dec 18 11:06:19 GMT 2019
Hi Randolph

This topic should probably become a new thread.

You shouldn't confuse the encoding of Java source text (rules
determined by the Java language) with how a Java program reads a text
file into its internal character format (however the programmers want
to do it, though the Java library supplies converters for almost all
character sets/encodings).

I agree that the text file input processing of mkgmap should allow for
a BOM in all cases and use it to determine the correct Unicode input
decoding.

There are various possible input files, with a mix of character
set/encoding determination and BOM acceptance. From a quick look at
the various text inputs, I find:

style components: In the default style, all are pure 7-bit ASCII,
except inc/address, which contains some UTF-8 encoded characters.

road-name-config: This is read as UTF-8.

TYP: This checks for a UTF-8 BOM as the first character on a line and,
if not there, looks for a line starting with 'CodePage=' and uses what
follows, with cp65001 taken to mean UTF-8. It has some logic to default
to cp1252 and some other convolutions. There are many incorrect
assumptions in this handling, the main one being that CodePage is there
to determine the output charset, which can be determined from the main
mkgmap map options anyway.

-c options.cfg: I haven't studied the logic for this, but it probably
uses the character set/encoding determined by Java from the
environment; on unix maybe $LANG, with typical value "en_GB.UTF-8".

command line parameters: ditto

copyright/licence-file: not looked

delete-tags-file: not looked

other files: ?

Most of these areas could benefit from a unified way of determining the
input character set and encoding, but we need to beware of backward
compatibility, where users have their own components in a code page
relevant to their area.
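To make the TYP rule described above concrete, here is a minimal sketch of mapping a 'CodePage=' line to a Java charset. It is not mkgmap's actual code: the class and method names are hypothetical, and it assumes the value after 'CodePage=' is a bare number such as 1252 or 65001.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

/** Hypothetical helper for illustration; not part of mkgmap. */
final class CodePageSketch {
    private static final String BOM = "\uFEFF";
    private static final String KEY = "CodePage=";

    /**
     * Returns the charset named by a "CodePage=" line, or null when the
     * line is not a CodePage line or names a code page Java does not know.
     * Assumption: the value is a bare code page number, e.g. "1252".
     */
    static Charset fromLine(String line) {
        // Tolerate a BOM that an editor left in front of the line.
        String s = line.startsWith(BOM) ? line.substring(1) : line;
        s = s.trim();
        if (!s.regionMatches(true, 0, KEY, 0, KEY.length())) {
            return null;
        }
        String value = s.substring(KEY.length()).trim();
        if (value.equals("65001")) {
            // 65001 is the Windows code page number for UTF-8.
            return StandardCharsets.UTF_8;
        }
        try {
            // Java accepts aliases such as "cp1252" where that charset is installed.
            return Charset.forName("cp" + value);
        } catch (IllegalArgumentException e) {
            // Covers both illegal and unsupported charset names.
            return null;
        }
    }
}
```

A caller would still need to decide what to do with a null result; falling back to cp1252, as the current TYP handling reportedly does, is one option.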
I suggest something like the following, in order:

1/ Look for a BOM for any of the Unicode encodings near the start of
the file; not necessarily the first character because, without changing
the next level of the file parser, it might need to be in a comment.

2/ Look on the 1st or 2nd line for the format:
   {comment-indicator} -*- coding: {charset} -*-
where {comment-indicator} is typically a '#' and {charset}, for
Unicode, represents the encoding as well. This method is used by Python
and was common on unix systems and recognised by many text editors
before UTF-8 became ubiquitous.

3/ Default to UTF-8 or the environmental default depending on context,
to be compatible with current handling.

Ticker

On Tue, 2019-12-17 at 15:20 -0600, Randolph J. Herber wrote:
> Dear Sirs:
>
> There has been a thread of discussion of whether there should be a
> Byte Order Mark (BOM) at the beginning of a UTF-8 file.
>
> This discussion is complicated by the fact that some of the
> developers work on Unix, Linux, BSD, iOS, Solaris and Windows. These
> operating systems have UTF-8 handling libraries written at different
> times and to different Unicode standards. Originally the Unicode
> standard said that UTF-8 should not have a BOM character at the
> beginning of a file. Later Unicode changed the standard so that a BOM
> is permissible, not required and not recommended. Microsoft added a
> BOM to the beginning of UTF-8 files before doing so was permissible,
> to ease the problem of recognizing a UTF-8 file. This broke the other
> operating systems' handling of UTF-8. Microsoft petitioned for the
> permissibility of a BOM to avoid changing their file handling.
>
> At this time, I believe that all programs should use Unicode and not
> Microsoft code pages. I have had problems with Microsoft code pages
> since MSDOS days.
>
> Splitter and mkgmap are written in Java. Java still follows the
> original Unicode standard of no BOM at the beginning of a UTF-8 text
> file.
> This is a "not to be fixed" situation per the Java language
> developers. This situation results in problems with Java,
> particularly in a Microsoft Windows environment.
>
> The code fragments below provide Java solutions to writing a BOM at
> the beginning of UTF-8 text files so that Microsoft native text
> editors can handle them and, on reading a text file, provide an
> automatic way of ignoring an optional BOM by checking for the BOM
> after file opening.
>
> A test for execution in a Windows environment is provided below if
> one decides to add a BOM only on Microsoft Windows.
>
> I have not downloaded the splitter and mkgmap sources and searched
> for the appropriate places in their sources to apply the changes. I
> feel the main splitter and mkgmap developers are better placed to
> make these changes. This is the reason that I did not provide patches
> to the sources.
>
> Randolph J. Herber.
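
The detection order suggested in the reply above (BOM, then a "-*- coding: ... -*-" comment, then a default) could be sketched roughly as below. This is an illustration only, not code from mkgmap or splitter: the class and method names are hypothetical, the BOM is checked only at the very first bytes rather than "near the start", UTF-32 BOMs are left out, and the coding comment is only found when the file is in an ASCII-compatible encoding.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper for illustration; not part of mkgmap or splitter. */
final class InputCharsetSketch {
    // Matches e.g. "# -*- coding: utf-8 -*-" and captures the charset name.
    private static final Pattern CODING =
            Pattern.compile("-\\*-\\s*coding:\\s*([-\\w.]+)\\s*-\\*-");

    /**
     * Picks a charset from the first bytes of a file.
     * head: the leading bytes of the file (a few hundred is plenty).
     * fallback: what to use when nothing is declared (step 3).
     */
    static Charset detect(byte[] head, Charset fallback) {
        // Step 1: a BOM at the very start decides the encoding.
        Charset fromBom = fromBom(head);
        if (fromBom != null) {
            return fromBom;
        }
        // Step 2: look for a coding comment on the 1st or 2nd line.
        // ISO-8859-1 maps every byte to one char, so decoding cannot fail.
        String text = new String(head, StandardCharsets.ISO_8859_1);
        String[] lines = text.split("\r\n|\r|\n", 3);
        for (int i = 0; i < Math.min(2, lines.length); i++) {
            Matcher m = CODING.matcher(lines[i]);
            if (m.find()) {
                try {
                    return Charset.forName(m.group(1));
                } catch (IllegalArgumentException e) {
                    return fallback; // declared charset unknown to this JVM
                }
            }
        }
        // Step 3: nothing declared, use the caller's default.
        return fallback;
    }

    /** Returns the charset implied by a leading BOM, or null if none. */
    static Charset fromBom(byte[] head) {
        if (startsWith(head, 0xEF, 0xBB, 0xBF)) return StandardCharsets.UTF_8;
        if (startsWith(head, 0xFE, 0xFF)) return StandardCharsets.UTF_16BE;
        if (startsWith(head, 0xFF, 0xFE)) return StandardCharsets.UTF_16LE;
        return null;
    }

    private static boolean startsWith(byte[] data, int... prefix) {
        if (data.length < prefix.length) return false;
        for (int i = 0; i < prefix.length; i++) {
            if ((data[i] & 0xFF) != prefix[i]) return false;
        }
        return true;
    }
}
```

Passing the fallback in as a parameter keeps the backward-compatibility question open: a caller reading road-name-config could pass UTF-8, while one reading options.cfg could pass Charset.defaultCharset().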