Path: news.daimi.aau.dk!not-for-mail
From: John Cowan
Newsgroups: comp.lang.beta
Subject: Loki Paper 2: Character Sets
Date: Thu, 19 Mar 1998 21:34:24 +0100 (MET)
Organization: DAIMI, Computer Science Dept. at Aarhus University
Lines: 57
Approved: mailtonews@daimi.aau.dk
Distribution: world
Message-ID: <199803192034.VAA21366@noatun.mjolner.dk>
Reply-To: John Cowan
NNTP-Posting-Host: daimi.daimi.aau.dk
Xref: news.daimi.aau.dk comp.lang.beta:11470

The character set used for Java programming is a large subset of the
16-bit Unicode (ISO 10646) character set. The Mjolner compiler allows
only ASCII characters. The ASCII character set is embedded in Unicode,
as is the Latin-1 (ISO 8859-1) character set, so that upward
compatibility is maintained. This allows non-English-speaking
programmers to write identifier names, comments, and text strings
belonging to their own languages. Loki will extend this ability to Beta
programmers as well. A trivial change to the Mjolner compiler (not
involving extending it to Unicode!) will permit easy interchange between
Mjolner and Loki Beta programs.

The Java compiler accepts programs in one of two transformation formats:
UTF-8 and Unicode escape mode. Both of these have the useful property
that ASCII characters are represented by themselves, so that ASCII-only
programs are immediately compatible.

UTF-8 is sufficiently documented elsewhere (see http://www.unicode.org),
and I will simply say that Loki will accept it.

Unicode escape mode is more interesting. Every character outside the
ASCII range is represented by the sequence "\uxxxx" where "xxxx" is four
hexadecimal digits. These sequences are interpreted immediately on
reading in the source code, and thus they may be used anywhere: in
identifiers, comments, or strings.

It is legal in Java to use values of "xxxx" that represent an ASCII
character (0000-007f), but I propose to forbid this usage in Beta code.
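
The escape-mode rule described above, together with the proposed ban on
ASCII-range escapes, could be sketched roughly as follows. This is only
an illustration in Python, not code from Loki or the Mjolner compiler:
the function name decode_unicode_escapes is hypothetical, and the sketch
ignores finer points of Java's own escape processing.

```python
import re

# Hypothetical helper for illustration only; not part of Loki or Mjolner.
# Matches a backslash, a lowercase "u", and exactly four hex digits.
_ESCAPE = re.compile(r"\\u([0-9a-fA-F]{4})")


def decode_unicode_escapes(source: str) -> str:
    # Replace every escape with the character it names. Because this
    # runs over the raw source text, escapes are handled the same way
    # in identifiers, comments, and strings.
    def replace(match: re.Match) -> str:
        code = int(match.group(1), 16)
        # Proposed Beta restriction: escapes that name an ASCII
        # character (0000-007f) are rejected rather than decoded.
        if code <= 0x7F:
            raise ValueError(
                f"escape {match.group(0)} names an ASCII character"
            )
        return chr(code)

    return _ESCAPE.sub(replace, source)
```

Under these assumptions, decode_unicode_escapes(r"Mj\u00f8lner") returns
a 7-character string, while an input containing "\u002c" raises
ValueError instead of quietly becoming a comma.
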
To the Java compiler, "\u002c" is equivalent to a comma in every way: it
can be used to separate arguments in a method call or for any other
purpose. This usage makes for nothing but confusion to the reader.

Provided with the Java Development Kit (and therefore easily available)
is a tool called "native2ascii" written in Java. This tool accepts input
files written in any of a variety of character sets and outputs them in
Unicode-escape-sequence form. This permits programmers to write Java
source in the most suitable character set (Latin-1 for Western
Europeans, Shift-JIS for Japanese, or whatever) and automatically
transform it into a form suitable for the Java compiler. The same tool
would be usable for Beta source.

The character "\" is not used in the Beta language at present. If it
were to be defined to the Mjolner compiler as an alphabetic character,
then identifiers like "Mj\u00f8lner" would be acceptable to it. Although
this sequence would appear as a 12-character identifier to the Mjolner
compiler and a 7-character identifier to the Loki compiler, source code
could still be passed between the two systems with few problems. Only
strings containing Unicode escapes would be an issue for the Mjolner
system, which has no run-time support for them anyway.

The problem then arises of what to do with Mjolner source code that uses
"\" in non-standard ways; I propose not to worry about the issue at
present. Ideally, the Mjolner compiler would check that every "\" was
followed by a "u" character and exactly four hexadecimal digits in the
range 00a0-00fd.

-- 
John Cowan	cowan@ccil.org
	e'osai ko sarji la lojban.