Path: news.daimi.aau.dk!not-for-mail
From: John Cowan <cowan@locke.ccil.org>
Newsgroups: comp.lang.beta
Subject: Loki Paper 2: Character Sets
Date: Thu, 19 Mar 1998 21:34:24 +0100 (MET)
Organization: DAIMI, Computer Science Dept. at Aarhus University
Lines: 57
Approved: mailtonews@daimi.aau.dk
Distribution: world
Message-ID: <199803192034.VAA21366@noatun.mjolner.dk>
Reply-To: John Cowan <cowan@locke.ccil.org>
NNTP-Posting-Host: daimi.daimi.aau.dk
Xref: news.daimi.aau.dk comp.lang.beta:11470

The character set used for Java programming is a large subset of the
16-bit Unicode (ISO 10646) character set.  The Mjolner compiler allows
only ASCII characters.  The ASCII character set is embedded in Unicode,
as is the Latin-1 (ISO 8859-1) character set, so that upward
compatibility is maintained.  Java's use of Unicode allows
non-English-speaking programmers to write identifier names, comments,
and text strings in their own languages.  Loki will extend this
ability to Beta
programmers as well.  A trivial change to the Mjolner compiler (not
involving extending it to Unicode!) will permit easy interchange between
Mjolner and Loki Beta programs.

The Java compiler accepts programs in one of two transformation formats:
UTF-8 and Unicode escape mode.  Both of these have the useful property
that ASCII characters are represented by themselves, so that ASCII-only
programs are immediately compatible.  UTF-8 is sufficiently documented
elsewhere (see http://www.unicode.org), and I will simply say that Loki
will accept it.
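
To make the "ASCII characters are represented by themselves" property
concrete, here is a small Python check.  The sample text is arbitrary
and is my own illustration, not Beta or Java source.

```python
# An ASCII-only text has exactly the same bytes in ASCII and in UTF-8.
ascii_text = "Loki accepts plain ASCII source"
utf8_bytes = ascii_text.encode("utf-8")
ascii_bytes = ascii_text.encode("ascii")
same = utf8_bytes == ascii_bytes

# A non-ASCII character (here U+00F8, o with stroke) takes more than
# one byte in UTF-8, and none of those bytes is in the ASCII range.
non_ascii = "\u00f8".encode("utf-8")
all_high = all(byte >= 0x80 for byte in non_ascii)
```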

Unicode escape mode is more interesting.  Every character outside the
ASCII range is represented by the sequence "\uxxxx" where "xxxx" is four
hexadecimal digits.  These sequences are interpreted immediately on
reading in the source code, and thus they may be used anywhere: in
identifiers, comments, or strings.  It is legal in Java to use values of
"xxxx" that represent an ASCII character (0000-007f), but I propose to
forbid this usage in Beta code.  To the Java compiler, "\u002c" is
equivalent to a comma in every way: it can be used to separate arguments
in a method call or for any other purpose.  Such usage does nothing but
confuse the reader.
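
A minimal sketch in Python of the escape handling proposed above,
assuming a single pass over the source text before any other
processing.  The name decode_escapes is hypothetical, and the sketch
deliberately ignores Java's finer rules (repeated "u" characters and
doubled backslashes); it is not taken from any Loki or Java code.

```python
import re

# Matches a backslash, the letter u, and exactly four hex digits.
_ESCAPE = re.compile(r"\\u([0-9a-fA-F]{4})")

def decode_escapes(source: str) -> str:
    """Replace each \\uxxxx with its character; reject ASCII values."""
    def replace(match):
        code = int(match.group(1), 16)
        if code <= 0x7F:
            # Java would accept this; the proposal for Beta forbids it.
            raise ValueError(
                f"escape {match.group(0)} denotes an ASCII character")
        return chr(code)
    return _ESCAPE.sub(replace, source)
```

With this, decode_escapes applied to the text Mj\u00f8lner yields the
7-character name, while an input containing \u002c is rejected rather
than silently turned into a comma.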

Provided with the Java Development Kit (and therefore easily available)
is a tool called "native2ascii" written in Java.  This tool accepts
input files written in any of a variety of character sets and outputs
them in Unicode-escape-sequence form.  This permits programmers to write
Java source in the most suitable character set (Latin-1 for Western
Europeans, Shift-JIS for Japanese, or whatever) and automatically
transform it into a form suitable for the Java compiler.  The same tool
would be usable for Beta source.
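
For readers without the JDK at hand, the kind of conversion described
can be approximated in a few lines of Python.  This is only a sketch of
the idea, not native2ascii itself: the function name is hypothetical,
and rejecting characters beyond 16 bits is my own simplifying
assumption.

```python
def to_unicode_escapes(raw: bytes, encoding: str) -> str:
    """Decode bytes from a native encoding and escape non-ASCII chars."""
    text = raw.decode(encoding)
    out = []
    for ch in text:
        code = ord(ch)
        if code < 0x80:
            out.append(ch)                # ASCII passes through unchanged
        elif code <= 0xFFFF:
            out.append(f"\\u{code:04x}")  # four hex digits, lower case
        else:
            # Assumption: only 16-bit Unicode is handled in this sketch.
            raise ValueError(f"character U+{code:X} is beyond 16 bits")
    return "".join(out)
```

The encoding argument is any codec name Python knows, for example
"latin-1" or "shift_jis".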

The character "\" is not used in the Beta language at present.  If it
were to be defined to the Mjolner compiler as an alphabetic character,
then identifiers like "Mj\u00f8lner" would be acceptable to it.
Although this sequence would appear as a 12-character identifier to the
Mjolner compiler and a 7-character identifier to the Loki compiler,
source code could still be passed between the two systems with few
problems.  Only strings containing Unicode escapes would be an issue for
the Mjolner system, which has no run-time support for them anyway.
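
The two views of the same identifier can be checked directly.  In this
Python illustration (variable names are mine), the Mjolner view is
simply the raw text with "\" counted as a letter, and the Loki view is
the text after escape processing.

```python
import re

raw_identifier = "Mj\\u00f8lner"   # the characters as they sit in the file

# Mjolner view: "\" is just another alphabetic character.
as_seen_by_mjolner = raw_identifier

# Loki view: the escape is replaced by the character it denotes.
as_seen_by_loki = re.sub(
    r"\\u([0-9a-fA-F]{4})",
    lambda m: chr(int(m.group(1), 16)),
    raw_identifier,
)

print(len(as_seen_by_mjolner))  # 12
print(len(as_seen_by_loki))     # 7
```

Since both compilers see the same spelling everywhere the identifier is
used, each resolves it consistently, whatever length it believes the
name to have.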

The problem then arises of what to do with Mjolner source code that uses
"\" in non-standard ways; I propose not to worry about the issue at
present.  Ideally, the Mjolner compiler would check that every "\" was
followed by a "u" character and exactly four hexadecimal digits in the
range 00a0-00fd.
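
The check described could look roughly like the following Python
sketch.  The function name and the report format are hypothetical; the
default range is the one given above, passed as parameters so that it
can be adjusted.

```python
import re

def check_backslashes(source, low=0x00A0, high=0x00FD):
    """Return (offset, message) pairs for every unacceptable backslash."""
    problems = []
    for match in re.finditer(r"\\", source):
        pos = match.start()
        # The five characters after the backslash must be "u" + 4 hex digits.
        tail = source[pos + 1 : pos + 6]
        escape = re.fullmatch(r"u([0-9a-fA-F]{4})", tail)
        if escape is None:
            problems.append((pos, "not followed by 'u' and four hex digits"))
        elif not low <= int(escape.group(1), 16) <= high:
            problems.append((pos, "escape outside the permitted range"))
    return problems
```

An empty result means every "\" in the text introduces a well-formed
escape within the permitted range.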

-- 
John Cowan					cowan@ccil.org
		e'osai ko sarji la lojban.