To: emacs@naggum.no
Subject: Emacs sans MULE
MIME-Version: 1.0
Content-type: text/plain; charset=iso-8859-1
Content-transfer-encoding: 8bit
From: nisse@lysator.liu.se (Niels Möller)
Date: 22 Sep 1997 04:38:37 +0200
Message-ID:
X-Mailer: Gnus v5.4.59/Emacs 19.34
Lines: 69
Xref: tindra.lysator.liu.se archive:886

Hello. I've just read your Emacs pages at http://www.naggum.no, in
particular about getting rid of MULE. I'm afraid I won't have enough
time or competence to write much code, but perhaps you'll hear my
input anyway.

I've only tried emacs-20 with MULE a little, a few weeks back when I
needed to mix source code and Russian text. I quite liked the surface
of it; the language environments, coding systems and input
translations seemed to do what I needed. Thus, I would not oppose MULE
only because it's a little slower and a little bigger than emacs-19,
if only it solved the character set problems cleanly. But apparently
it does not. It assigns completely non-standard codes to latin1
characters, and it breaks the length function badly.

As I see it, the Right way to handle larger character sets in emacs is
to use the following quite simple abstractions:

× A character is a character is an integer, always in Unicode or the
corresponding ISO standard: 16 or 24 bits (32 is not suitable, as it's
larger than a lisp integer on most machines). I don't see any reason
to make characters more abstract than this, unless you want characters
to have text properties, or something like that.

× An eight-bit character set is a mapping of the codes 0-255 into
Unicode and back. There is no fixed set of supported character sets;
all 65536^256 such character sets are possible. The main use for
character sets is when reading or writing strings to the outside world
(i.e. the coding systems of emacs-20).
The default character set for eight-bit lisp source should be latin1
(although it should be possible to use other eight-bit character sets,
or Unicode, if one so wishes, and symbol names should be Unicode
internally).

× By default, buffers and strings are sequences of Unicode characters.
This makes arithmetic on string lengths and buffer positions work
efficiently and as expected, not in the broken way MULE does it.

× As an optimization, strings and buffers that consist exclusively of
characters present in one particular eight-bit character set could be
stored using eight bits per character. Such a string or buffer is
associated with a character set. I would prefer all translations
between this optimized representation and Unicode to be completely
transparent. Inserting a string into a buffer automatically translates
the string into the character set used for the buffer, or signals an
error if that is not possible. Indexing a string returns a Unicode
integer. I realize that there may be reasons to allow the optimized
representation to be accessed, but I think that any needed functions
to do that can be added without much trouble. And regardless of the
character size used, buffers and strings are *always* sequences of
characters, not bytes.

To me, the first three points are the most important. Optimization of
buffers needing only eight bits per character would be nice, and is
certainly possible, but is not of primary importance.

I would think that these low-level primitives are all that are needed
to do the high-level stuff of MULE properly. Question is, how much
work would it take?

For the more long-term future of emacs, I think a merge with guile is
also an attractive alternative, even if guile currently lacks a decent
compiler.

Best regards,
/Niels Möller
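[Editorial illustration, not part of the original message: the two core
abstractions above — a character is a Unicode integer, and an eight-bit
character set is a mapping of the codes 0-255 into Unicode and back —
can be sketched roughly as follows. The class and names are
hypothetical, chosen for the sketch; Python stands in for whatever the
actual implementation language would be.]

```python
class EightBitCharset:
    """An eight-bit character set: codes 0-255 <-> Unicode code points."""

    def __init__(self, name, to_unicode):
        # to_unicode gives, for each of the 256 codes, a Unicode integer.
        assert len(to_unicode) == 256
        self.name = name
        self.to_unicode = list(to_unicode)
        self.from_unicode = {u: c for c, u in enumerate(to_unicode)}

    def decode(self, raw):
        """Bytes from the outside world -> list of Unicode integers."""
        return [self.to_unicode[b] for b in raw]

    def encode(self, chars):
        """Unicode integers -> bytes; an error if not representable,
        as proposed for inserting a string into a narrower buffer."""
        try:
            return bytes(self.from_unicode[u] for u in chars)
        except KeyError as e:
            raise ValueError(
                "character U+%04X not in %s" % (e.args[0], self.name))


# latin1 is the identity mapping on 0-255, which is what makes it a
# natural default for eight-bit lisp source:
LATIN1 = EightBitCharset("latin1", range(256))

s = LATIN1.decode(b"M\xf6ller")
assert len(s) == 6                     # length counts characters, not bytes
assert LATIN1.encode(s) == b"M\xf6ller"
```

Note how the string, once decoded, is just a sequence of integers, so
`len` and indexing behave uniformly no matter which character set the
bytes came from.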