To: emacs@naggum.no
Subject: Emacs sans MULE
MIME-Version: 1.0
Content-type: text/plain; charset=iso-8859-1
Content-transfer-encoding: 8bit
From: nisse@lysator.liu.se (Niels Möller)
Date: 22 Sep 1997 04:38:37 +0200
Message-ID:
X-Mailer: Gnus v5.4.59/Emacs 19.34
Lines: 69
Xref: tindra.lysator.liu.se archive:886

Hello. I've just read your Emacs pages at http://www.naggum.no, in
particular about getting rid of MULE. I'm afraid I won't have enough
time or competence to write much code, but perhaps you'll hear my
input anyway.

I've only tried emacs-20 with MULE a little, a few weeks back when I
needed to mix source code and Russian text. I quite liked the surface
of it; the language environments, coding systems and input
translations seemed to do what I needed. Thus, I would not oppose MULE
only because it's a little slower and a little bigger than emacs-19,
if only it solved the character set problems cleanly. But apparently
it does not. It assigns completely non-standard codes to latin1
characters, and it breaks the length function badly.

As I see it, the Right way to handle larger character sets in emacs is
to use the following quite simple abstractions:

× A character is a character is an integer, always in Unicode or the
corresponding ISO standard: 16 or 24 bits (32 is not suitable, as it's
larger than a lisp integer on most machines). I don't see any reason
to make characters more abstract than this, unless you want characters
to have text properties, or something like that.

× An eight-bit character set is a mapping of the codes 0-255 into
Unicode and back. There is no fixed set of supported character sets;
all 65536^256 such character sets are possible. The main use for
character sets is when reading or writing strings to the outside world
(i.e. the coding systems of emacs-20).
The default character set for eight-bit lisp source should be latin1
(although it should be possible to use other eight-bit character sets,
or Unicode, if one so wishes, and symbol names should be Unicode
internally).

× By default, buffers and strings are sequences of Unicode characters.
This makes arithmetic on string lengths and buffer positions work
efficiently and as expected, not in the broken way MULE does it.

× As an optimization, strings and buffers that consist exclusively of
characters present in one particular eight-bit character set could be
stored using eight bits per character. Such a string or buffer is
associated with a character set. I would prefer all translations
between this optimized representation and Unicode to be completely
transparent. Inserting a string into a buffer automatically translates
the string into the character set used for the buffer, or signals an
error if that is not possible. Indexing a string returns a Unicode
integer. I realize that there may be reasons to allow the optimized
representation to be accessed, but I think that any needed functions
to do that can be added without much trouble. And regardless of the
character size used, buffers and strings are *always* sequences of
characters, not bytes.

To me, the first three points are the most important. Optimization of
buffers needing only eight bits per character would be nice, and is
certainly possible, but is not of primary importance.

I would think that these low-level primitives are all that are needed
to do the high-level stuff of MULE properly. Question is, how much
work would it take?

For the more long-term future of emacs, I think a merge with guile is
also an attractive alternative, even if guile currently lacks a decent
compiler.

Best regards,
/Niels Möller
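[Editorial illustration, not part of the original message: the two core
abstractions above — a character is a Unicode integer, and an eight-bit
character set is a mapping of the codes 0-255 into Unicode and back —
can be sketched roughly as follows. The class and names are
hypothetical, chosen for the sketch; Python stands in for whatever the
actual implementation language would be.]

```python
class EightBitCharset:
    """An eight-bit character set: codes 0-255 <-> Unicode code points."""

    def __init__(self, name, to_unicode):
        # to_unicode gives, for each of the 256 codes, a Unicode integer.
        assert len(to_unicode) == 256
        self.name = name
        self.to_unicode = list(to_unicode)
        self.from_unicode = {u: c for c, u in enumerate(to_unicode)}

    def decode(self, raw):
        """Bytes from the outside world -> list of Unicode integers."""
        return [self.to_unicode[b] for b in raw]

    def encode(self, chars):
        """Unicode integers -> bytes; an error if not representable,
        as proposed for inserting a string into a narrower buffer."""
        try:
            return bytes(self.from_unicode[u] for u in chars)
        except KeyError as e:
            raise ValueError(
                "character U+%04X not in %s" % (e.args[0], self.name))


# latin1 is the identity mapping on 0-255, which is what makes it a
# natural default for eight-bit lisp source:
LATIN1 = EightBitCharset("latin1", range(256))

s = LATIN1.decode(b"M\xf6ller")
assert len(s) == 6                     # length counts characters, not bytes
assert LATIN1.encode(s) == b"M\xf6ller"
```

Note how the string, once decoded, is just a sequence of integers, so
`len` and indexing behave uniformly no matter which character set the
bytes came from.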