valid Unicode character treatment?
Hi,

I've been wondering about several things with regard to Unicode/utf-8. What is the situation in fonts (of any type) in general concerning the non-existing Unicode codepoints corresponding to the 16-bit surrogate codes D800-DFFF? Are there any fonts actually putting stuff there, if only ligatures? A compliant utf-8 file is not supposed to contain any codes in that area, so we would not want to have them appear as part of "overfull hbox" messages and similar, if I am not mistaken.

I am currently thinking about a utf-8 strategy that would be least prone to causing internal inconsistencies. Basically, I think that certain properties of _legal_ utf-8 should be guaranteed inside of LuaTeX: that the number of characters in a string equals the number of bytes outside of the 80-BF code range, that characters are encoded with minimal length, that the number of bytes never exceeds 4 times the number of characters, and similar things.

I'd tend to move the special "output in byte-sized chunks" characters to "11xxxx: after all, fonts may contain stuff in the "10ffxx area, and overfull box messages will output those characters. Those can be represented internally by (basically out-of-range) utf-8 sequences in the obvious way. If the input reader for utf-8 cranks out the corresponding "output in byte-sized chunks" characters for illegal utf-8 byte sequences, then inputting them accidentally will usually lead to "missing character" errors, but it will be possible to write stuff like \message{^^11xxxx} to produce verbatim output, and \message{illegal byte sequence} will reproduce the illegal byte sequence unchanged, without having any illegal byte sequence (apart from the codes for "11xxxx) present in the innards of LuaTeX.

I'd think it reasonable not to permit those "11xxxx characters into the normal character arrays (lccode, uccode, chardef ...), in a manner similar to how the codes from "80 to "ff were treated in TeX-2.x (which had 7-bit arrays inside, but accepted 256 characters in fonts and input).

When I write "11xxxx instead of "1100xx, it is because I don't yet have a clear idea about whether or how one would bother thinking about transparent word output when using UTF-16 (which has surrogate characters and stuff). Possibly one should just completely forget about facilitating UTF-16 output, whether through callbacks or otherwise.

Ok, this is just a sketch (I find the prospect disturbing of having to code without being able to rely internally on legal utf-8 sequences unless every callback involved is bug-free), but the main question of this posting was how surrogate code points are treated in fonts.

Thanks,
David

--
David Kastrup
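As a rough illustration of the input-reader idea sketched above (not LuaTeX's actual code), the following Python sketch decodes a byte string so that every byte which is not part of a legal utf-8 sequence becomes a code just past the Unicode range, and writes such codes back out as single bytes. The names `RAW_BASE`, `decode_with_raw_bytes` and `encode_for_output` are made up for this example, and placing the escape codes at exactly "110000 plus the byte value is an assumption based on the "11xxxx proposal.

```python
RAW_BASE = 0x110000  # assumption: first code past the Unicode range ("10FFFF)


def _lead_info(b0):
    """Return (continuation bytes, smallest legal code point, payload bits)."""
    if 0xC0 <= b0 <= 0xDF:
        return 1, 0x80, b0 & 0x1F
    if 0xE0 <= b0 <= 0xEF:
        return 2, 0x800, b0 & 0x0F
    if 0xF0 <= b0 <= 0xF7:
        return 3, 0x10000, b0 & 0x07
    return 0, 0, 0  # stray continuation byte or F8..FF


def decode_with_raw_bytes(data: bytes) -> list:
    """Decode utf-8; illegal bytes become RAW_BASE + byte, one code per byte."""
    out = []
    i = 0
    while i < len(data):
        b0 = data[i]
        if b0 < 0x80:  # plain ASCII
            out.append(b0)
            i += 1
            continue
        need, lowest, cp = _lead_info(b0)
        tail = data[i + 1:i + 1 + need]
        ok = need > 0 and len(tail) == need and all(0x80 <= t <= 0xBF for t in tail)
        if ok:
            for t in tail:
                cp = (cp << 6) | (t & 0x3F)
            # reject overlong forms, surrogates and values above "10FFFF
            ok = lowest <= cp <= 0x10FFFF and not (0xD800 <= cp <= 0xDFFF)
        if ok:
            out.append(cp)
            i += 1 + need
        else:
            out.append(RAW_BASE + b0)  # escape just this one byte
            i += 1
    return out


def encode_for_output(codes) -> bytes:
    """Write codes back out: escape codes as single bytes, the rest as utf-8."""
    out = bytearray()
    for c in codes:
        if c >= RAW_BASE:
            out.append(c - RAW_BASE)
        else:
            out += chr(c).encode("utf-8")
    return bytes(out)


if __name__ == "__main__":
    junk = b"a\xc3\xa9\xff\xc0\x80"
    codes = decode_with_raw_bytes(junk)
    print([hex(c) for c in codes])
    print(encode_for_output(codes) == junk)
```

With this arrangement the internal representation only ever holds integers produced by the decoder, while writing the codes back out reproduces the original bytes, illegal ones included.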
Hi, David Kastrup wrote:
Hi, I've been wondering about several things with regard to Unicode/utf-8.
What is the situation in fonts (of any type) in general concerning the non-existing Unicode codepoints corresponding to the 16-bit surrogate codes D800-DFFF? Are there any fonts actually putting stuff there, if only ligatures?
I don't think I have ever seen one, but there could be. There is nothing in a 16-bit encoded font that forces it to use Unicode, after all, just as an 8-bit encoding does not enforce ASCII. In any case, the overfull messages should not output UTF-8 sequences, because what they report are glyphs, not characters, and that has been the case ever since 7-bit TeX82. I intend to switch to a numeric representation for all glyphs that do not have a Unicode code point assigned, and that should finally fix the whole issue.
A compliant utf-8 file is not supposed to contain any codes in that area, so we would not want to have them appear as part of "overfull hbox" messages and similar if I am not mistaken.
But should we really bother testing against that? Who will care, except people that want theoretical perfection? I certainly don't. As long as a UTF-8 sequence can be transformed to an integer that fits the acceptable range, that is good enough.
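To make the trade-off concrete, here is a small Python sketch of a decoder that applies only the check described above: the sequence must have the utf-8 shape, and the resulting integer must fit the range. The function name `decode_range_only` is hypothetical, and taking "acceptable range" to mean 0 through "10FFFF is an assumption.

```python
def decode_range_only(seq: bytes) -> int:
    """Decode exactly one utf-8-shaped sequence, checking only shape and range."""
    b0 = seq[0]
    if b0 < 0x80:
        need, cp = 0, b0
    elif b0 >> 5 == 0b110:
        need, cp = 1, b0 & 0x1F
    elif b0 >> 4 == 0b1110:
        need, cp = 2, b0 & 0x0F
    elif b0 >> 3 == 0b11110:
        need, cp = 3, b0 & 0x07
    else:
        raise ValueError("not a lead byte")
    tail = seq[1:]
    if len(tail) != need or any(not 0x80 <= t <= 0xBF for t in tail):
        raise ValueError("malformed sequence")
    for t in tail:
        cp = (cp << 6) | (t & 0x3F)
    if cp > 0x10FFFF:  # assumption: this is the "acceptable range"
        raise ValueError("out of range")
    return cp  # note: no test for surrogates or overlong encodings


if __name__ == "__main__":
    print(hex(decode_range_only(b"\xed\xa0\x80")))  # a surrogate code point
    print(hex(decode_range_only(b"\xc0\xaf")))      # overlong form of "/"
```

Under this rule both a surrogate and a two-byte encoding of "/" are accepted, so one character can have more than one byte representation. That is the kind of inconsistency the minimal-length property is meant to rule out; Python's own strict decoder rejects both inputs.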
I'd tend to move the special "output in byte-sized chunks" characters to "11xxxx: after all, fonts may contain stuff in the "10ffxx area,
Yes, you are right, they could. Switching to something that is completely out-of-range is not such a bad idea.
Ok, this is just a sketch (I consider the prospect disturbing of having to code without being able to rely internally on legal utf-8 sequences as long as possibly involved callbacks are bug-free), but the
Junk in, junk out. LuaTeX is not a file format validator, but a typesetting engine. That is what I think, anyway.
main question of this posting was how surrogate code points are treated in fonts.
The input has characters, and these are transformed into glyphs. The input should adhere to UTF-8 conventions, but the font encoding doesn't have to. Even if this is not all true *right now*, it will be so before the official release. Is that clear enough? Best, Taco
Taco Hoekwater
David Kastrup wrote:
Hi, I've been wondering about several things with regard to Unicode/utf-8.
[...] Thanks for the explanations. Certainly reassuring.
Junk in, junk out. LuaTeX is not a file format validator, but a typesetting engine. That is what I think, anyway.
I have no problem with "junk in, junk out": after all, that was basically what my proposal of turning invalid input bytes into the transparent output characters intended to achieve. What I wanted to avoid is "junk in, crash out". Or "junk in, anything may happen". After all, "anything may happen" can imply a security risk. It would be nice if the output of Lua callbacks could not cause memory corruption or similar. If LuaTeX were to work only with utf-8 sequences it had generated itself from character codes (or some equivalent process providing those minimal guarantees about the byte sequences passed into LuaTeX that LuaTeX needs for efficient operation), this would be helpful. I know that one can crash a TeX executable with things like \def~{\if~}~ but those "just" cause a stack overflow and don't form a security risk on typical architectures (some DOS TeXs protect explicitly against this IIRC). -- David Kastrup
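The idea of only working with sequences generated from character codes can be sketched in Python as follows. This is an illustration under stated assumptions, not LuaTeX code: `encode_code_point` and `count_chars` are hypothetical names, and the encoder here accepts only real Unicode scalar values (it leaves out any out-of-range escape codes).

```python
def encode_code_point(cp: int) -> bytes:
    """Encode one code point as minimal-length utf-8, refusing illegal values."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not a Unicode scalar value: %r" % (cp,))
    if cp < 0x80:
        return bytes([cp])
    if cp < 0x800:
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


def count_chars(buf: bytes) -> int:
    """Number of characters = number of bytes outside the 80-BF range."""
    return sum(1 for b in buf if not 0x80 <= b <= 0xBF)


if __name__ == "__main__":
    codes = [0x41, 0xE9, 0x20AC, 0x1F600]
    buf = b"".join(encode_code_point(c) for c in codes)
    # Both properties hold by construction for anything built this way.
    print(count_chars(buf) == len(codes))
    print(len(buf) <= 4 * count_chars(buf))
```

Whatever a callback returns would be converted to character codes first and then pushed through an encoder of this kind, so buggy callback output could produce wrong characters but not byte sequences that break the assumptions of the internal code.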
participants (2)
- David Kastrup
- Taco Hoekwater