Dear Hans and Taco,

This may be of general interest to the Europeans (and indirectly relates to the \sh@ft email):

I need the following:

LATIN CAPITAL LETTER L WITH TILDE;004C 0303
LATIN SMALL LETTER L WITH TILDE;006C 0303

The proposal is still under consideration for Lithuanian and not yet in Unicode.

In luatex can I make a definition such that the string

U004C U0303 (l ̃)

is always treated as l with tilde above, taking into account italics and without using \~l (which does not work in, e.g., a footnote)?

Best wishes
Idris
--
Professor Idris Samawi Hamid, Editor-in-Chief
International Journal of Shi`i Studies
Department of Philosophy
Colorado State University
Fort Collins, CO 80523
--
Using Opera's revolutionary e-mail client: http://www.opera.com/mail/
Hello Idris, I didn't see any reply to this e-mail you sent two weeks ago, so I wanted to give it a try:
In luatex can I make a definition such that the string
U004C U0303 (l ̃)
is always treated as l with tilde above, taking into account italics and without using \~l (which does not work in, eg, footnote)?
What you want here is to support the Unicode combining characters,
which isn't straightforward in TeX because according to the Standard,
they come after the base letter they modify, while TeX's accent commands
are, of course, typed before. So you can't simply make the combining
characters active and equivalent to the appropriate accent macros.
In traditional TeX, it would have been tempting to make the base letter
active instead, but this has a lot of drawbacks, and LuaTeX offers many
other possibilities. Here I've used a set of macros that Taco had
written a couple of months ago in response to a question by Thomas
Schmitz (see http://www.ntg.nl/pipermail/ntg-context/2007/027095.html).
The attached file implements the transformation of the sequence
Arthur Reutenauer wrote:
What you want here is to support the Unicode combining characters, which isn't straightforward in TeX because according to the Standard, they come after the base letter they modify,
Which is a fairly annoying syntax for our purpose.
-- The following should check if we read ‘l’ and ‘combining tilde’
-- consecutively. A lot of overhead; it would be much prettier to
-- implement a finite automaton :-)
Thanks for the reminder. We have been thinking about creating an lpeg variant that operates on tokens and/or nodes instead of simple data strings, but that will take quite a bit of work.

It would be possible to simplify the loop logic by storing 'v' in a local variable, so that t[] always lags behind one value:

    function convert_combining(str)
      -- l holds the previous token, so t[] always lags one token behind
      local l, t = { }, { }
      for _, v in ipairs(str) do
        if v[2] == 0x0303 and l[2] == 0x6c then
          -- 'l' followed by combining tilde: emit the accent command before the 'l'
          t[#t+1] = token.create('buildtextaccent')
          t[#t+1] = token.create('texttilde')
          v = { } -- the combining tilde is consumed; do not copy it to the output
        end
        if l[2] then t[#t+1] = l end
        l = v
      end
      -- flush the last pending token
      if l[2] then t[#t+1] = l end
      return t
    end

Best wishes,
Taco
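The finite-automaton idea mentioned above can be sketched outside LuaTeX. The following is only an illustration in Python, not the code from the attached file: it works on plain code points rather than LuaTeX tokens, and the `ACCENT_COMMANDS` table, the function name and the output tuples are hypothetical stand-ins for what \buildtextaccent\texttilde would do.

```python
# Hypothetical table: combining mark -> name of the accent command to emit.
ACCENT_COMMANDS = {
    0x0303: "texttilde",  # COMBINING TILDE
    0x030C: "textcaron",  # COMBINING CARON
}


def reorder_combining(codepoints):
    """Rewrite (base, combining mark) pairs as ('accent', command, base).

    Two-state scanner: either nothing is pending, or exactly one character
    is held back until we know whether a combining mark follows it.
    """
    out = []
    pending = None  # the code point currently held back, if any
    for cp in codepoints:
        is_pair = (
            pending is not None
            and pending not in ACCENT_COMMANDS  # the base must not be a mark
            and cp in ACCENT_COMMANDS
        )
        if is_pair:
            out.append(("accent", ACCENT_COMMANDS[cp], pending))
            pending = None  # both the base and the mark are consumed
        else:
            if pending is not None:
                out.append(("char", pending))
            pending = cp
    if pending is not None:
        out.append(("char", pending))  # flush the last held-back character
    return out
```

Calling `reorder_combining([0x6C, 0x0303])` yields a single accent item, so the accent comes first and the combining mark itself never reaches the output. Extending the table is enough to cover further marks, which is the main attraction over hard-coding one pair.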
I only wanted to add a note: XeTeX always converts, say, c + combining caron into a ccaron whenever one exists in the font (and does that on a really low level). If ccaron doesn't exist (or if there's no such combination in Unicode), it simply requests both glyphs from the font (and only modern fonts have those combining glyphs, I assume).

In the case of LM, the font has combining characters with zero width with the accent shifted to the left, so that it looks OK on an average glyph (but in general, TeX does a better job with combining characters) unless one requests two accents.

In my opinion, LuaTeX should also be able to handle such combining characters somewhere in the early stages (I have never followed the low-level details of LuaTeX - very often "low level" means "mkiv" for LuaTeX, so probably this still means - mkiv should handle that).

So, either ccaron or c+combining caron (or l+combining tilde) should behave the same way:
- if there's such a glyph in the font, use it
- if there is no such glyph, combine the character from c and a caron (but probably not the combining one! - different fonts have different ideas of what a combining character should be)

Also, {\v x} and other strange combinations don't work in ConTeXt (I guess it does in plain TeX) since ConTeXt MK II uses a clever way to figure out if such characters exist in the font encoding, but combinations of letters and accents that are not explicitly defined to work are ruled out, which is a pity.

Mojca
XeTeX always converts, say, c + combining caron into a ccaron whenever one exists in the font
Does it really? I had understood from the last discussion on the XeTeX list that it did not, with the example of capital alpha + combining breathing which was not set correctly. But maybe it's LaTeX's fault?
In the case of LM, the font has combining characters with zero width with the accent shifted to the left, so that it looks OK on an average glyph
That's a nice trick, but in the case of 'l', it looks really ugly.
(but in general, TeX does a better job with combining characters) unless one requests two accents.
Sure.
So, either ccaron or c+combining caron (or l+combining tilde) should behave the same way:
Yes, of course. This is Unicode canonical equivalence, explained in the chapters of the Unicode Standard that Idris linked to (chapter 2 is "General Structure", chapter 3 is "Conformance", and chapter 5, "Implementation Guidelines", may concern us too).
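Canonical equivalence can be checked with any Unicode normalization library. As a small illustration, unrelated to LuaTeX or ConTeXt internals, here is a sketch using Python's standard `unicodedata` module; the helper name `canonically_equivalent` is made up for this example.

```python
import unicodedata


def canonically_equivalent(a, b):
    """Two strings are canonically equivalent if their NFD forms are equal."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)


precomposed = "\u010D"   # LATIN SMALL LETTER C WITH CARON
decomposed = "c\u030C"   # c + COMBINING CARON

# The two spellings of c-caron are equivalent, and NFC picks the single code point.
same = canonically_equivalent(precomposed, decomposed)
composed_c = unicodedata.normalize("NFC", decomposed)

# l + COMBINING TILDE has no precomposed code point in Unicode,
# so NFC leaves it as a sequence of two code points.
l_tilde = "l\u0303"
composed_l = unicodedata.normalize("NFC", l_tilde)
```

Here `same` is true and `composed_c` is the one-character string U+010D, while `composed_l` still has length 2. That is exactly the situation of the l with tilde: normalization alone cannot help, so the typesetting engine has to build the accented glyph itself.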
Also, {\v x} and other strange combinations don't work in ConTeXt (I guess it does in plain TeX) since ConTeXt MK II uses a clever way to figure out if such characters exist in the font encoding
Mapping sequences like "\v c" to the appropriate slot in the current font encoding is quite legitimate; LaTeX does the same with its own font encodings. I didn't know it meant things like "\v x" couldn't be displayed, though. That said, it is something different from supporting combining characters. Arthur
Thanks for the reminder. We have been thinking about creating an lpeg variant that operates on tokens and/or nodes instead of simple data strings, but that will take quite a bit of work.
That sure would be nice.
It would be possible to simplify the loop logic by storing 'v' in a local variable, so that t[] always lags behind one value:
OK, thanks (but actually I was quite proud to get my code working already on the second try, so I kept it that non-optimal way ;-) Arthur
Arthur Reutenauer wrote:
What you want here is to support the Unicode combining characters, which isn't straightforward in TeX because according to the Standard, they come after the base letter they modify, while TeX's accent commands are, of course, typed before. So you can't simply make the combining characters active and equivalent to the appropriate accent macros.
if i know the precise specs i can build it into the utf collapser, which is way faster than dealing with tokens (mkiv will not have a token parser for the main input, at most for dedicated tasks)

Hans

-----------------------------------------------------------------
Hans Hagen | PRAGMA ADE
Ridderstraat 27 | 8061 GH Hasselt | The Netherlands
tel: 038 477 53 69 | fax: 038 477 53 74 | www.pragma-ade.com
| www.pragma-pod.nl
-----------------------------------------------------------------
On Sun, 13 Jan 2008 15:59:27 -0700, Hans Hagen
if i know the precise specs i can build it into the utf collapser, which is way faster than dealing with tokens (mkiv will not have a token parser for the main input, at most for dedicated tasks)
This may be a good place to start:

http://www.unicode.org/versions/Unicode5.0.0/ch02.pdf

see pages 48--54

don't know if this is precise enough... See also pp.~109--117 of

http://www.unicode.org/versions/Unicode5.0.0/ch03.pdf

which seems even more precise. See also

http://www.unicode.org/versions/Unicode5.0.0/UnicodeBookIX.pdf

under "combining"

Best wishes
Idris
if i know the precise specs i can build it into the utf collapser
I can work that out for you, but we need to think about how to treat all this consistently, in particular with respect to the questions Mojca raised:

· Equivalent sequences need to be treated the same way (c + combining caron == ccaron).
· If we need to compose a glyph out of other glyphs, it may be accounted for by some OpenType feature (in particular GPOS 'mark' and 'mkmk').
· If nothing else is available, the good ol' TeX way using \accent is still valid, but we need to preserve the original Unicode data in the PDF (for searching, etc.).

Arthur
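The three points above amount to a fallback chain. The Python sketch below is one possible reading of it, under loud assumptions: the font is modelled as a plain set of code points (`font_glyphs`), OpenType mark positioning is reduced to a boolean (`has_mark_feature`), and the function name and strategy labels are invented for the illustration. None of this reflects actual LuaTeX or mkiv interfaces.

```python
import unicodedata


def plan_rendering(text, font_glyphs, has_mark_feature=False):
    """Pick a strategy for one base letter plus its combining marks.

    Returns (strategy, payload). The caller is expected to keep `text`
    itself so the original Unicode data can still be written to the PDF.
    """
    # 1. Equivalent sequences are treated alike: try the composed form first.
    composed = unicodedata.normalize("NFC", text)
    if len(composed) == 1 and ord(composed) in font_glyphs:
        return ("precomposed", composed)

    # 2. Otherwise let the font position the marks, if it can.
    decomposed = unicodedata.normalize("NFD", text)
    if has_mark_feature and all(ord(ch) in font_glyphs for ch in decomposed):
        return ("opentype-mark", decomposed)

    # 3. Last resort: build the accent the traditional TeX way.
    return ("tex-accent", decomposed)
```

With a font that contains U+010D, both `"c\u030C"` and `"\u010D"` end up on the precomposed path, which is the equivalence requirement. For `"l\u0303"` no composed form exists, so the result depends only on whether the font offers mark positioning.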
Arthur Reutenauer wrote:
if i know the precise specs i can build it into the utf collapser
I can work that out for you, but we need to think about how to treat all this consistently, in particular with respect to the questions Mojca raised:
· Equivalent sequences need to be treated the same way (c + combining caron == ccaron).
· If we need to compose a glyph out of other glyphs, it may be accounted for by some OpenType feature (in particular GPOS 'mark' and 'mkmk').
· If nothing else is available, the good ol' TeX way using \accent is still valid, but we need to preserve the original Unicode data in the PDF (for searching, etc.).
(1) mkiv already has (actually it was one of the first things implemented) a utf composition handler; this one is initialized using the big char table which has information about the formal composition sequences

an option is to add more to this (like the lcaron); for those who want to play with it i added a command (beta upload)

\definecomposedutf 318 108 126 % lcaron

keep in mind that this acts on the input, so it may mess up definitions that contain l~ sequences; any input processing cq. token processing (later stage) is kind of dangerous

(2) it is possible (but no handy interface yet, i may make it a 'context' font feature) to complete a font with all its composed chars using virtual font trickery (see mk.pdf) which resolves the missing glyph issue

(3) later this year (after mplib) we will pick up a 'glyph not present in font' callback that's on our agenda

(4) another option is to deal with it in the node list handlers, but if possible i want to avoid this (the more passes, the slower)

[there is already quite some framework present in mkiv, but not always interfaced; much of this is also used in performance testing and such and some is reported in mk.pdf]

-----------------------------------------------------------------
Hans Hagen | PRAGMA ADE
Ridderstraat 27 | 8061 GH Hasselt | The Netherlands
tel: 038 477 53 69 | fax: 038 477 53 74 | www.pragma-ade.com
| www.pragma-pod.nl
-----------------------------------------------------------------
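To make point (1) and its caveat concrete, here is a rough Python sketch of an input-level collapser. It is not the mkiv implementation: the `define_composed` and `collapse` names, and the dictionary used as composition table, are assumptions made for the example. Only the numbers 318, 108 and 126 come from the \definecomposedutf line above.

```python
def define_composed(table, composed, base, mark):
    """Register that `base` followed by `mark` collapses to `composed`."""
    table[(base, mark)] = composed


def collapse(text, table):
    """Replace every registered two-character sequence in the input."""
    out = []
    i = 0
    while i < len(text):
        if i + 1 < len(text):
            key = (ord(text[i]), ord(text[i + 1]))
            if key in table:
                out.append(chr(table[key]))
                i += 2  # both input characters are consumed
                continue
        out.append(text[i])
        i += 1
    return "".join(out)


table = {}
define_composed(table, 318, 108, 126)  # 'l' (108) + '~' (126) -> U+013E (318)

plain = collapse("l~a", table)
inside_macro = collapse(r"\def\x{l~}", table)
```

`plain` becomes U+013E followed by `a`, as intended. `inside_macro` shows the danger mentioned above: the `l~` inside the definition is rewritten as well, because a collapser working on raw input cannot tell text from macro code.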
participants (5)
- Arthur Reutenauer
- Hans Hagen
- Idris Samawi Hamid
- Mojca Miklavec
- Taco Hoekwater