active strings in luatex?

Idris Samawi Hamid

28 Dec 2007

7:29 p.m.

Dear Hans and Taco,

This may be of general interest to the Europeans (and indirectly relates to the \sh@ft email). I need the following:

LATIN CAPITAL LETTER L WITH TILDE;004C 0303
LATIN SMALL LETTER L WITH TILDE;006C 0303

The proposal is still under consideration for Lithuanian and not yet in Unicode.

In luatex can I make a definition such that the string

U004C U0303 (l ̃)

is always treated as l with tilde above, taking into account italics and without using \~l (which does not work in, e.g., footnotes)?

Best wishes
Idris

--
Professor Idris Samawi Hamid, Editor-in-Chief
International Journal of Shi`i Studies
Department of Philosophy
Colorado State University
Fort Collins, CO 80523


Arthur Reutenauer

13 Jan 2008

4:31 a.m.

Hello Idris, I didn't see any reply to this e-mail you sent two weeks ago, so I wanted to give it a try:

...

In luatex can I make a definition such that the string

U004C U0303 (l ̃)

is always treated as l with tilde above, taking into account italics and without using \~l (which does not work in, eg, footnote)?

What you want here is to support the Unicode combining characters, which isn't straightforward in TeX because, according to the Standard, they come after the base letter they modify, while TeX's accent commands are, of course, typed before. So you can't simply make the combining characters active and equivalent to the appropriate accent macros.

In traditional TeX, it would have been tempting to make the base letter active instead, but this has a lot of drawbacks, and LuaTeX offers many other possibilities. Here I've used a set of macros that Taco wrote a couple of months ago in response to a question by Thomas Schmitz (see http://www.ntg.nl/pipermail/ntg-context/2007/027095.html). The attached file implements the transformation of the sequence into "\buildtextaccent\texttilde l", which I hope gives the expected result in every circumstance. I've done it only for the small letter, but of course it's easy to adapt it to add the capital letter as well.

Finally, I wish to clarify a small misunderstanding. You quoted the two lines below:

LATIN CAPITAL LETTER L WITH TILDE;004C 0303
LATIN SMALL LETTER L WITH TILDE;006C 0303

with the comment "The proposal is still under consideration for Lithuanian and not yet in Unicode". Actually it is already encoded in Unicode; that is, all the characters you need are present with the appropriate semantics, and you can accurately represent a small l with tilde in Unicode; only, you have to use two characters (U+006C followed by U+0303).

The only thing that will be added to Unicode in that respect is the *name* of those strings (I guess you took those two lines from the data files for Unicode version 5.1.0, in beta stage). The corresponding precomposed characters, though, will not be added to Unicode, according to a decision which was made several years ago (I could trace it back to a discussion at the Unicode Technical Committee in October 1999, but I don't know the details).

The idea is that it can already be represented as a sequence of characters, and the Unicode Consortium does not wish to make the set of alphabetic characters explode with diacritics. In spite of this, Unicode still wishes to acknowledge that some unencoded accented letters are important in some languages, and provides names for the character sequences representing them, like it does for all the encoded characters. The relevant document that explains this is Unicode Standard Annex #34 (http://www.unicode.org/reports/tr34/).

Arthur

Taco Hoekwater

12:18 p.m.

Arthur Reutenauer wrote:

...

Hello Idris,

I didn't see any reply to this e-mail you sent two weeks ago, so I wanted to give it a try:

...
In luatex can I make a definition such that the string

U004C U0303 (l ̃)

is always treated as l with tilde above, taking into account italics and without using \~l (which does not work in, eg, footnote)?

What you want here is to support the Unicode combining characters, which isn't straightforward in TeX because according to the Standard, they come after the base letter they modify,

Which is a fairly annoying syntax for our purpose.

...

-- The following should check if we read ‘l’ and ‘combining tilde’
-- consecutively. A lot of overhead; it would be much prettier to
-- implement a finite automaton :-)

Thanks for the reminder. We have been thinking about creating an lpeg variant that operates on tokens and/or nodes instead of simple data strings, but that will take quite a bit of work.

It would be possible to simplify the loop logic by storing 'v' in a local variable, so that t[] always lags behind one value:

    function convert_combining(str)
      local l, t = { }, { }
      for _, v in ipairs(str) do
        if v[2] == 0x0303 and l[2] == 0x6c then
          t[#t+1] = token.create('buildtextaccent')
          t[#t+1] = token.create('texttilde')
        end
        if l[2] then t[#t+1] = l end
        l = v
      end
      if l[2] then t[#t+1] = l end
      return t
    end

Best wishes,
Taco
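The one-item lookbehind described above can be tried outside LuaTeX. What follows is a minimal plain-Lua sketch, not the code from the thread: it assumes the input is a plain array of code points rather than token tables, the name `mark_l_tilde` and the `"ACCENT"` placeholder (standing in for the `\buildtextaccent` and `\texttilde` tokens) are made up for illustration, and, unlike the loop above, it drops the combining tilde once it has been matched.

```lua
-- Plain-Lua sketch of a one-item lookbehind over code points.
-- Assumption: "ACCENT" is only a placeholder for the accent-building tokens.
local BASE_L, COMBINING_TILDE = 0x6C, 0x0303

function mark_l_tilde(codepoints)
  local out, prev = {}, nil          -- prev holds the item not yet emitted
  for _, cp in ipairs(codepoints) do
    if cp == COMBINING_TILDE and prev == BASE_L then
      -- Put the accent marker in front of the pending base letter
      -- and consume the combining mark.
      out[#out + 1] = "ACCENT"
      out[#out + 1] = prev
      prev = nil
    else
      if prev then out[#out + 1] = prev end
      prev = cp
    end
  end
  if prev then out[#out + 1] = prev end  -- flush the last pending item
  return out
end
```

Because the output always lags one item behind the input, the marker can be placed before the base letter without re-scanning the list.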

Mojca Miklavec

3:38 p.m.

I only wanted to add a note: XeTeX always converts, say, c + combining caron into a ccaron whenever one exists in the font (and does that at a really low level). If ccaron doesn't exist (or if there's no such combination in Unicode), it simply requests both glyphs from the font (and only modern fonts have those combining glyphs, I assume).

In the case of LM, the font has combining characters with zero width with the accent shifted to the left, so that it looks OK on an average glyph (but in general, TeX does a better job with combining characters) unless one requests two accents.

In my opinion, LuaTeX should also be able to handle such combining characters somewhere in the early stages (I have never followed the low-level details of LuaTeX - very often "low level" means "mkiv" for LuaTeX, so probably this still means that mkiv should handle it).

So, either ccaron or c+combining caron (or l+combining tilde) should behave the same way:

- if there's such a glyph in the font, use it
- if there is no such glyph, combine the character from c and a caron (but probably not the combining one! - different fonts have different ideas of what a combining character should be)

Also, {\v x} and other strange combinations don't work in ConTeXt (I guess they do in plain TeX), since ConTeXt MK II uses a clever way to figure out whether such characters exist in the font encoding, but combinations of letters and accents that are not explicitly defined to work are ruled out, which is a pity.

Mojca
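The two rules in the message above amount to a small decision procedure. The following plain-Lua sketch is only an illustration under stated assumptions: the tables `composed` and `spacing_accent`, the function `choose_rendering`, and the idea of describing a font as a set of code points are all hypothetical, and none of this is an existing LuaTeX or ConTeXt interface.

```lua
-- Hypothetical table: composed[base][combining mark] = precomposed code point.
local composed = {
  [0x63] = { [0x030C] = 0x010D },   -- c + combining caron -> ccaron
}

-- Hypothetical map from combining marks to their spacing counterparts,
-- which is what a TeX-style accent construction would place over the base.
local spacing_accent = {
  [0x030C] = 0x02C7,   -- caron
  [0x0303] = 0x02DC,   -- small tilde
}

-- font_has is a set: font_has[codepoint] is true if the font has that glyph.
function choose_rendering(base, mark, font_has)
  local pre = composed[base] and composed[base][mark]
  if pre and font_has[pre] then
    return "precomposed", pre            -- rule 1: use the font's own glyph
  end
  local accent = spacing_accent[mark]
  if accent and font_has[accent] and font_has[base] then
    return "accent", accent, base        -- rule 2: build it from base + accent
  end
  return "raw", base, mark               -- nothing better: pass both through
end
```

For example, `choose_rendering(0x6C, 0x0303, font)` can never take the first branch, since there is no precomposed l with tilde in the table, so it falls through to the accent construction whenever the font has both an l and a spacing tilde.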

Arthur Reutenauer

14 Jan 2008

1:20 a.m.

...

XeTeX always converts, say, c + combining caron into a ccaron whenever one exists in the font

Does it really? I had understood from the last discussion on the XeTeX list that it did not, with the example of capital alpha + combining breathing which was not set correctly. But maybe it's LaTeX's fault?

...

In the case of LM, the font has combining characters with zero width with the accent shifted to the left, so that it looks OK on an average glyph

That's a nice trick, but in the case of 'l', it looks really ugly.

...

(but in general, TeX does a better job with combining characters) unless one requests two accents.

Sure.

...

So, either ccaron or c+combining caron (or l+combining tilde) should behave the same way:

Yes, of course. This is Unicode canonical equivalence, explained in the chapters of the Unicode Standard that Idris linked to (chapter 2 is "General Structure", chapter 3 is "Conformance", and the chapter "Implementation Guidelines" may concern us, too).

...

Also, {\v x} and other strange combinations don't work in ConTeXt (I guess it does in plain TeX) since ConTeXt MK II uses a clever way to figure out if such characters exist in the font encoding

Mapping sequences like "\v c" to the appropriate slot in the current font encoding is quite legitimate; LaTeX does the same with its own font encodings. I didn't know it meant things like "\v x" couldn't be displayed, though. That said, it is something different from supporting combining characters.

Arthur

Arthur Reutenauer

13 Jan 2008

7:51 p.m.

...

Thanks for the reminder. We have been thinking about creating an lpeg variant that operates on tokens and/or nodes instead of simple data strings, but that will take quite a bit of work.

That sure would be nice.

...

It would be possible to simplify the loop logic by storing 'v' in a local variable, so that t[] always lags behind one value:

OK, thanks (but actually I was quite proud to get my code working already on the second try, so I kept it that non-optimal way ;-)

Arthur

Hans Hagen

11:59 p.m.

Arthur Reutenauer wrote:

...

Hello Idris,

I didn't see any reply to this e-mail you sent two weeks ago, so I wanted to give it a try:

...
In luatex can I make a definition such that the string

U004C U0303 (l ̃)

is always treated as l with tilde above, taking into account italics and without using \~l (which does not work in, eg, footnote)?

What you want here is to support the Unicode combining characters, which isn't straightforward in TeX because according to the Standard, they come after the base letter they modify, while TeX's accent commands are, of course, typed before. So you can't simply make the combining characters active and equivalent to the appropriate accent macros.

if i know the precise specs i can build it into the utf collapser, which is way faster than dealing with tokens (mkiv will not have a token parser for the main input, at most for dedicated tasks)

Hans

-----------------------------------------------------------------
Hans Hagen | PRAGMA ADE
Ridderstraat 27 | 8061 GH Hasselt | The Netherlands
tel: 038 477 53 69 | fax: 038 477 53 74 | www.pragma-ade.com | www.pragma-pod.nl
-----------------------------------------------------------------

Idris Samawi Hamid

14 Jan 2008

12:25 a.m.

On Sun, 13 Jan 2008 15:59:27 -0700, Hans Hagen wrote:

...

...
...
In luatex can I make a definition such that the string

U004C U0303 (l ̃)

is always treated as l with tilde above, taking into account italics and without using \~l (which does not work in, eg, footnote)?

What you want here is to support the Unicode combining characters, which isn't straightforward in TeX because according to the Standard, they come after the base letter they modify, while TeX's accent commands are, of course, typed before. So you can't simply make the combining characters active and equivalent to the appropriate accent macros.

if i know the precise specs i can build it into the utf collapser, which is way faster than dealing with tokens (mkiv will not have a token parser for the main input, at most for dedicated tasks)

This may be a good place to start:

http://www.unicode.org/versions/Unicode5.0.0/ch02.pdf

see pages 48--54. I don't know if this is precise enough...

See also pp. 109--117 of

http://www.unicode.org/versions/Unicode5.0.0/ch03.pdf

which seems even more precise. See also

http://www.unicode.org/versions/Unicode5.0.0/UnicodeBookIX.pdf

under "combining".

Best wishes
Idris

Arthur Reutenauer

12:30 a.m.

...

if i know the precise specs i can build it into the utf collapser

I can work that out for you, but we need to think about how to treat all this consistently, in particular with respect to the questions Mojca raised:

· Equivalent sequences need to be treated the same way (c + combining caron == ccaron).
· If we need to compose a glyph out of other glyphs, it may be accounted for by some OpenType feature (in particular GPOS 'mark' and 'mkmk').
· If nothing else is available, the good ol' TeX way using \accent is still valid, but we need to preserve the original Unicode data in the PDF (for searching, etc.).

Arthur
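The first requirement above (c + combining caron == ccaron) can be sketched by comparing sequences after decomposition. This is a deliberately tiny plain-Lua illustration, not a conforming Unicode normalizer: the `decomposition` table holds a single hand-written entry, the function names are invented here, and the sketch ignores recursive decompositions and the reordering of multiple combining marks that full canonical equivalence requires.

```lua
-- Hand-written, one-entry decomposition table (an assumption for illustration;
-- a real implementation would load this from the Unicode character data).
local decomposition = {
  [0x010D] = { 0x63, 0x030C },   -- ccaron -> c + combining caron
}

-- Replace every precomposed code point by its decomposition.
function decompose(codepoints)
  local out = {}
  for _, cp in ipairs(codepoints) do
    local parts = decomposition[cp]
    if parts then
      for _, p in ipairs(parts) do out[#out + 1] = p end
    else
      out[#out + 1] = cp
    end
  end
  return out
end

-- Two sequences count as equivalent here if their decompositions match.
function canonically_equal(a, b)
  local da, db = decompose(a), decompose(b)
  if #da ~= #db then return false end
  for i = 1, #da do
    if da[i] ~= db[i] then return false end
  end
  return true
end
```

With this table, `canonically_equal({0x010D}, {0x63, 0x030C})` holds, while l + combining tilde is only ever equal to itself because it has no precomposed form.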

Hans Hagen

9:59 a.m.

Arthur Reutenauer wrote:

...

...
if i know the precise specs i can build it into the utf collapser

I can work that out for you, but we need to think about how to treat all this consistently, in particular with respect to the questions Mojca raised:

· Equivalent sequences need to be treated the same way (c + combining caron == ccaron).
· If we need to compose a glyph out of other glyphs, it may be accounted for by some OpenType feature (in particular GPOS 'mark' and 'mkmk').
· If nothing else is available, the good ol' TeX way using \accent is still valid, but we need to preserve the original Unicode data in the PDF (for searching, etc.).

(1) mkiv already has (actually it was one of the first things implemented) a utf composition handler; this one is initialized using the big char table, which has information about the formal composition sequences

an option is to add more to this (like the lcaron); for those who want to play with it i added a command (beta upload)

    \definecomposedutf 318 108 126 % lcaron

keep in mind that this acts on the input, so it may mess up definitions that contain l~ sequences; any input processing, or token processing at a later stage, is kind of dangerous

(2) it is possible (but no handy interface yet, i may make it a 'context' font feature) to complete a font with all its composed chars using virtual font trickery (see mk.pdf), which resolves the missing glyph issue

(3) later this year (after mplib) we will pick up a 'glyph not present in font' callback that's on our agenda

(4) another option is to deal with it in the node list handlers, but if possible i want to avoid this (the more passes, the slower)

[there is already quite some framework present in mkiv, but not always interfaced; much of this is also used in performance testing and such and some is reported in mk.pdf]
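To make point (1) concrete, here is a minimal plain-Lua sketch of a table-driven collapser that acts on the input as a list of code points. It is not the mkiv implementation: `define_composed` and `collapse` are hypothetical names, and the argument order (composed, first, second) is an assumption read off the single `\definecomposedutf 318 108 126` example.

```lua
-- compositions[first][second] = composed code point
local compositions = {}

-- Hypothetical counterpart of \definecomposedutf; argument order is assumed.
function define_composed(composed, first, second)
  compositions[first] = compositions[first] or {}
  compositions[first][second] = composed
end

-- Scan the input once, replacing every registered pair by its composed form.
function collapse(codepoints)
  local out = {}
  local i, n = 1, #codepoints
  while i <= n do
    local first, second = codepoints[i], codepoints[i + 1]
    local row = compositions[first]
    local composed = row and second and row[second]
    if composed then
      out[#out + 1] = composed
      i = i + 2                      -- both input items are consumed
    else
      out[#out + 1] = first
      i = i + 1
    end
  end
  return out
end

define_composed(318, 108, 126)       -- the numbers from the example above
```

Since 108 and 126 are the plain characters `l` and `~`, every `l~` in the input is rewritten, including occurrences inside macro definitions, which is the risk the message warns about.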
