(again) index sorting of accented characters
Dear list,

sorry for bothering again with this issue, but I need to have indices in my documents.

I have the following sample:

    \mainlanguage[es]
    \setupregister[method=default]
    \starttext
    \startTEXpage[offset=1em]
    \index{ámame}\index{arisco}\index{ándrago}
    \index{antonia}\index{antón}
    \placeindex
    \stopTEXpage
    \stoptext

Word sorting is the following:

    antonia antón arisco ámame ándrago

Right word order is:

    ámame ándrago antón antonia arisco

In Spanish, as in other languages, an accented letter is not sorted differently than its unaccented counterpart.

I got the right word order adding these replacements in sort-lan.lua:

    replacements = {
        { "á", "a" }, { "é", "e" }, { "í", "i" },
        { "ó", "o" }, { "ú", "u" }, { "ü", "u" },
    },

Could anyone explain to me whether this is the right way of doing it?

I mean, if this is the way, I have other two patches for other two languages in which I have indices.

And if I’m wrong, I would like to know how to get right word sorting in registers.

Many thanks for your help,

Pablo
--
http://www.ousia.tk
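For readers who want to try the effect of such a replacement table outside ConTeXt, here is a minimal Python sketch. It is not ConTeXt's sorting code: the names REPLACEMENTS and sort_key are made up for this illustration, and the table simply maps each accented vowel to its unaccented counterpart, which is what the replacements in the message above are meant to do.

```python
# Illustration only: this is not ConTeXt's sorting code.
# REPLACEMENTS and sort_key are hypothetical names used for this sketch.
REPLACEMENTS = [
    ("á", "a"), ("é", "e"), ("í", "i"),
    ("ó", "o"), ("ú", "u"), ("ü", "u"),
]


def sort_key(word):
    """Return the word with each accented vowel replaced by its base letter."""
    for accented, plain in REPLACEMENTS:
        word = word.replace(accented, plain)
    return word


words = ["ámame", "arisco", "ándrago", "antonia", "antón"]

# Plain code-point order: accented initials sort after every unaccented one.
print(sorted(words))

# Sorting on the replaced form: accented and unaccented entries interleave.
print(sorted(words, key=sort_key))
```

With the sample words, the second call should print the entries in the order the message calls the right one (ámame, ándrago, antón, antonia, arisco). The sketch assumes the accented letters are stored as single precomposed characters.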
On 04/27/2017 07:21 PM, Pablo Rodriguez wrote:
I mean, if this is the way, I have other two patches for other two languages in which I have indices.
And if I’m wrong, I would like to know how to get right word sorting in registers.
Have you played with the different "methods" defined in sort-ini.lua, lines 96-103?

Thomas
On 04/27/2017 08:51 PM, Thomas A. Schmitz wrote:
On 04/27/2017 07:21 PM, Pablo Rodriguez wrote:
I mean, if this is the way, I have other two patches for other two languages in which I have indices.
And if I’m wrong, I would like to know how to get right word sorting in registers.
Have you played with the different "methods" defined in sort-ini.lua, lines 96-103?
Many thanks for your reply, Thomas.

The right values seem to be {zm, zc}. This works fine with Spanish and French, but ancient Greek is more problematic.

I have a source file, http://www.ousia.tk/grc-index.tex.

Standard sorting gives the following results: http://www.ousia.tk/grc-index-standard.pdf#page=3.

When I add replacements (http://www.ousia.tk/grc-replacements.diff) to sort-lan.lua, sorting order is right (http://www.ousia.tk/grc-index-modified.pdf#page=3).

Could you please confirm the issue?

Many thanks for your help,

Pablo
--
http://www.ousia.tk
On 04/27/2017 10:26 PM, Pablo Rodriguez wrote:
Could you please confirm the issue?
Many thanks for your help,
Two remarks:

1. I'm not sure what you're looking for. Do you really want an index that sorts every form of every word as an entry? So that ἐμήν and ἐμοῖς are different words and not occurrences of the same entry? If that's really what you're looking for, you may want to look into a very handy luatex function: characters.shaped() returns the unaccented characters of a unicode string, see chapter 11.2 of cld-mkiv.pdf. Define your own command that uses this lua function to index the unaccented word; that's not too hard.

2. If, on the other hand, you want to build a real index that will sort morphological forms under their head words, you will have to give the sort term explicitly, and then you don't have to rely on ConTeXt's abilities to sort accented Greek because you will have something like ἐμήν\index{εμοσ} in your text. For the time being, there's no software that can reliably parse ancient Greek, I'm afraid.

Thomas
On 04/27/2017 11:08 PM, Thomas A. Schmitz wrote:
Two remarks:
1. I'm not sure what you're looking for.
Sorry, Thomas, it is a question on pure word order. No correction in form selection for any existing or possible index.

This is my sample:

    \setupbodyfont[dejavu]
    \setupregister[language=gr, method={zm, zc}]
    \starttext
    \startTEXpage[offset=2em]
    \index{ἁμαρτάνω}
    \index{ἁστρονόμος} % I know breathing is wrong (only for testing)
    \index{Ἀπόλλων}
    \index{Ἀσκληπιός}
    \index{ἅπαξ}
    \index{αἰεί}
    \index{πᾶσα}
    \index{πᾶς}
    \placeindex
    \stopTEXpage
    \stoptext

This is the word order I get with current beta:

    Ἀπόλλων Ἀσκληπιός αἰεί ἁμαρτάνω ἁστρονόμος ἅπαξ πᾶσα πᾶς

Right word order should be:

    αἰεί ἁμαρτάνω ἅπαξ Ἀπόλλων Ἀσκληπιός ἁστρονόμος πᾶς πᾶσα

To get the right sorting, I have to apply the following patch: http://www.ousia.tk/grc-replacements.diff.

These replacements are required, from what I understand, because with the default settings "ἀ" is replaced with "αf", "α" with "αa", "ἁ" with "αg" and "ἅ" with "αl".

I don’t know why "α" isn’t the first in sorting, but it is clear that letters with different diacritical marks are considered as different letters for word sorting.

Could you confirm that the right word order is the second list in this message instead of the first one that ConTeXt generates by default?

Many thanks for your help,

Pablo
--
http://www.ousia.tk
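The order requested above corresponds to comparing base letters only, ignoring breathings, accents and case. A minimal Python sketch of that idea follows. It uses Unicode decomposition instead of a hand-written replacement table, so it is neither ConTeXt's method nor the patch linked in the message; greek_sort_key is a hypothetical name chosen for this illustration.

```python
import unicodedata


def greek_sort_key(word):
    """Reduce a word to its lowercase base letters for comparison."""
    # NFD splits each precomposed character into base letter + combining marks.
    decomposed = unicodedata.normalize("NFD", word)
    # Drop every combining mark, keeping only the base letters.
    base = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # casefold() removes case differences and maps final sigma to sigma.
    return base.casefold()


# The entries from the sample above, in the order they are indexed.
words = [
    "ἁμαρτάνω", "ἁστρονόμος", "Ἀπόλλων", "Ἀσκληπιός",
    "ἅπαξ", "αἰεί", "πᾶσα", "πᾶς",
]

for word in sorted(words, key=greek_sort_key):
    print(word, greek_sort_key(word))
```

For the eight sample words this should reproduce the second list in the message. Because all diacritics are discarded, two words that differ only in accent or breathing get the same key and keep their input order relative to each other; whether that is acceptable depends on the index.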
On 29. Apr 2017, at 13:10, Pablo Rodriguez wrote:
I don’t know why "α" isn’t the first in sorting, but it is clear that letters with different diacritical marks are considered as different letters for word sorting.
Could you confirm that the right word order is the second list in this message instead of the first one that ConTeXt generates by default?
No, I don't see why yours should be “right” and the order that is produced now should be “wrong.” It really depends on the purpose of your sorting.

I don’t know who contributed the current code to sort-ini.lua, but it makes consistent choices and produces a possible order. It’s not the order you would prefer, granted. It’s not the order I would prefer, granted again.

But what is the purpose of pushing so hard to have your favorite order included as default? You know what to do to have this order, and that’s all that’s important for you.

Other users may have different priorities (witness the long list in sort-ini.lua: someone went to great lengths to define this order). So I still don’t see what you’re trying to accomplish.

Thomas
On 04/29/2017 01:42 PM, Schmitz Thomas A. wrote:
Could you confirm that the right word order is the second list in this message instead of the first one that ConTeXt generates by default?
No, I don't see why yours should be “right” and the order that is produced now should be “wrong.” It really depends on the purpose of your sorting.
Sorry, Thomas, I’m afraid I don’t get your point here. I mean, if alphabetic sorting makes any sense at all, this is to sort index and dictionary entries. (If not, please tell me what I am missing here.)

Imagine that LSJ is edited again in 2017. You purchase the paper edition and you notice that vowels are sorted considering their diacritics too (resulting in cases such as ἅλς placed after ἁμαρτία). Wouldn’t you think that sorting is somehow “flawed” in that new edition?

The point I’m trying to make is that this isn’t about my personal preferences, but about conventions used for centuries.
I don’t know who contributed the current code to sort-ini.lua, but it makes consistent choices and produces a possible order. It’s not the order you would prefer, granted. It’s not the order I would prefer, granted again.
Hans kindly provided them (https://mailman.ntg.nl/pipermail/ntg-context/2017/088340.html) in response to a request of mine.
After using it, I realized that word order wasn’t right, although I didn’t understand why replacements were needed. Or why replacements required an unaccented Greek vowel and a Latin letter. Hans replied that some order was always needed. I needed more samples to realize that this wasn’t what I wanted.
But what is the purpose of pushing so hard to have your favorite order included as default? You know what to do to have this order, and that’s all that’s important for you.
This isn’t about my favorite order. This is about indices of (ancient) Greek names or words.
German has five sorting criteria (de, Duden, two DIN and de-AT), but why is the default criterion to sort (ancient) Greek foreign to practice over centuries? That being said, I don’t think that Hans intended to establish a new sorting criterion. The whole problem was that I couldn’t explain this issue better.
Other users may have different priorities (witness the long list in sort-ini.lua: someone went to great lengths to define this order). So I still don’t see what you’re trying to accomplish.
An index with classical Greek words (or names) that follows the same principle as in German, English or Dutch: word sorting is the same as in the most important dictionaries.

This is the main reason for having it as a default.

I hope it is clear now,

Pablo
--
http://www.ousia.tk
On 29. Apr 2017, at 16:51, Pablo Rodriguez wrote:
An index with classical Greek words (or names) that follows the same principle as in German, English or Dutch: word sorting is the same as in most important dictionaries.
This is the main reason of having it as a default.
Sorry, but I still don't see your point here.

1. You refer to “practice over centuries.” Can you point me to a traditional index that contains an entry such as ἐκτὸς (from your example file)?

2. Sorting in ConTeXt may be used for more purposes than for printed books. I use it to analyze the content of my TEI xml files.

3. You refer to a new edition of LSJ in 2017. If it sorted words the way you suggest, with every possible morphological form as its own entry, you would think that was “flawed” too.

So I’m sorry, I still don’t get your point: this is about your personal preference, and I still don’t see why you are pushing so hard to have this preference as a default. This really is a corner case that is of interest to so few users that I wouldn’t consider it a good use of a developer’s time. It’s easy to achieve what you want (have you looked into the luatex solution I provided?), what is insufficient about this? Who is going to profit from this long discussion?

Thomas
On 04/29/2017 05:36 PM, Schmitz Thomas A. wrote:
[...]
Who is going to profit from this long discussion?
Sorry for having abused your help and your time, Thomas. I’m afraid I cannot explain such a basic issue clearly.

Many thanks again for your kind help,

Pablo
--
http://www.ousia.tk
Have you played with the different "methods" defined in sort-ini.lua, lines 96-103?
Many thanks for your reply, Thomas.
The right values seem to be {zm, zc}. This works fine with Spanish and French, […]
Even with {zm, zc} or any of the predefined methods I don’t really think it is working completely as expected, even though the problem might just occur in very special cases:

    \setupregister[method={zm, zc}]
    \starttext
    \startTEXpage[offset=2em]
    \index{káv} \index{kav} \index{káva} \index{kava}
    \index{káf} \index{kaf} \index{káfa} \index{kafa}
    \index{kaka} \index{káka}
    \placeindex[language=es]
    \stopTEXpage
    \stoptext

gives

    kaf kafa káf káfa kaka káka kav kava káv káva

though I’d think it would be correct/logical to sort:

    kaf káf kafa káfa kaka káka kav káv kava káva

Is there a way to get this result with „methods“ or would I need to modify the sort-rules?

(The example words are not Spanish, obviously, but Icelandic/Faroese, which I am trying to correct/set up. But I’d imagine that the same would be the desired behaviour for Spanish and languages with similar traditions, too, wouldn’t it?)

Best
Florian.
____________________________________________
Florian Grammel
Copenhagen, Denmark
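The order asked for above can be described as a two-level comparison: compare the base letters first, and look at the accents only when the base letters are identical. The Python sketch below illustrates that rule on the ten sample words. It says nothing about how ConTeXt's methods work internally; strip_accents and two_level_key are hypothetical names for this illustration.

```python
import unicodedata


def strip_accents(word):
    """Return the word without combining marks (á becomes a)."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def two_level_key(word):
    """First level: base letters. Second level: the word itself."""
    # Tuples compare element by element, so the second element only matters
    # when two words have exactly the same base letters.
    return (strip_accents(word), unicodedata.normalize("NFC", word))


words = ["káv", "kav", "káva", "kava", "káf", "kaf",
         "káfa", "kafa", "kaka", "káka"]

result = sorted(words, key=two_level_key)
print(" ".join(result))
```

In this sample the unaccented form of each pair comes first because "a" has a lower code point than "á"; a real collation rule would state that tie-break explicitly instead of relying on code points.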
On 04/28/2017 06:27 PM, Florian Grammel wrote:
Have you played with the different "methods" defined in sort-ini.lua, lines 96-103?
Many thanks for your reply, Thomas.
The right values seem to be {zm, zc}. This works fine with Spanish and French, […]
Even with {zm, zc} or any of the predefined methods I don’t really think it is working completely as expected, even though the problem might just occur in very special cases:
Hi Florian,

a simpler sample would be:

    \mainlanguage[es]
    \setupregister[language=es, method={zm, zc}]
    \starttext
    \startTEXpage[offset=2em]
    \index{cómodo}
    \index{comodos}
    \index{cómoda}
    \placeindex
    \stopTEXpage
    \stoptext

I know that "comodos" isn’t a word in Spanish. But it should be the last word in the sorting.
Is there a way to get this result with „methods“ or would I need to modify the sort-rules?
I think that sort-lan.lua might be wrong here. Let me explain why.

German ("de-DE") has the following code:

    replacements = {
        { "ä", 'ae' }, { "Ä", 'Ae' },
        { "ö", 'oe' }, { "Ö", 'Oe' },
        { "ü", 'ue' }, { "Ü", 'Ue' },
        { "ß", 's' },
    },

This is to get umlaut forms and eszett sorted as ae, Ae, oe, Oe, ue, Ue and s (although I wonder whether ß shouldn’t be replaced with ss).

Austrian German ("de-AT") doesn’t contain these replacements. Umlaut forms are given different entries.

I guess if we need a similar behavior for Spanish, replacements of accented glyphs should be created, such as in:

    replacements = {
        { "á", 'a' }, { "Á", 'A' },
        { "é", 'e' }, { "É", 'E' },
        { "í", 'i' }, { "Í", 'I' },
        { "ó", 'o' }, { "Ó", 'O' },
        { "ú", 'u' }, { "Ú", 'U' },
        { "ü", 'u' }, { "Ü", 'U' },
    },

So you get the right word order:

    cómoda cómodo comodos

But Hans has to confirm the issue first.

Just in case it helps,

Pablo
--
http://www.ousia.tk
participants (4)
- Florian Grammel
- Pablo Rodriguez
- Schmitz Thomas A.
- Thomas A. Schmitz