Quickly invoke a self-defined index sorting file?
Hi,

I defined an index file for sorting Chinese (a bit large, almost 4 MB):
https://github.com/Soanguy/ConTeXt-Chinese-Example/blob/master/sorting/sort-...

But I have to load this file with \input every time to enable it. How can I integrate these files into my local ConTeXt installation and enable them directly with a setup command?

I want to use

%%%
\setupregister[index][n=4,language=cn-alpha,]
%%%

directly, instead of

%%%
\input sort-alpha.lua
\setupregister[index][ n=4, language={cn-alpha},]
%%%

In addition, the following notification appears on the TeX terminal, probably because there are too many characters in the index file (tens of thousands). Can I avoid it?

tex memory > bumping category 'token' succeeded, details: all=16000000 | ext=0 | ini=568315 | itm=8 | max=10000000 | mem=2000000 | min=2000000 | ptr=1999080 | set=10000000 | stp=1000000 | top=2000000
tex memory > bumping category 'token' succeeded, details: all=24000000 | ext=0 | ini=568315 | itm=8 | max=10000000 | mem=3000000 | min=2000000 | ptr=2999080 | set=10000000 | stp=1000000 | top=3000000
tex memory > bumping category 'token' succeeded, details: all=32000000 | ext=0 | ini=568315 | itm=8 | max=10000000 | mem=4000000 | min=2000000 | ptr=3999080 | set=10000000 | stp=1000000 | top=4000000
tex memory > bumping category 'token' succeeded, details: all=40000000 | ext=0 | ini=568315 | itm=8 | max=10000000 | mem=5000000 | min=2000000 | ptr=4999080 | set=10000000 | stp=1000000 | top=5000000
tex memory > bumping category 'token' succeeded, details: all=48000000 | ext=0 | ini=568315 | itm=8 | max=10000000 | mem=6000000 | min=2000000 | ptr=5999080 | set=10000000 | stp=1000000 | top=6000000
tex memory > bumping category 'token' succeeded, details: all=56000000 | ext=0 | ini=568315 | itm=8 | max=10000000 | mem=7000000 | min=2000000 | ptr=6999080 | set=10000000 | stp=1000000 | top=7000000

autumnus
On 1/12/2025 9:58 AM, autumnus wrote:
> hi,
> I defined an index file for sorting Chinese (a bit large, almost 4 MB):
> https://github.com/Soanguy/ConTeXt-Chinese-Example/blob/master/sorting/sort-...
> But I have to load this file with \input every time to enable it. How can I integrate these files into my local ConTeXt installation and enable them directly with a setup command?
> I want to use %%% \setupregister[index][n=4,language=cn-alpha,] %%% directly
> instead of %%% \input sort-alpha.lua \setupregister[index][ n=4, language={cn-alpha},] %%%
remove \startluacode and \stopluacode in that file and do this instead:

    \registerctxluafile{sort-hanzi}{}

    \starttext
    test
    \stoptext

currently you first load that whole file in memory tokenized (i.e. 1 byte becomes 8 bytes), which is fine (and fast) for reasonable size files, but in your case it has to bump token memory

also, you don't really need the huge entries table because you're not going to split the index for every first character

maybe this

    definitions["cn-hanzi"].entries = table.setmetatableindex(function(t,k)
        if utfbyte(k) < 1000 then
            return "latin"
        else
            return "chinese"
        end
    end)

    print(definitions["cn-hanzi"].entries['a'])
    print(definitions["cn-hanzi"].entries['咗'])

but even then ... korean and japanese don't have that either, so basically you only need the order (is that order defined somewhere in unicode?)
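For readers unfamiliar with the metatable trick in the reply above, here is a minimal plain-Lua sketch of the same idea: the entry for a character is computed on demand instead of being stored in a huge table. It assumes Lua 5.3 or later (for the built-in utf8 library) and runs outside ConTeXt; the name classify and the caching via rawset are additions for illustration and are not part of the original snippet, which uses ConTeXt's own table.setmetatableindex and utfbyte helpers.

```lua
-- Plain-Lua sketch of a lazily computed "entries" table.
-- Assumption: Lua 5.3+ (built-in utf8 library); not ConTeXt-specific code.

-- Hypothetical helper: decide which index section a character belongs to.
local function classify(char)
    -- Code points below 1000 are treated as "latin", as in the reply above.
    if utf8.codepoint(char) < 1000 then
        return "latin"
    else
        return "chinese"
    end
end

-- __index is only called for keys that are not stored in the table yet.
local entries = setmetatable({}, {
    __index = function(t, k)
        local v = classify(k)
        rawset(t, k, v) -- cache the result (an addition in this sketch)
        return v
    end,
})

print(entries["a"])  -- latin
print(entries["咗"]) -- chinese
```

The table starts empty and only ever holds the characters that were actually looked up, so nothing close to the full character list has to be loaded.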
-- ----------------------------------------------------------------- Hans Hagen | PRAGMA ADE Ridderstraat 27 | 8061 GH Hasselt | The Netherlands tel: 038 477 53 69 | www.pragma-ade.nl | www.pragma-pod.nl -----------------------------------------------------------------
Thanks for the explanation.

After using \registerctxluafile{sort-hanzi}{}, the bumping message no longer appears.

For daily practical use, I really don't need that many characters. I just don't have the energy to pick the few thousand commonly used Chinese characters out of these 40,000 or 50,000. (In China, for example, only about 6,000-8,000 characters are actually used on a daily basis. In Japanese, you may only need about 1,000-3,000.)

There are only two commonly used ways of sorting these characters (neither has anything to do with Unicode order):

1 by the actual pronunciation of the characters
2 by the order in which the characters are written (strokes)

(Japanese is probably mostly sorted by pronunciation based on kana, but the pronunciation of kanji in Japanese is much more complicated than in Chinese.)

Sorting by strokes is beyond my ability at the moment, so the three indexes I designed are all sorted by the actual pronunciation of the Chinese characters. They differ only in the entries:

1 Sort in the order a, b, c, d, and use these letters as entries (the most used).
2 Sort in the order a, ai, ao, an, ......, and use these pronunciations as entries.
3 Sort the Chinese characters directly by their pronunciation and use the characters themselves as entries.

Because I know almost nothing about Lua myself, I just followed sort-lang (applying it as a template).

As for Japanese, the sorting I have seen in LaTeX so far also marks up the pronunciation directly and then sorts by the kana pronunciation (because a kanji in Japanese may have more than one pronunciation, perhaps as many as 5), unless there is a tool that can annotate the Chinese characters in the index with their pronunciation at compile time.
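To make the difference between the three variants concrete, here is a small plain-Lua sketch. It is not taken from sort-alpha.lua: the five-character pinyin table is made-up sample data, the function names are hypothetical, and comparing toneless syllables as plain strings is an assumption that may not match the order table in the actual file.

```lua
-- Hypothetical sample data: character -> toneless pinyin syllable.
local pinyin = {
    ["北"] = "bei",
    ["阿"] = "a",
    ["茶"] = "cha",
    ["爱"] = "ai",
    ["八"] = "ba",
}

-- Sort the characters by their pinyin syllable (plain string comparison,
-- which is an assumption of this sketch).
local function sorted_by_pinyin(map)
    local chars = {}
    for char in pairs(map) do
        chars[#chars + 1] = char
    end
    table.sort(chars, function(x, y)
        if map[x] ~= map[y] then
            return map[x] < map[y]
        end
        return x < y -- same syllable: fall back to byte order
    end)
    return chars
end

-- The order is the same for all three variants; only the entry differs.
local function entry(char, variant)
    local syllable = pinyin[char]
    if variant == "letter" then
        return syllable:sub(1, 1) -- variant 1: a, b, c, ...
    elseif variant == "syllable" then
        return syllable           -- variant 2: a, ai, ba, ...
    else
        return char               -- variant 3: the character itself
    end
end

for _, char in ipairs(sorted_by_pinyin(pinyin)) do
    print(char, entry(char, "letter"), entry(char, "syllable"), entry(char, "character"))
end
```

With this sample data the loop prints the characters in the order 阿, 爱, 八, 北, 茶, each followed by its entry under the three variants.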
On 1/12/2025 12:53 PM, autumnus wrote:
> Thanks for the explanation.
> After using \registerctxluafile{sort-hanzi}{}, the bumping message no longer appears.
> For daily practical use, I really don't need that many characters. I just don't have the energy to pick the few thousand commonly used Chinese characters out of these 40,000 or 50,000. (In China, for example, only about 6,000-8,000 characters are actually used on a daily basis. In Japanese, you may only need about 1,000-3,000.)
so the entries table can just be omitted then
> There are only two commonly used ways of sorting these characters (neither has anything to do with Unicode order): 1 by the actual pronunciation of the characters and 2 by the order in which the characters are written (strokes). (Japanese is probably mostly sorted by pronunciation based on kana, but the pronunciation of kanji in Japanese is much more complicated than in Chinese.)
> Sorting by strokes is beyond my ability at the moment, so the three indexes I designed are all sorted by the actual pronunciation of the Chinese characters.
> They differ only in the entries. 1 Sort in the order a, b, c, d, and use these letters as entries (the most used). 2 Sort in the order a, ai, ao, an, ......, and use these pronunciations as entries. 3 Sort the Chinese characters directly by their pronunciation and use the characters themselves as entries.
> Because I know almost nothing about Lua myself, I just followed sort-lang (applying it as a template).
you can set up a combination of sorting if needed

so the 'order' table is what matters in your case; is that table made from some public list?
> For the sorting of Japanese, the sorting I see on latex [...]
> Unless there is a tool that can simultaneously phonetize the Chinese characters in the index at compile time.
if we have the basic data (how to pronounce a single char) then runtime is no big deal

Hans
Chinese Hanzi PinYin: https://github.com/mozillazg/pinyin-data/tree/master
Chinese Hanzi Stroke: https://github.com/leo-liu/zhmakeindex/blob/master/CJK/strokes.go
Japanese Hanzi pronunciation: https://github.com/cjkvi/cjkvi-tables/blob/master/joyo2010.txt
On 1/12/2025 2:54 PM, autumnus wrote:
> Chinese Hanzi PinYin: https://github.com/mozillazg/pinyin-data/tree/master
> Chinese Hanzi Stroke: https://github.com/leo-liu/zhmakeindex/blob/master/CJK/strokes.go
assuming that you downloaded kHanyuPinlu.txt you can run the attached test file

You might want to set up an order / entries for the latin variant

just prototyping here

Hans
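The attached test file is not part of this archive. As a rough illustration of the "basic data" step (one pronunciation list per character), here is a plain-Lua sketch that parses lines of the kind such a file might contain. The line format "U+XXXX: reading,reading  # char" is an assumption about the pinyin-data files and is not confirmed in this thread; the sample lines are made up for the example, and parse_line is a hypothetical name. It needs Lua 5.3 or later for utf8.char.

```lua
-- Assumed line format (not confirmed in this thread):
--   U+4E00: yī  # 一
-- Returns the character and a list of its readings, or nil for other lines.
local function parse_line(line)
    local hex, readings = line:match("^U%+(%x+):%s*([^#]-)%s*#")
    if not hex then
        return nil -- comment, blank, or unrecognized line
    end
    local list = {}
    for reading in readings:gmatch("[^,]+") do
        list[#list + 1] = reading
    end
    return utf8.char(tonumber(hex, 16)), list
end

-- Made-up sample lines in the assumed format; a real run would read the
-- downloaded file line by line instead.
local sample = {
    "# hypothetical excerpt",
    "U+4E00: yī  # 一",
    "U+4E0D: bù,bú  # 不",
}

-- character -> list of readings
local pronunciation = {}
for _, line in ipairs(sample) do
    local char, list = parse_line(line)
    if char then
        pronunciation[char] = list
    end
end

print(pronunciation["一"][1])
print(table.concat(pronunciation["不"], " "))
```

A table like pronunciation could then feed an order table for the sorter, taking for instance the first reading of each character as its sort key; how Hans's prototype actually does this is not shown here.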