Character names (was: Context 2005.12.19 released)
Hans Hagen wrote:
Mojca Miklavec wrote:
Taco Hoekwater wrote:
New features since 2005.12.18:
* Support for the latin-9 regime (latin-1 + euro)
There are some more (automatically generated) regime definitions at http://pub.mojca.org/tex/enco/contextbase/ (only from the glyph names that I was able to extract from the existing files, so it's only OK for some of the regimes mentioned there).
If possible, I would like to ask for core support for windows-1250 (perhaps other users may find some other regimes useful as well).
just send me the files you feel confident with
(I'll send the good files soon.)

Except Celtic, Thai, Arabic and Hebrew (although the letter names for Hebrew are almost completely defined) almost all the windows and ISO regimes are OK, just some glyphs are missing (which are, or at least were, missing in Unicode vectors as well). If anyone has suggestions for names for the following characters, 6 additional regimes can be fully supported:

windows-1251 and iso-8859-5
  2116 NUMERO SIGN
windows-1253
  0385 GREEK DIALYTIKA TONOS
  2015 HORIZONTAL BAR
  0384 GREEK TONOS
windows-1258
  0300 COMBINING GRAVE ACCENT
  0309 COMBINING HOOK ABOVE
  0303 COMBINING TILDE
  0301 COMBINING ACUTE ACCENT
  0323 COMBINING DOT BELOW
  20AB DONG SIGN
iso-8859-7
  20AF DRACHMA SIGN
  037A GREEK YPOGEGRAMMENI
  2015 HORIZONTAL BAR
  0384 GREEK TONOS
  0385 GREEK DIALYTIKA TONOS
iso-8859-10
  2015 HORIZONTAL BAR

Mojca
Here's what I can come up with. At least a few are acceptable, like the horizontal bar. \textnumero exists, but is only reachable in cyrillic encodings (fixable, I guess?), and the greek & vietnamese accents are also only usable in the correct encoding. I've used the \text... versions of the accents, but perhaps the actual commands are more correct (like \' and \~).

Cheers, Taco

\starttext
\definecharacter texthorizontalbar {{--\kern 0pt--}}
\definecharacter textdong {\underbar{\dstroke}}
\starttabulate[|c|c|]
\NC 0300 COMBINING GRAVE ACCENT \NC \textgrave \NC \NR
\NC 0309 COMBINING HOOK ABOVE \NC \texthookabove \NC \NR
\NC 0303 COMBINING TILDE \NC \texttilde \NC \NR
\NC 0301 COMBINING ACUTE ACCENT \NC \textacute \NC \NR
\NC 0323 COMBINING DOT BELOW \NC \textbottomdot \NC \NR
\NC 037A GREEK YPOGEGRAMMENI \NC \unknownchar \NC \NR % prime?
\NC 0384 GREEK TONOS \NC \greektonos \NC \NR
\NC 0385 GREEK DIALYTIKA TONOS \NC \greekdialytikatonos \NC \NR
\NC 2015 HORIZONTAL BAR \NC \texthorizontalbar \NC \NR
\NC 20AB DONG SIGN \NC \textdong \NC \NR
\NC 20AF DRACHMA SIGN \NC \unknownchar \NC \NR
\NC 2116 NUMERO SIGN \NC \textnumero \NC \NR
\stoptabulate
\stoptext

Mojca Miklavec wrote:
Taco Hoekwater wrote:
\definecharacter texthorizontalbar {{--\kern 0pt--}}
\definecharacter textdong {\underbar{\dstroke}}
ok, i added those to enco-def.tex (end of file:)

\startencoding[\s!default]
\definecharacter texthorizontalbar {{--\kern\zeropoint--}}
\definecharacter textdong {\underbar{\dstroke}}
\stopencoding

Hans
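A minimal usage sketch for the two definitions above. It assumes they have been loaded from enco-def.tex and that the active font encoding provides \dstroke; neither assumption is checked here, and the sample text is made up for illustration.

```tex
% Sketch only: relies on the two \definecharacter lines quoted above
% being present in the default encoding (enco-def.tex).
\starttext

% 2015 HORIZONTAL BAR, built from two en-dashes with a zero kern between them
\texthorizontalbar\ A line of dialogue introduced by a quotation dash.

% 20AB DONG SIGN, built as an underlined d-stroke
The price is 15000 \textdong.

\stoptext
```

Both names are used as ordinary commands, in the same way as \textnumero in the table earlier in the thread.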
Taco Hoekwater wrote:
Here's what I can come up with. At least a few are acceptable, like the horizontal bar. \textnumero exists, but is only reachable in cyrillic encodings (fixable, I guess?), and the greek & vietnamese accents are also only usable in the correct encoding. I've used the \text... versions of the accents, but perhaps the actual commands are more correct (like \' and \~).
Cheers, Taco
\starttext
\definecharacter texthorizontalbar {{--\kern 0pt--}}
\definecharacter textdong {\underbar{\dstroke}}
Thanks for those ...
\NC 0300 COMBINING GRAVE ACCENT \NC \textgrave \NC \NR
\NC 0309 COMBINING HOOK ABOVE \NC \texthookabove \NC \NR
\NC 0303 COMBINING TILDE \NC \texttilde \NC \NR
\NC 0301 COMBINING ACUTE ACCENT \NC \textacute \NC \NR
\NC 0323 COMBINING DOT BELOW \NC \textbottomdot \NC \NR
I may be wrong, but aren't those used only in combination with other characters? I don't know if TeX (ConTeXt) can handle this (at least not yet). When I wrote the list a couple of days ago I forgot about that fact. If the accent came before the character, this could be replaced by "\buildtextaccent...", but here there's perhaps no solution without some additional macros. (And since the Vietnamese seem to be satisfied with viscii and utf for now, supporting cp1258 is not crucial.)

I double-checked the differences between the existing regimes and the ones that were automatically produced by a script. The list of regimes that are "ripe" for support is thus:

cp125[ 0 | *1 | *2 | 3 | 4 | 7 ]
iso-8859-[ *1 | *2 | 3 | 4 | *5 | *7 | 9 | 13 | *15 | 16 ]
*viscii (with glyph names instead of \"\u\...)

(The ones marked with a star are already supported, perhaps with some inconsistencies. Not supported: Hebrew, Arabic, Vietnamese? for cp125X and Arabic, Thai and Celtic for iso-8859-X.)

I'll send the files (full content is already on my page), but I need to know how to split/group them (I guess it would be a bad idea to have one file for each encoding). Should there be one file for iso-8859 and one for windows encodings? What about those regimes that are already supported? I would like to move at least the "regi-win" (with 8 wrong definitions anyway) to a "less discriminating" place; I don't know what to do with Greek and Cyrillic.

And another set of questions:

1. Can someone check for (in)consistencies for greekupsilondiaeresis vs. greekupsilondialytika? Looks like the same glyph named differently at different places (functionality may break).

2. What to do with

{\cyrillicGJE} {\'\cyrillicG} % 0403 CYRILLIC CAPITAL LETTER GJE
{\cyrillicgje} {\'\cyrillicg} % 0453 CYRILLIC SMALL LETTER GJE
{\cyrillicKJE} {\'\cyrillicK} % 040C CYRILLIC CAPITAL LETTER KJE
{\cyrillickje} {\'\cyrillick} % 045C CYRILLIC SMALL LETTER KJE
{\cyrillicgheupturn} {\cyrillicgup} % 0491 CYRILLIC SMALL LETTER GHE WITH UPTURN

Which variant is better? Would it make sense to define

\definecharacter cyrillicGJE {\buildtextaccent\textacute\cyrillicG}
\defineaccent ' \cyrillicG {\cyrillicGJE}

and then use \cyrillicGJE consistently?

3. PLEASE FIX: in enco-def.tex replace \cdots by something (\dots, I suppose, but I'm not sure)

\definecharacter textellipsis {\mathematics\cdots}

(I guess this "bug" was the reason for changing some definitions in regimes/encodings elsewhere.) Should \textellipsis be used for "2026 HORIZONTAL ELLIPSIS" or anything else?

4. \softhyphen, \hyphen or \- for "00AD SOFT HYPHEN"?

5. Urgently: what to do with quotations (without language discriminations if possible)?

% 201A SINGLE LOW-9 QUOTATION MARK
\quotesinglebase vs. \lowerleftsingleninequote
% 201E DOUBLE LOW-9 QUOTATION MARK
\quotedblbase vs. \lowerleftdoubleninequote
% 2018 LEFT SINGLE QUOTATION MARK
\quoteleft vs. \upperleftsinglesixquote
% 2019 RIGHT SINGLE QUOTATION MARK
\quoteright vs. \upperrightsingleninequote
% 201C LEFT DOUBLE QUOTATION MARK
\quotedblleft vs. \upperleftdoublesixquote
% 201D RIGHT DOUBLE QUOTATION MARK
\quotedblright vs. \upperrightdoubleninequote
% 2039 SINGLE LEFT-POINTING ANGLE QUOTATION MARK
\guilsingleleft vs. \leftsubguillemot
% 203A SINGLE RIGHT-POINTING ANGLE QUOTATION MARK
\guilsingleright vs. \rightsubguillemot
% 00AB LEFT-POINTING DOUBLE ANGLE QUOTATION MARK
\leftguillemot vs. \greekleftquot
(are Greek quotations treated specially or what is this doing in regi-grk?)
% 00BB RIGHT-POINTING DOUBLE ANGLE QUOTATION MARK
\rightguillemot vs. \greekrightquot vs. \prewordbreak\rightguillemot
(in my point of view the last one may be better, but not fair since it's language dependent: may be OK for French, but not for German or vice versa; perhaps a language-sensitive macro could be inserted at this place?)

6. \textnumero, 0x2116 (and perhaps some other characters) should be added to unicode vector 33.

7. files regi-il1 and regi-win have many inconsistencies. I would like to suggest the following renamings:

windows -> cp1252
il1 -> iso-8859-1
il2 -> iso-8859-2
iso88595 -> iso-8859-5
grk -> iso-8859-7 (the new one)

and to add the following lines somewhere:

% or perhaps the other way around
\defineregimesynonym[utf-8][utf]
\defineregimesynonym[utf8][utf]

\defineregimesynonym[windows-1250][cp1250]
\defineregimesynonym[windows-1251][cp1251]
\defineregimesynonym[windows-1252][cp1252]
\defineregimesynonym[windows-1253][cp1253]
\defineregimesynonym[windows-1254][cp1254]
%defineregimesynonym[windows-1255][cp1255] % not supported yet (Hebrew)
%defineregimesynonym[windows-1256][cp1256] % not supported yet (Arabic)
\defineregimesynonym[windows-1257][cp1257]
%defineregimesynonym[windows-1258][cp1258] % not supported yet (Vietnamese)

% for historical reasons
\defineregimesynonym[windows][cp1252]

% 5 - Cyrillic
% 6 - Arabic (not supported)
% 7 - Greek
% 8 - Hebrew (3 signs missing)
% 11 - Thai (not supported)

\defineregimesynonym[il1][iso-8859-1]
\defineregimesynonym[il2][iso-8859-2]
\defineregimesynonym[il3][iso-8859-3]
\defineregimesynonym[il4][iso-8859-4]
\defineregimesynonym[il5][iso-8859-9]
\defineregimesynonym[il6][iso-8859-10]
\defineregimesynonym[il7][iso-8859-13]
%defineregimesynonym[il8][iso-8859-14] % not supported yet
\defineregimesynonym[il9][iso-8859-15]
\defineregimesynonym[il10][iso-8859-16]

\defineregimesynonym[latin1][iso-8859-1]
\defineregimesynonym[latin2][iso-8859-2]
\defineregimesynonym[latin3][iso-8859-3]
\defineregimesynonym[latin4][iso-8859-4]
\defineregimesynonym[latin5][iso-8859-9]
\defineregimesynonym[latin6][iso-8859-10]
\defineregimesynonym[latin7][iso-8859-13]
%defineregimesynonym[latin8][iso-8859-14] % not supported yet
\defineregimesynonym[latin9][iso-8859-15]
\defineregimesynonym[latin10][iso-8859-16]

% for historical reasons
\defineregimesynonym[iso88595][iso-8859-5]
\defineregimesynonym[grk][iso-8859-7]

I can send the new files as soon as it becomes clear how to group them. If additionally the rest of the questions are answered, then the new files can become more consistent without breaking anything.

Sorry for the long mail,
Mojca
Mojca Miklavec wrote:
I'll send the files (full content is already on my page), but I need to know how to split/group them (I guess it would be a bad idea to have one file for each encoding). Should there be one file for iso-8859 and one for windows encodings? What about those regimes that are already supported? I would like to move at least the "regi-win" (with 8 wrong definitions anyway) to a "less discriminating" place, don't know what to do with Greek and Cyrillic.
the problem with one file is that all of them will be loaded, which will make memory and hash usage extreme, so best split it into separate files
PLEASE FIX: in enco-def.tex replace \cdots by something (\dots, I suppose, but I'm not sure)
\definecharacter textellipsis {\mathematics\cdots}
(I guess this "bug" was the reason for changing some definitions in regimes/encodings elsewhere.)
Should \textellipsis be used for "2026 HORIZONTAL ELLIPSIS" or anything else?
that's for taco to decide
(are Greek quotations treated specially or what is this doing in regi-grk?)
% 00BB RIGHT-POINTING DOUBLE ANGLE QUOTATION MARK
\rightguillemot vs. \greekrightquot vs. \prewordbreak\rightguillemot
(in my point of view the last one may be better, but not fair since it's language dependent: may be OK for French, but not for German or vice versa; perhaps a language-sensitive macro could be inserted at this place?)
see core-mis, maybe using \symbol[\c!leftquotation] helps
6. \textnumero, 0x2116 (and perhaps some other characters) should be added to unicode vector 33.
7. files regi-il1 and regi-win have many inconsistencies. I would like to suggest to do the following renamings:
% or perhaps the other way around
\defineregimesynonym[utf-8][utf]
\defineregimesynonym[utf8][utf]

\defineregimesynonym[windows-1250][cp1250]
\defineregimesynonym[windows-1251][cp1251]
\defineregimesynonym[windows-1252][cp1252]
\defineregimesynonym[windows-1253][cp1253]
\defineregimesynonym[windows-1254][cp1254]
%defineregimesynonym[windows-1255][cp1255] % not supported yet (Hebrew)
%defineregimesynonym[windows-1256][cp1256] % not supported yet (Arabic)
\defineregimesynonym[windows-1257][cp1257]
%defineregimesynonym[windows-1258][cp1258] % not supported yet (Vietnamese)

% for historical reasons
\defineregimesynonym[windows][cp1252]
needs some thought
I can send the new files as soon as it becomes clear how to group them. If additionally the rest of the questions are answered, then the new files can become more consistent without breaking anything.
so ... split the files Hans
Hans Hagen wrote:
PLEASE FIX: in enco-def.tex replace \cdots by something (\dots, I suppose, but I'm not sure)
\definecharacter textellipsis {\mathematics\cdots}
(I guess this "bug" was the reason for changing some definitions in regimes/encodings elsewhere.)
Should \textellipsis be used for "2026 HORIZONTAL ELLIPSIS"
Yes. But on the baseline, so:

\definecharacter textellipsis {\periods\relax}

U+2024 (ONE-DOT LEADER):

\definecharacter textonedotleader {\doperiods[1]}

U+2025 (TWO-DOT LEADER):

\definecharacter texttwodotleader {\doperiods[2]}

I believe there is a four-dot leader in unicode as well, but I can't find it right now.

Taco
Taco Hoekwater wrote:
Should \textellipsis be used for "2026 HORIZONTAL ELLIPSIS"
Yes. But on the baseline, so:
OK, thanks.
\definecharacter textellipsis {\periods\relax}
So perhaps fix the unic-032.tex again then.
I believe there is a four-dot leader in unicode as well, but I can't find it right now.
There are many dots in Unicode. Section 205X (2058) for example. Mojca
Hans Hagen wrote:
Mojca Miklavec wrote:
(are Greek quotations treated specially or what is this doing in regi-grk?)
% 00BB RIGHT-POINTING DOUBLE ANGLE QUOTATION MARK
\rightguillemot vs. \greekrightquot vs. \prewordbreak\rightguillemot
(in my point of view the last one may be better, but not fair since it's language dependent: may be OK for French, but not for German or vice versa; perhaps a language-sensitive macro could be inserted at this place?)
see core-mis, maybe using
\symbol[\c!leftquotation]
helps
The other way round: It's not "left quotation mark" that should turn into right guillemot, but right guillemot that should be classified as left or right quotation mark according to the current language in order to guarantee proper line breaking.
I can send the new files as soon as it becomes clear how to group them. If additionally the rest of the questions are answered, then the new files can become more consistent without breaking anything.
Sorry, it was Christmas in between, so "as soon" lasted a bit more than a moment :)
so ... split the files
They're here: http://pub.mojca.org/tex/enco/contextbase/ Esp. for the Cyrillic one some definitions should be added first into the core in order to support some accented characters (\cyrillicGJE, \cyrillicKJE, \cyrillicgheupturn ... - see one of my last mails) Any comments? Mojca
Mojca Miklavec wrote:
\defineregimesynonym[windows-1250][cp1250]
the synonym feature is already in the kernel; the following patch to regi-ini will permit file name synonyms, so \definefilesynonym[regi-win][...]

patch:

\def\douseregime#1% nearly identical to encoding
  {\doifundefined{\c!file\f!regimeprefix#1}%
     {\setvalue{\c!file\f!regimeprefix#1}{}%
      \makeshortfilename[\truefilename{\f!regimeprefix#1}]%
      \startreadingfile
      \readsysfile\shortfilename
        {\showmessage\m!encodings2{#1}}
        {\showmessage\m!encodings3{#1}}%
      \stopreadingfile}}

so, we can make (many) files called regi-cp-1250 and then say

\definefilesynonym[regi-win][regi-cp-1250]
\defineregimesynonym[win][cp1250]

(of course the internals of regi-... should become cp1250 then)

Hans
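For illustration, a hypothetical document that relies on the regime synonym mechanism described above. It is a sketch under assumptions: a regime file defining cp1250 is taken to exist and to be found by the kernel (the file naming was still under discussion in this thread), and the synonym line is the one proposed earlier, not something confirmed to be in the distribution.

```tex
% Assumption: a regime file for cp1250 is installed and defines
% the regime under the internal name cp1250.
\defineregimesynonym[windows-1250][cp1250]

% The document asks for the descriptive name; the synonym
% resolves it to the short internal one.
\enableregime[windows-1250]

\starttext
Text saved by an editor that writes windows-1250 bytes goes here.
\stoptext
```

Keeping one short internal name per regime and layering the longer names on top as synonyms means that old documents using names such as "windows" or "il2" would keep working after a renaming.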
Mojca Miklavec wrote:
\NC 0300 COMBINING GRAVE ACCENT \NC \textgrave \NC \NR
\NC 0309 COMBINING HOOK ABOVE \NC \texthookabove \NC \NR
\NC 0303 COMBINING TILDE \NC \texttilde \NC \NR
\NC 0301 COMBINING ACUTE ACCENT \NC \textacute \NC \NR
\NC 0323 COMBINING DOT BELOW \NC \textbottomdot \NC \NR
I may be wrong, but aren't those used only in combination with other characters? I don't know if TeX (ConTeXt) can handle this (at least not yet).
If the format was <accent> <char>, that would work, but unicode specifies <char> <accent>, and that cannot be done without a special font encoding that uses lots of ligatures. Taco
Taco Hoekwater wrote:
Mojca Miklavec wrote:
\NC 0300 COMBINING GRAVE ACCENT \NC \textgrave \NC \NR
\NC 0309 COMBINING HOOK ABOVE \NC \texthookabove \NC \NR
\NC 0303 COMBINING TILDE \NC \texttilde \NC \NR
\NC 0301 COMBINING ACUTE ACCENT \NC \textacute \NC \NR
\NC 0323 COMBINING DOT BELOW \NC \textbottomdot \NC \NR
I may be wrong, but aren't those used only in combination with other characters? I don't know if TeX (ConTeXt) can handle this (at least not yet).
If the format was <accent> <char>, that would work, but unicode specifies <char> <accent>, and that cannot be done without a special font encoding that uses lots of ligatures.
I thought so. But the issue is not a matter of font designers, but of underlying software. If TeX can't "unget" a character and replace it with the accented one, you can't ask font designers to add dozens of ligatures. Knuth didn't write TeX with Unicode conventions in mind, so I can understand that, I only wonder if XeTeX, Aleph [and exTeX] support such accents. I'll consider this as "leave Windows Vietnamese encoding unsupported" (they have two other encodings anyway). Thanks, Mojca
Mojca Miklavec wrote:
I thought so. But the issue is not a matter of font designers, but of underlying software. If TeX can't "unget" a character and replace it with the accented one, you can't ask font designers to add dozens of ligatures. Knuth didn't write TeX with Unicode conventions in mind, so I can understand that, I only wonder if XeTeX, Aleph [and exTeX] support such accents.
Inside the TFM file, it is fairly straightforward to instruct TeX to create a ligature from "a" followed by "`" to "à". That is a job for texfont or fontinst, not the font designers'. But it qualifies as a hack, not true unicode support. Anyway, no point spending time on it if nobody is going to use it. Cheers, Taco
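The ordering problem discussed above can be sketched as follows. This is an illustration only: it assumes \buildtextaccent and \textgrave behave as they are used earlier in the thread, and it does not implement the ligature approach, which lives in the font metrics rather than in macros.

```tex
\starttext

% Prefix order (<accent> <char>): the accent macro receives the base
% letter as its argument, so the accented glyph can be built.
\buildtextaccent\textgrave a

% Unicode and cp1258 use the opposite order (<char> <accent>):
% the letter a arrives first and is typeset before the combining
% grave (U+0300) is seen, so an ordinary macro has nothing left to
% attach the accent to. Only a ligature rule in the TFM file
% (a followed by the accent slot giving agrave) could emulate it.

\stoptext
```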
On 12/23/05, Hans Hagen wrote:
Mojca Miklavec wrote:
I'll consider this as "leave Windows Vietnamese encoding unsupported" (they have two other encodings anyway).
indeed (also, i never heard vnpenguin ask for it -)
Hi all,

In fact, VnTeX supports UTF-8, VISCII, TCVN and VPS, so it would be nice if ConTeXt could do the same :)

In practice, UTF-8 is becoming the standard charset in Vietnam, and the other charsets (VISCII, VPS, TCVN) are used less and less. Personally, I always use UTF-8 for all my documents (ConTeXt, OpenOffice, HTML, MySQL data, ...).

Thank you for your wonderful work, and Merry Christmas!

--
http://vnoss.org
Vietnamese Open Source Software Community
participants (4)
- Hans Hagen
- Mojca Miklavec
- Taco Hoekwater
- VnPenguin