problems still, within tounicode.c
Hi Akira, Karl
Thanks for your help in sorting out how \pdfglyphtounicode allows access to upper-plane code-points.
However, I think there is still a problem in how pdftex constructs the /ToUnicode CMap. This concerns characters with glyph names that use the ‘.’ qualifying construction; e.g. a.sc, b.sc, … aacute.sc, W.alt, Theta.var1, etc.
It seems that there can be entries for these glyph names within the database, but those entries are never recovered to be written into the CMap.
This is because of the following coding, in tounicode.c, lines 187 onwards:
    /* this function set proper values to *gp based on s; in case it returns
     * gp->code == UNI_EXTRA_STRING then the caller is responsible for freeing
     * gp->unicode_seq too */
    static void set_glyph_unicode(char *s, glyph_unicode_entry * gp)
    {
        ...
        /* strip everything after the first dot */
        p = strchr(s, '.');
        if (p != NULL) {
            *buf = 0;
            strncat(buf, s, p - s);
            s = buf;
        }
        ...
The origin of this coding is surely Adobe’s stated way to establish a default for which character to select for Copy/Paste, Searching, etc.
*** when there is no guidance from a CMap or /ActualText entry. ***
However, pdftex is making it impossible to set such CMap entries for glyphs with qualified names involving the ‘.’ character.
In short, \pdfglyphtounicode allows replacement Unicode strings to be entered into the glyph-name database, but set_glyph_unicode never uses those entries, replacing them instead with the entry for the unqualified glyph name.
The attached file explores this using the libertine-type1.sty package. (Make sure libertine.map is enabled, to use this example.)
My suggestion for altering tounicode.c, within the set_glyph_unicode function block, is to test the full name (including ‘.’s) first, for a database entry. If found, use it. Otherwise, try again using just the prefix (as at present).
Or in case a name is multiply qualified; e.g.,
    delta.sc.ipa (occurs in cmu-tipx.enc)
also
    omega.sc.ipa q.sc.ipa f.sc.ipa
then drop off the qualifications from the end. So test in order:
    delta.sc.ipa
    delta.sc
    delta
Without a fix of this sort, the true small-cap characters that are in Unicode can never be properly addressed, for archival/accessibility considerations, as well as Copy/Paste.
Such characters occur within blocks:
    U+025A - U+02FF   IPA Extensions
    U+1D00 - U+1D7F   Phonetic Extensions
    U+A720 - U+A7FF   Latin Extended-D
    U+FE50 - U+FE6F   Small Form Variants
And of course there are superiors and inferiors in other blocks, which also are affected, when glyph names are used, such as:
    i.superior n.superior /zero.inferior /one.inferior etc.
as is very commonly used in fonts.
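The lookup order proposed above (full name first, then dropping qualifiers from the end) could be sketched in C roughly as follows. This is only an illustration under stated assumptions, not the actual pdftex code: demo_entry, demo_db, demo_lookup and demo_resolve are hypothetical names, the small table merely stands in for the glyph-name database that \pdfglyphtounicode fills, and b.sc.ipa is a made-up glyph name used to show the intermediate fallback.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the glyph-name database. */
typedef struct {
    const char *name;
    long code;
} demo_entry;

static const demo_entry demo_db[] = {
    { "a",     0x0061 },
    { "a.sc",  0x1D00 },  /* LATIN LETTER SMALL CAPITAL A */
    { "b",     0x0062 },
    { "b.sc",  0x0299 },  /* LATIN LETTER SMALL CAPITAL B */
    { "delta", 0x03B4 }
};

/* Linear search by exact name; returns NULL when there is no entry. */
static const demo_entry *demo_lookup(const char *name)
{
    size_t i;
    for (i = 0; i < sizeof demo_db / sizeof demo_db[0]; i++)
        if (strcmp(demo_db[i].name, name) == 0)
            return &demo_db[i];
    return NULL;
}

/* Try the full glyph name first, then drop one ".qualifier" at a time
 * from the end until an entry is found.  Returns the code point, or -1
 * when no candidate matches or the name is too long for the buffer. */
static long demo_resolve(const char *name)
{
    char buf[256];
    char *dot;
    const demo_entry *e;

    assert(name != NULL);
    if (strlen(name) >= sizeof buf)
        return -1;
    strcpy(buf, name);

    for (;;) {
        e = demo_lookup(buf);
        if (e != NULL)
            return e->code;
        dot = strrchr(buf, '.');        /* last qualifier, if any */
        if (dot == NULL || dot == buf)  /* nothing left to strip */
            return -1;
        *dot = '\0';
    }
}
```

With this table, a.sc resolves to its own entry rather than to the entry for a, while delta.sc.ipa falls through delta.sc to delta, which is the order listed above.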
Cheers
Ross
Dr Ross Moore
Mathematics Dept | 12 Wally’s Walk, 734
Macquarie University, NSW 2109, Australia
T: +61 2 9850 8955 | F: +61 2 9850 8114 | M: +61 407 288 255 | E: ross.moore@mq.edu.au
http://www.maths.mq.edu.au
Dear Ross,
> The origin of this coding is surely Adobe’s stated way to establish a default for which character to select for Copy/Paste, Searching, etc.
I think we have to ask the author, Thanh, about this issue. It seems that he considers everything after the first dot to indicate a feature, not to be part of the glyph name.
As you say, I can't find a period in the glyph names in aglfn.txt, glyphlist.txt, or zapfdingbats.txt by Adobe.
Thanks,
Akira
rm> ... test the full name (including ‘.’s) first, for a database entry. If found, use it. Otherwise, try again using just the prefix (as at present).
That surely sounds sensible.
rm> Or in case a name is multiply qualified; e.g., delta.sc.ipa (occurs in cmu-tipx.enc) also omega.sc.ipa q.sc.ipa f.sc.ipa then drop off the qualifications from the end. So test in order: delta.sc.ipa delta.sc delta
Ack.
Thanh, can you confirm that we should go ahead with this plan?
Thanks,
Karl
Hi Karl,
Thanks for looking at this.
On Jun 13, 2017, at 9:10 AM, Karl Berry
On Tue, Jun 13, 2017 at 1:10 AM, Karl Berry
rm> ... test the full name (including ‘.’s) first, for a database entry. If found, use it. Otherwise, try again using just the prefix (as at present).
That surely sounds sensible.
Or in case a name is multiply qualified; e.g., delta.sc.ipa (occurs in cmu-tipx.enc ) also omega.sc.ipa q.sc.ipa f.sc.ipa then drop off the qualifications from the end. So test in order: delta.sc.ipa delta.sc delta
Ack.
Thanh, can you confirm that we should go ahead with this plan?
I think it's reasonable to have this behavior possible.
However, let's say we have
    \textsc{This is a heading}
What is the common expectation when one does copy/paste this text from a pdf viewer? I personally expect to get the same text, not the unicode points for the *.sc glyphs from the actual font.
This brings up another question: perhaps we can add the behavior proposed by Ross, but it has to be enabled explicitly via a primitive?
Regards,
Thanh
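One way to picture the opt-in idea is a switch that selects between the current rule (strip everything after the first dot) and the proposed longest-match rule. The following is a minimal sketch only: opt_qualified_lookup, known_names, is_known and pick_lookup_name are hypothetical names invented for illustration, and no existing pdftex primitive or variable is implied.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical switch; 0 keeps the current behaviour (the default). */
static int opt_qualified_lookup = 0;

/* Hypothetical stand-in for the set of names present in the database. */
static const char *const known_names[] = { "a", "a.sc" };

static int is_known(const char *name)
{
    size_t i;
    for (i = 0; i < sizeof known_names / sizeof known_names[0]; i++)
        if (strcmp(known_names[i], name) == 0)
            return 1;
    return 0;
}

/* Copy into out the name that would be looked up for the given glyph name.
 * Returns 0 on success, -1 if the name does not fit into out. */
static int pick_lookup_name(const char *name, char *out, size_t outsz)
{
    char *dot;

    assert(name != NULL && out != NULL);
    if (strlen(name) >= outsz)
        return -1;
    strcpy(out, name);

    if (opt_qualified_lookup) {
        /* proposed: the longest name with a database entry wins */
        while (!is_known(out)
               && (dot = strrchr(out, '.')) != NULL && dot != out)
            *dot = '\0';
    } else {
        /* current: strip everything after the first dot */
        dot = strchr(out, '.');
        if (dot != NULL)
            *dot = '\0';
    }
    return 0;
}
```

With the switch off, a.sc is looked up as a, so copy/paste of \textsc text yields the base letters; with it on, a.sc is looked up under its own name whenever such an entry exists.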
Hi Thanh,
On Jun 13, 2017, at 5:29 PM, The Thanh Han
On 13.06.17 at 09:46, Ross Moore wrote:
What is the common expectation when one does copy/paste this text from a pdf viewer? I personally expect to get the same text, not the unicode points for the *.sc glyphs from the actual font.
Perhaps. But then the author is stating that it is a heading, but not using proper markup that might be used to include tagging that could convey this.
or the author has written \textsc{ctan} or \textsc{ibm}, in which case one would perhaps expect CTAN and IBM, not ctan and ibm, in the cut-and-paste.
So I think it is more or less impossible to give a general rule that works universally. However, for cut and paste I guess that in most cases you wouldn't want to receive anything other than the base text, not the *.sc glyph code points, similar to wanting plain text when selecting, say, parts of a bold heading.
So in my opinion the default should be plain base text.
As to deducing a heading structure if it is not properly marked up, that seems to me a totally different kind of usage. And yes, there, knowing that something is in a special font (and on a line by itself etc.) might be a useful indicator, but it screws up the other use, so something would need to give.
frank
On 6/13/2017 9:29 AM, The Thanh Han wrote:
On Tue, Jun 13, 2017 at 1:10 AM, Karl Berry <karl@freefriends.org> wrote:
rm> ... test the full name (including ‘.’s) first, for a database entry. If found, use it. Otherwise, try again using just the prefix (as at present).
That surely sounds sensible.
Or in case a name is multiply qualified; e.g., delta.sc.ipa (occurs in cmu-tipx.enc) also omega.sc.ipa q.sc.ipa f.sc.ipa then drop off the qualifications from the end. So test in order: delta.sc.ipa delta.sc delta
i didn't follow this discussion but i'd rather start from the beginning so, delta, delta.sc etc
Ack.
Thanh, can you confirm that we should go ahead with this plan?
I think it's reasonable to have this behavior possible.
However, let's say we have
\textsc{This is a heading}
What is the common expectation when one does copy/paste this text from a pdf viewer? I personally expect to get the same text, not the unicode points for the *.sc glyphs from the actual font.
indeed
This brings up another question: perhaps we can add the behavior proposed by Ross, but it has to be enabled explicitly via a primitive?
Regards, Thanh
_______________________________________________ ntg-pdftex mailing list ntg-pdftex@ntg.nl https://mailman.ntg.nl/mailman/listinfo/ntg-pdftex
-- ----------------------------------------------------------------- Hans Hagen | PRAGMA ADE Ridderstraat 27 | 8061 GH Hasselt | The Netherlands tel: 038 477 53 69 | www.pragma-ade.nl | www.pragma-pod.nl -----------------------------------------------------------------
participants (6)
- Akira Kakuto
- Frank Mittelbach
- Hans Hagen
- Karl Berry
- Ross Moore
- The Thanh Han