Error with \pdfglyphtounicode when surrogates are involved.
Hi all.

I’ve just discovered a problem with \pdfglyphtounicode when you are trying to map a character to a Plane-1 code-point.

Here is a minimal working example that shows the issue.

%%%%% cut here for test file %%%%%%
\pdfcompresslevel=0
\pdfgentounicode=1
\input glyphtounicode.tex
\pdfglyphtounicode{Z}{D835DC81} % MATH bold-italic-Z U+1D481 (U+D835 U+DC81)

Z $Z$
\bye
%%%%% end cut here for test file %%%%%%

Using:

This is pdfTeX, Version 3.14159265-2.6-1.40.18 (TeX Live 2017) (preloaded format=pdftex)

The two fonts both get a bad entry in their /ToUnicode CMap resource, viz.

<5A> <36E537DC81>

instead of the intended:

<5A> <D835DC81>

The Hex string <36E537DC81> is not just wrong, it is actually invalid for a CMap entry, which is supposed to have a multiple of 4 Hex digits, not 10 of them.

A cut&paste of the `Z`s in the PDF output produces Chinese glyphs, which is usually a sign that some UTF-8 sequence has got screwed up.

Of course I don’t really want to map all `Z`s into Plane-1. This is just an easy way to illustrate the problem that I discovered when trying to support proper Cut/Paste of exotic characters in LinLibertine & LinBiolinum fonts.

Cheers

Ross

Dr Ross Moore
Mathematics Dept | 12 Wally’s Walk, 734
Macquarie University, NSW 2109, Australia
T: +61 2 9850 8955 | F: +61 2 9850 8114 | M: +61 407 288 255
E: ross.moore@mq.edu.au
http://www.maths.mq.edu.au
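For readers following the arithmetic, here is a minimal sketch in C of the two facts the report relies on: that the surrogate pair U+D835 U+DC81 stands for U+1D481, and that a hex string of 10 digits cannot be a whole number of UTF-16 code units. The helper names `combine_surrogates` and `is_whole_utf16_hex` are hypothetical and are not part of pdfTeX; the "multiple of 4 Hex digits" rule is taken from the report above.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: combine a UTF-16 surrogate pair into a code point.
 * Returns 0 if hi/lo are not a valid high/low surrogate pair. */
static uint32_t combine_surrogates(uint32_t hi, uint32_t lo)
{
    if (hi < 0xD800 || hi > 0xDBFF || lo < 0xDC00 || lo > 0xDFFF)
        return 0;
    return 0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00);
}

/* Hypothetical helper: 1 if s is a non-empty string of hex digits whose
 * length is a multiple of 4 (whole UTF-16 code units), 0 otherwise. */
static int is_whole_utf16_hex(const char *s)
{
    size_t n = strlen(s);
    if (n == 0 || n % 4 != 0)
        return 0;
    for (size_t i = 0; i < n; i++)
        if (!isxdigit((unsigned char)s[i]))
            return 0;
    return 1;
}
```

With these, `combine_surrogates(0xD835, 0xDC81)` gives `0x1D481`, `is_whole_utf16_hex("D835DC81")` is true, and `is_whole_utf16_hex("36E537DC81")` is false because the string has 10 digits.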
> Here is a minimal working example that shows the issue.

In the original test.tex, I encountered an assertion error in tounicode.c. If I change it as follows:

\pdfcompresslevel=0
\pdfgentounicode=1
\input glyphtounicode.tex
\pdfglyphtounicode{Z}{1D481} % MATH bold-italic-Z U+1D481 (U+D835 U+DC81)

Z $Z$
\bye

I obtained the attached test.pdf, in which I can find two strings <5A> <D835DC81>.

Best,
Akira
Hi Akira,
On 30/05/2017, at 19:39, "Akira Kakuto" wrote:

>> Here is a minimal working example that shows the issue.
>
> In the original test.tex, I encountered an assertion error in tounicode.c. If I change as
>
> \pdfcompresslevel=0
> \pdfgentounicode=1
> \input glyphtounicode.tex
> \pdfglyphtounicode{Z}{1D481} % MATH bold-italic-Z U+1D481 (U+D835 U+DC81)
Yes, I discovered that this works, shortly after posting. But it doesn't explain why what should be valid input is changed to something invalid.

And what should happen with longer strings? For example this works fine:

\pdfglyphtounicode{t_t}{00770077} % break ligature into separate letters

as does adding multiple combining accents (e.g. 0300 etc.) after the Hex for other characters. But what if it was required to map into multiple higher-plane glyphs?

Is the only documentation on this by reading the source file? tounicode.c
> Z $Z$
> \bye
>
> I obtained an attached test.pdf, in which I can find two strings <5A> <D835DC81>.
>
> Best, Akira
Cheers, Ross
> Is the only documentation on this by reading the source file? tounicode.c

As far as I know, yes.

-k
Hi Karl, Akira and others.
On May 31, 2017, at 7:19 AM, Karl Berry
> In the original test.tex, I encountered an assertion error in tounicode.c.

That is assert(code >= 0);. Probably your "long" is 64-bit. My "long" is 32-bit. I think the type of code should be changed to int32_t instead of long.

Best,
Akira
Hi Ross,

> \pdfglyphtounicode{Z}{D835DC81} % MATH bold-italic-Z U+1D481 (U+D835 U+DC81)
> Z $Z$
> ...
> viz. <5A> <36E537DC81>
> instead of the intended: <5A> <D835DC81>

Call me stupid, but I don't understand why the glyphtounicode value does not apply to the text `Z' as well as the math `Z'. The glyph name in cmr10 is /Z just as in cmmi10. Can you, or anyone, elucidate?

(I realize you're reporting a separate bug, that the value gets misinterpreted.)

--thanks, karl.
Hi Karl,
> I realize you're reporting a separate bug, that the value gets misinterpreted

>> \pdfglyphtounicode{Z}{D835DC81}
>> <5A> <36E537DC81>
I confirmed that Ross's \pdfglyphtounicode{Z}{D835 DC81} with a space works ok.

In the case of \pdfglyphtounicode{Z}{D835DC81} I encountered an assertion error because long code = 0XD835DC81 < 0 in my case, where sizeof(long) = 4.

Ross obtained erroneously vh = 0X36E537, vl = 0XDC81 because long code = 0XD835DC81 > 0, if sizeof(long) = 8.

Is assert(code >= 0 && code <= 0X10FFFF) OK or not OK?

(from tounicode.c)

static char *utf16be_str(long code)
{
    static char buf[SMALL_BUF_SIZE];
    long v;
    unsigned vh, vl;

    assert(code >= 0);
    if (code <= 0xFFFF)
        sprintf(buf, "%04lX", code);
    else {
        v = code - 0x10000;
        vh = v / 0x400 + 0xD800;
        vl = v % 0x400 + 0xDC00;
        sprintf(buf, "%04X%04X", vh, vl);
    }
    return buf;
}

Best,
Akira
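The numbers quoted above can be reproduced with a short, self-contained sketch. It copies the arithmetic of the quoted utf16be_str but uses a fixed-width int64_t so that the result does not depend on sizeof(long). The function names `utf16be_unchecked` and `utf16be_checked` are hypothetical, and the checked variant is only one possible reading of the proposed range test (it returns NULL instead of asserting); it is not the fix that was actually applied to pdfTeX.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Same arithmetic as the quoted utf16be_str, with a 64-bit code and no
 * upper bound. Callers must pass code >= 0. Returns a static buffer. */
static const char *utf16be_unchecked(int64_t code)
{
    static char buf[32];
    if (code <= 0xFFFF) {
        snprintf(buf, sizeof buf, "%04X", (unsigned)code);
    } else {
        int64_t v = code - 0x10000;
        unsigned vh = (unsigned)(v / 0x400 + 0xD800);  /* high surrogate */
        unsigned vl = (unsigned)(v % 0x400 + 0xDC00);  /* low surrogate */
        snprintf(buf, sizeof buf, "%04X%04X", vh, vl);
    }
    return buf;
}

/* Sketch of the proposed check: accept only 0 .. 0x10FFFF.
 * Returns NULL for anything outside that range. */
static const char *utf16be_checked(int64_t code)
{
    if (code < 0 || code > 0x10FFFF)
        return NULL;
    return utf16be_unchecked(code);
}
```

Here `utf16be_unchecked(0x1D481)` gives `"D835DC81"`, while `utf16be_unchecked(0xD835DC81LL)` gives the 10-digit string `"36E537DC81"` from the original report. With the range check, `utf16be_checked(0xD835DC81LL)` is rejected instead of producing an invalid entry.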
Hi Akira,
On May 31, 2017, at 2:59 PM, Akira Kakuto
participants (3)
- Akira Kakuto
- Karl Berry
- Ross Moore