Re: [NTG-context] ActualText

19 Sep 2009

Arthur Reutenauer wrote:
...
He means "ActualText tags" :-)  See the PDF spec section 14.9.4, page 623.
It's a more generic way to support searching than ToUnicode vectors: you just
specify the actual string of underlying Unicode characters.  The PDF spec uses
hyphenated "ck" in German as an example: you typeset "Druk-ker" but you want to
search for "Drucker".  You can't do that with ToUnicode vectors.
You also need ActualText tags to mark the difference between a
discretionary hyphen and an explicit hyphen in English, which programs
like Reader use when extracting text. When the hyphen is discretionary,
you set the ActualText to U+00AD (soft hyphen) instead of U+002D
(hyphen-minus). (That's mentioned somewhere in the PDF spec.)
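
To make the two cases above concrete, here is a minimal Python sketch of
what such a tag could look like in a page content stream. The helper
names (pdf_utf16_hex, actual_text_span) are made up for illustration,
and the sketch assumes a font whose character codes coincide with ASCII
so that "(Druk-) Tj" shows the expected glyphs. It is not how ConTeXt,
ant, or any other tool actually emits these spans.

```python
def pdf_utf16_hex(text: str) -> str:
    """Encode text as a PDF hex string: UTF-16BE preceded by the FE FF byte-order mark."""
    data = b"\xfe\xff" + text.encode("utf-16-be")
    return "<" + data.hex().upper() + ">"


def actual_text_span(show_ops: str, actual: str) -> str:
    """Wrap text-showing operators in a marked-content span that carries ActualText."""
    return f"/Span << /ActualText {pdf_utf16_hex(actual)} >> BDC\n{show_ops}\nEMC"


# A line-break hyphen that should extract as U+00AD rather than U+002D.
print(actual_text_span("(-) Tj", "\u00ad"))

# The German example: "Druk-ker" on the page, "Drucker" when searched.
print(actual_text_span("(Druk-) Tj T* (ker) Tj", "Drucker"))
```

The BDC and EMC operators have to sit inside the same text object as the
operators they wrap, so a real generator would emit the span between BT
and ET.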

Another thing I just thought of that isn't always done is that there
should be explicit space characters between words, including at the
ends of lines, although I'm not sure whether Adobe Reader turns off
its word-boundary heuristics if it sees space characters.

Since what I enjoy doing is making e-books that can be searched
through and, perhaps more importantly, extracted from via the Select
tool, it's important to me to make the search, selection, and
extraction features work. I'll use them myself if I choose, for
instance, to quote from an e-book I made. I've added them in my
(heavily) modified version of ant, but that's in a primitive state, a
long-term project that competes with font-making and e-book-making for
time, and so I'd like to have ConTeXt as well. I like ConTeXt a lot.

Also, I noticed when playing around with the examples from the "Th"
ligature discussion that searching and extraction didn't work with
small caps, though it did work with the ligature. With ActualText tags
these things always work, regardless of the ToUnicode map's
contents. The way Cairo's PDF backend handles this is to use an
ActualText tag for any glyphs that aren't included in the font's
encoding. What I did in my modified ant is to generate a ToUnicode map
from the Adobe glyph naming convention
(http://www.adobe.com/devnet/opentype/archives/glyph.html) and then
put an ActualText tag on anything that happens not to match what you
would get from the ToUnicode mapping.
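
The idea in the last paragraph can be sketched roughly as follows. This
is a Python illustration, not the code from my modified ant: AGL_EXCERPT
is a handful of entries standing in for the full Adobe Glyph List, the
function names are invented here, and glyph names such as "T_h" and
"h.sc" are only examples of common naming practice. The name-parsing
rules follow the Adobe convention: drop any suffix after the first
period, split ligature names on underscores, then look each component up
in the list or decode it as uniXXXX or uXXXX.

```python
import re

# Tiny stand-in for the full Adobe Glyph List (assumption: a real tool
# would load the complete table rather than hard-code a few entries).
AGL_EXCERPT = {
    "T": "\u0054", "h": "\u0068", "f": "\u0066", "i": "\u0069",
    "hyphen": "\u002d", "space": "\u0020", "germandbls": "\u00df",
}


def component_to_text(component: str) -> str:
    """Map one underscore-separated component of a glyph name to Unicode text."""
    if component in AGL_EXCERPT:
        return AGL_EXCERPT[component]
    # "uni" + groups of four uppercase hex digits, none of them surrogates.
    m = re.fullmatch(r"uni((?:[0-9A-F]{4})+)", component)
    if m:
        digits = m.group(1)
        codes = [int(digits[i:i + 4], 16) for i in range(0, len(digits), 4)]
        if all(not 0xD800 <= c <= 0xDFFF for c in codes):
            return "".join(map(chr, codes))
        return ""
    # "u" + four to six uppercase hex digits naming a single code point.
    m = re.fullmatch(r"u([0-9A-F]{4,6})", component)
    if m:
        code = int(m.group(1), 16)
        if code <= 0x10FFFF and not 0xD800 <= code <= 0xDFFF:
            return chr(code)
    return ""


def glyph_name_to_text(name: str) -> str:
    """Text a ToUnicode entry derived from the glyph name would produce."""
    base = name.split(".", 1)[0]  # drop suffixes such as ".sc"
    return "".join(component_to_text(c) for c in base.split("_"))


def needs_actual_text(glyph_names, intended: str) -> bool:
    """True when the name-derived mapping does not reproduce the intended text."""
    mapped = "".join(glyph_name_to_text(n) for n in glyph_names)
    return mapped != intended


print(needs_actual_text(["T_h", "i"], "Thi"))   # ligature maps cleanly
print(needs_actual_text(["hyphen"], "\u00ad"))  # discretionary hyphen does not
```

Only the runs for which needs_actual_text returns True would get a tag,
so ordinary text stays untagged and the content stream does not grow
much.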

(For reasons that were stupid, I once created a lame little C library
to do the mapping from glyph names to Unicode, using a compressed
lookup trie:
http://code.google.com/p/kompostilo/source/browse/#svn/trunk/support-librari...
)

Barry Schwartz