Re: [NTG-pdftex] Incomplete CharSet causes failure with PDF/A validation

27 Jan 2019

Hi Hans, and others

On 26 Jan 2019, at 8:09 pm, Hans Hagen <j.hagen@xs4all.nl> wrote:
...
PDF/A-2 and PDF/A-3 relax many of those 'may not include's,
which are mostly things that TeX does support.
The optionality of /CharSet is just another such relaxation.
just wondering: do you see any technical advantage in this CharSet bit
array, other than it being an option to predict maybe font memory
allocation demands or so (which then in turn is useless as the pdf
format has many aspects that will bloat memory usage anyway)

I can envisage a possible use for having this knowledge of which glyphs
are available internally in a font subset.

PDFs are now editable, at least in Acrobat Pro.
So knowing what characters are available lets software easily determine
whether a simple edit that changes or adds characters to a text block
can simply be performed using the embedded font subset,
or whether a font substitution is needed to do the specific edit.
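
A minimal sketch of that check, assuming the /CharSet value is available as a string of slash-prefixed glyph names. The function names and the sample subset are hypothetical, and mapping the edited text to glyph names is left out:

```python
def parse_charset(charset: str) -> set[str]:
    """Split a /CharSet string such as '/a/agrave/b' into a set of glyph names."""
    return {name.strip() for name in charset.split("/") if name.strip()}


def can_edit_in_place(charset: str, needed_glyphs: set[str]) -> bool:
    """True if every glyph the edit needs is listed in the subset's CharSet."""
    return needed_glyphs <= parse_charset(charset)


# Hypothetical subset that contains only these five glyphs.
subset = "/T/e/X/a/agrave"

print(can_edit_in_place(subset, {"T", "e", "X"}))  # all glyphs are in the subset
print(can_edit_in_place(subset, {"T", "e", "x"}))  # 'x' is missing: substitution needed
```

The check is only as good as the CharSet itself: if the list is incomplete, an editor would substitute a font when it did not need to.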

Of course it is preferable to not have to substitute, as this can change
the metrics, hence potentially making a noticeable change to the
visual appearance of that text block.

If you have ever tried to edit a PDF made by someone else (with TeX
or Word or …) then you should have experienced how things can
move around significantly within the same paragraph.
...
...
Anyway, right now the choices are a) omit /CharSet or
b) output a possibly-incorrect CharSet.
If there was a primitive that can control this, then that would
potentially be enough, at least for the present.
It would allow the CharSet to be omitted with PDF/A-2,3
but included with PDF/A-1.
in luatex it's an option

At what level?
Can it be done on a font-by-font basis? That would be ideal.

If just a command-line option when calling  lualatex  then that is
kind of workable.
Essentially it would require a user to have done a preflight check
and found that one of the fonts has a CharSet problem.
Then rerun with the option set, to get a valid PDF/A-2 (or 3) document.

It would be affecting all the Type-1 fonts, not just one of them.
The ability (described above) to later edit the PDF would be lost pretty
much entirely.
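
The font-by-font alternative could be sketched as a small decision rule. This is only an illustration of the policy discussed here, not an existing pdfTeX or LuaTeX interface; the function, the `charset_is_trusted` flag and the font names are all hypothetical:

```python
def should_write_charset(pdfa_part: int, charset_is_trusted: bool) -> bool:
    """Decide whether to emit /CharSet for one Type 1 font subset.

    pdfa_part: the PDF/A part being targeted (1, 2 or 3).
    charset_is_trusted: hypothetical flag, True when the glyph list is
    known to be complete for this font.
    """
    if pdfa_part == 1:
        # PDF/A-1: the CharSet is included.
        return True
    # PDF/A-2 and PDF/A-3: the entry is optional, so omit it only
    # for fonts whose list may be incorrect.
    return charset_is_trusted


# Hypothetical per-font knowledge, e.g. gathered from a preflight check.
fonts = {"fontA": True, "fontB": False}

decisions = {name: should_write_charset(2, ok) for name, ok in fonts.items()}
print(decisions)
```

With a per-font rule like this, only the problematic font loses its CharSet, and the others stay usable for later editing.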
...
This distinction would need to be documented (in  pdfx.pdf  say )
so that authors can understand the issue and choose the appropriate
package-loading option for their own circumstances.
I’m happy to do this.
...
But I’ve not yet looked at how the subsetted font is constructed.
My thought is that the latter needs to adjust the  gl_tree  before it is
used.
As I said previously, this will be a timing issue; so I’m not confident that
I could correctly write the necessary coding, using programming structures
that I don’t fully understand.
i don't know about pdftex but it is something delayed to the last when
the 'combined' font resource is added as different tex fonts using the
same resource can get different entries (and width arrays) but share the
blobs

My understanding of the code in  writefont.c  is that the Font Descriptor
dictionary is constructed (and written) as a complete object, before the font
subset itself is constructed.
For the CharSet, the entries in gl_tree are used, based upon a list of the characters
explicitly using that font. This does *not* include implicit glyphs, such as
 /grave (and perhaps /a ) with /agrave .
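
To make the gap concrete, here is a sketch of the adjustment I have in mind: extend the set of explicitly used glyph names with the glyphs they depend on, before the CharSet string is written. It is not the pdfTeX code. A plain Python set stands in for gl_tree, and the `COMPONENTS` table is a hypothetical stand-in for whatever the font program says about composite glyphs:

```python
# Hypothetical table: composite glyph -> glyphs it is built from.
COMPONENTS = {"agrave": ("a", "grave")}


def charset_with_implicit(used, components):
    """Return the used glyph names plus every glyph they depend on."""
    result = set()
    stack = list(used)
    while stack:
        glyph = stack.pop()
        if glyph in result:
            continue  # already handled
        result.add(glyph)
        # Queue the components too, in case they are composites themselves.
        stack.extend(components.get(glyph, ()))
    return result


def format_charset(glyphs):
    """Build a /CharSet-style string of slash-prefixed names, sorted."""
    return "".join("/" + name for name in sorted(glyphs))


# Only /agrave is used explicitly in the document.
print(format_charset(charset_with_implicit({"agrave"}, COMPONENTS)))
# prints /a/agrave/grave
```

The timing problem remains: this closure has to be computed before the Font Descriptor is written, which is earlier than the point where the subset is built.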

It was such a circumstance that initiated this conversation roughly a year ago.
I looked at solutions like writing the accent characters in white, outside the page
boundaries, as an /Artifact say.  But this begets a range of difficulties, and could
potentially affect the pagination or typesetting, and can fail other accessibility checks.
I want to develop reliable means to construct documents simultaneously for both
Archivability and Accessibility.
...
...
(I highly doubt that Thanh has time to
look into this.) Sorry, but that's the reality. -k
it's probably not that complex; i also doubt if the quality of that
vector should be perfect as probably only its presence is checked, not
its internal validity (which then would also demand checking fonts which
afaik doesn't happen in detail); and i bet that viewers ignore its
content anyway
From the veraPDF link that Reinhard provided, it seems that presence
is checked with PDF/A-1, but not accuracy.
But for PDF/A-2 and 3, there is a more detailed check for accuracy.
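
A simplified model of that difference, as I read it. This is not veraPDF's implementation; "accuracy" is assumed here to mean that the listed names agree exactly with the glyphs in the embedded font program, and the function name is hypothetical:

```python
def check_charset(pdfa_part, charset, glyphs_in_font):
    """Return a list of problems for one Type 1 font subset.

    charset: the /CharSet string, or None if the entry is absent.
    glyphs_in_font: set of glyph names actually present in the font program.
    """
    problems = []
    if charset is None:
        if pdfa_part == 1:
            problems.append("CharSet is missing")
        return problems  # absent is acceptable for parts 2 and 3
    if pdfa_part in (2, 3):
        listed = {name for name in charset.split("/") if name}
        missing = sorted(glyphs_in_font - listed)
        extra = sorted(listed - glyphs_in_font)
        if missing:
            problems.append("CharSet does not list: " + ", ".join(missing))
        if extra:
            problems.append("CharSet lists absent glyphs: " + ", ".join(extra))
    return problems


font_glyphs = {"a", "grave", "agrave"}
print(check_charset(1, "/agrave", font_glyphs))  # presence only: no problems
print(check_charset(2, "/agrave", font_glyphs))  # reports a and grave as unlisted
print(check_charset(2, None, font_glyphs))       # omitted: no problems
```

Under this model the same incomplete CharSet passes for PDF/A-1 and fails for PDF/A-2, while omitting it avoids the failure for PDF/A-2.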

Perhaps true for viewers; but PDFs are becoming about *more* than just
the visual view. We want to be providing the structures required for accurate
text extraction and editing. TeX was never designed with this in mind, but
because of its programmability this is something that should be achievable.

Hans

-----------------------------------------------------------------
Hans Hagen | PRAGMA ADE
Ridderstraat 27 | 8061 GH Hasselt | The Netherlands
tel: 038 477 53 69 | www.pragma-ade.nl | www.pragma-pod.nl
-----------------------------------------------------------------

Cheers.

Ross

Ross Moore