Re: [NTG-context] Doc to ConTeXt [was Re: HTML to ConTeXt]

10 Nov 2007


On Fri, 09 Nov 2007 18:30:36 -0700, Andrea Valle wrote:
...
After wasting my time with Acrobat's awful PDF-to-HTML converter,
I discovered this, which you may all know:
http://pdftohtml.sourceforge.net/
Looks impressive...
...
The HTML conversion is very good, both in the rendered result and
in the source, but after some tweaking I got interested in the XML
conversion it also offers.
The XML format essentially encodes page layout information;
typically each line is an element. In addition, bold and italics are
marked simply as <b> and <i>.
I'm still struggling to understand how XML processing actually works
in ConTeXt, so I switched back to Python.
I used an incremental SAX parser with some replacements.
This is today's draft.
Original:
http://www.semiotiche.it/andrea/membrana/02%20imp.pdf
Recomposed (no setup at all, only \enableregime[utf]):
http://www.semiotiche.it/andrea/membrana/02imp.pdf
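
For readers who want to try the same route, here is a minimal sketch
of the SAX step. It is not the script used for the draft above. It
assumes that the XML output wraps each line in a <text> element and
marks bold and italics inline as <b> and <i>; the class name, the
function name and the mapping to {\bf ...} and {\em ...} are
illustrative choices only.

```python
import xml.sax


class LineToContext(xml.sax.ContentHandler):
    """Collect one ConTeXt-marked string per <text> element."""

    # Assumed mapping from inline tags to ConTeXt font switches.
    OPEN = {"b": "{\\bf ", "i": "{\\em "}

    def __init__(self):
        super().__init__()
        self.lines = []   # finished lines, in document order
        self.buf = None   # None while we are outside a <text> element

    def startElement(self, name, attrs):
        if name == "text":
            self.buf = []
        elif name in self.OPEN and self.buf is not None:
            self.buf.append(self.OPEN[name])

    def characters(self, content):
        # May be called several times per element, so accumulate.
        if self.buf is not None:
            self.buf.append(content)

    def endElement(self, name):
        if name == "text" and self.buf is not None:
            self.lines.append("".join(self.buf))
            self.buf = None
        elif name in self.OPEN and self.buf is not None:
            self.buf.append("}")


def xml_to_context_lines(chunks):
    """Feed the XML incrementally and return the converted lines."""
    handler = LineToContext()
    parser = xml.sax.make_parser()   # the expat reader supports feed()/close()
    parser.setContentHandler(handler)
    for chunk in chunks:
        parser.feed(chunk)
    parser.close()
    return handler.lines
```

The sketch does not escape characters that are special to TeX (such
as %, & or #), and it ignores the position and font attributes, so a
real conversion would need more work on both points.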
Looks VERY impressive... Tell me, how did you set up the cropmarks etc.?
...
pdf --> pdftoxml --> xml --> python script --> tex --> pdf
I recovered paragraphs, bold, emphasis and footnotes, stripping
dashes and reassembling the text with footnote references. Not bad
as a first step.
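
The reassembly step could look roughly like the sketch below. It
assumes that "stripping dashes" means removing the hyphen left at the
end of a line when a word was broken across lines; the function name
join_lines is hypothetical, and footnote handling is left out.

```python
def join_lines(lines):
    """Join extracted lines into one paragraph, undoing line-end hyphenation."""
    out = ""
    for line in lines:
        line = line.strip()
        if not line:
            continue                    # skip empty lines
        if out.endswith("-"):
            # Assumption: a trailing hyphen is always a line-break hyphen.
            out = out[:-1] + line
        elif out:
            out += " " + line
        else:
            out = line
    return out
```

For example, join_lines(["The conver-", "sion is very", "good."])
gives "The conversion is very good.". A genuine hyphen that happens
to fall at the end of a line (as in "well-" / "known") would be
merged wrongly, so the result still needs proofreading.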
Did you also try pdftohtml --> html --> context?
...
I guess that you XML gurus could probably do this much more easily
and cleanly.
So, just for my very specific needs, I can probably take Word
sources, convert them to PDF, and then finally reach ConTeXt as
discussed.
Again, very nice stuff!

Best wishes
Idris

-- 
Professor Idris Samawi Hamid, Editor-in-Chief
International Journal of Shi`i Studies
Department of Philosophy
Colorado State University
Fort Collins, CO 80523
