Hi Javier,

Javier Bezos wrote:
Taco:
From the excerpt:
From now on, whenever \LUATEX\ has to open a text file, it will call the function \type{file_opener} instead of actually opening the file itself. It stores the returned table in its memory, and it uses the function attached to the \type{reader} label for reading lines.
If I've understood correctly, this applies to files that are not yet open, but usually the encoding is stated inside the file (i.e., the file is already open by the time the encoding is known).
That is not a problem, because *you* are the one opening the file; it is completely under your control. Assume for a moment, if you will, that all files begin with a first line that contains a statement like this:

    % encoding=iso-8859-2

Here is an example of how you could extract that information from the files without confusing the rest of the system (-- starts a line comment that you can use in pure .lua files):

    -- input: a file object
    -- output: a string representing that file's encoding
    function find_file_encoding (f)
      -- read a line
      local line = f:read()
      -- reset the file offset (not really needed in this case)
      f:seek("set",0)
      -- an empty file has no first line, return a default
      if line == nil then
        return "iso-8859-1"
      end
      -- search for encoding
      -- %w == all alphanumerics,
      -- %- == a literal dash
      local fchar, lchar, match = line:find("encoding=([%w%-]+)")
      if fchar == nil then
        -- no encoding found, return a default
        return "iso-8859-1"
      else
        return match
      end
    end

You now have to hook this new function into 'file_opener', like so:

    function file_opener (fname)
      local f = io.open(fname)
      if f == nil then
        return nil
      else
        local encoding = find_file_encoding(f)
        local readline = function ()
          local line = f:read()
          if line == nil then
            return nil
          else
            return latin_to_utf(line, encoding)
          end
        end
        return { reader = readline }
      end
    end

Now you know the file encoding and can make decisions based on that information (by changing 'latin_to_utf', see below).
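To try the pattern match on its own, without a file or LuaTeX, here is a minimal sketch for a stand-alone Lua interpreter. The helper name 'encoding_from_line' is made up for this illustration, and the iso-8859-1 default is the same assumption as in the function above.

```lua
-- Hypothetical helper (not part of LuaTeX): the pattern match from
-- find_file_encoding, applied to a string instead of a file object.
function encoding_from_line (line)
  -- no line at all (empty file): fall back to the assumed default
  if line == nil then
    return "iso-8859-1"
  end
  -- %w matches alphanumerics, %- is a literal dash
  local fchar, lchar, match = line:find("encoding=([%w%-]+)")
  -- match is nil when the marker is absent
  return match or "iso-8859-1"
end

print(encoding_from_line("% encoding=iso-8859-2"))  -- prints iso-8859-2
print(encoding_from_line("\\starttext"))            -- prints iso-8859-1
```

The capture stops at the first character that is neither alphanumeric nor a dash, so anything following the encoding name on that line is ignored.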
But where is the input encoding? Apparently this changes the "Unicode" representation from 8 bits (thus limited to the range 0-255, which is certainly latin-1) to utf-8, without reencoding anything (say, iso greek, koi8, macos, jis, etc.). I've googled for docs on unicode for lua but I haven't found anything particularly useful.
An 8-bit encoding is nothing more than a mapping of 256 byte values into Unicode code points. In the simplest case, this is an identity map, and the only difference is in the file format representation (that is what happened in my original example). In a somewhat less trivial case, there is an array of 256 values. Such an array could look like this:

    -- table values are borrowed from ConTeXt.
    encodings = {
      ["iso-8859-2"] = {
        [0] =
        0x0000, 0x0001, 0x0002, 0x0003, 0x0004, 0x0005, 0x0006, 0x0007,
        --
        -- 240 other entries
        --
        0x0159, 0x016F, 0x00FA, 0x0171, 0x00FC, 0x00FD, 0x0163, 0x02D9
      }
    }

Having this table, we can now rewrite the 'latin_to_utf' function:

    function latin_to_utf (line,enc)
      local s = ""
      for c in string.bytes(line) do
        if encodings[enc] ~= nil then
          s = s .. unicode.utf8.char(encodings[enc][c])
        else
          -- default is pass-through
          s = s .. unicode.utf8.char(c)
        end
      end
      return s
    end

The resulting lua code is in the attached .lua file (with the full table, of course).

For 16-bit encodings etc. the remapping is more complex, of course, but this example should hopefully be enough to give you an idea of how to approach it.

There is one big caveat I should warn about: because the current LuaTeX is essentially a merge of Aleph and pdfTeX, you almost certainly need an OTP to convert the resulting Unicode values back to font encodings. And that problem is why the font and hyphenation subsystems need to be tackled next, before anything else. Which is what I'll start on next Monday.

Best,
Taco
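For experimenting with the remapping idea outside LuaTeX, here is a minimal sketch that runs in a stand-alone Lua interpreter. It does not use 'string.bytes' or 'unicode.utf8.char'. The names 'utf8_char', 'sample_map' and 'remap_line' are made up for this illustration, and the two table entries are the first two values of the last row quoted above, assuming that row fills slots 0xF8 to 0xFF of the 256-entry table.

```lua
-- Hypothetical stand-in for unicode.utf8.char: encode one code point
-- below 0x10000 as UTF-8 using only plain Lua arithmetic.
function utf8_char (cp)
  if cp < 0x80 then
    return string.char(cp)
  elseif cp < 0x800 then
    return string.char(0xC0 + math.floor(cp / 0x40),
                       0x80 + cp % 0x40)
  else
    return string.char(0xE0 + math.floor(cp / 0x1000),
                       0x80 + math.floor(cp / 0x40) % 0x40,
                       0x80 + cp % 0x40)
  end
end

-- Sparse sample map (assumption: slots 0xF8 and 0xF9 of iso-8859-2).
sample_map = { [0xF8] = 0x0159, [0xF9] = 0x016F }

-- Remap every byte of 'line' through 'map'; bytes without an entry
-- pass through unchanged, i.e. they are treated as latin-1.
function remap_line (line, map)
  local out = {}
  for i = 1, #line do
    local b = line:byte(i)
    out[#out + 1] = utf8_char(map[b] or b)
  end
  return table.concat(out)
end

-- byte 0xF8 becomes U+0159, which is the two bytes C5 99 in UTF-8
local bytes = { remap_line("\248", sample_map):byte(1, -1) }
print(string.format("%02X %02X", bytes[1], bytes[2]))  -- prints C5 99
```

Collecting the pieces in a table and joining them once with table.concat avoids building a new intermediate string for every input byte, which matters for long lines.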