Hi,
Yeah, I got distracted for a while, actually into writing some fancy new tools for reverse engineering, something I've intended to do for quite a while now but never got inspired enough by any one problem :) I guess I'm trying to figure out the format mostly out of curiosity; I agree it's likely to be much easier to get to the texts just by scripting.
I still have at least one of the lookup tables to figure out; it seems that it (probably something like a Huffman tree) is read from the .PUB files, but compressed in yet another way, with something called a "nibble codec". It might start to make sense once I figure out the offset it's read from.
Anyway, some pieces about the .PUB format, just since someone is going to be curious anyway. I've been looking at a file named km1979_e.pub from the 2006 version. It contains an article named "What has happened to love?". The sha1sum of the file is 6b4c4b0c7f93c04d8aa685fb773f6826af681d35.
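For anyone who wants to follow along with the offsets below, it may help to check first that you have the same file. A minimal sketch; the function names are my own, and only the hash itself comes from the post:

```python
import hashlib

# SHA-1 quoted in the post for km1979_e.pub (2006 version).
EXPECTED_SHA1 = "6b4c4b0c7f93c04d8aa685fb773f6826af681d35"


def sha1_of_file(path, block_size=1 << 16):
    """Return the hex SHA-1 digest of the file at `path`, read block by block."""
    digest = hashlib.sha1()
    with open(path, "rb") as handle:
        for block in iter(lambda: handle.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()


def is_same_file(path):
    """True if the file at `path` matches the hash quoted in the post."""
    return sha1_of_file(path) == EXPECTED_SHA1
```

If is_same_file("km1979_e.pub") returns False, the offsets quoted below will probably not line up with your copy.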
This is going to contain stuff that won't make much sense without referencing the contents of the file; I guess it's of value mostly if someone else wants to try to figure it out now or later.
The first 16 bytes are something called the URE header and seem to be pretty fixed:
00000000 55 52 45 53 04 00 03 00 00 00 00 00 00 00 00 00 |URES............|
I seem to remember that the 04 here says this is a PUB file, as opposed to one of the other file types (like indices, NWT, etc.), but my notes are hazy about this. 03 might be the version, which the code requires to be 03.
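As a rough sketch of reading that header: the field names are my own, and treating 04 and 03 as little-endian 16-bit values at offsets 4 and 6 is an assumption (they could just as well be single bytes followed by padding).

```python
import struct

# The 16 header bytes quoted in the hex dump above.
URE_HEADER = bytes.fromhex("55524553040003000000000000000000")


def parse_ure_header(data):
    """Split the first 16 bytes into magic, file type and version.

    Assumption: 04 and 03 are little-endian 16-bit fields at offsets 4 and 6.
    The remaining 8 bytes are all zero in this file and are kept as-is.
    """
    magic, file_type, version = struct.unpack_from("<4sHH", data, 0)
    if magic != b"URES":
        raise ValueError("bad magic: %r" % (magic,))
    return {
        "magic": magic,
        "file_type": file_type,   # 4 is possibly "PUB file"
        "version": version,       # 3 is possibly the format version
        "reserved": data[8:16],
    }


header = parse_ure_header(URE_HEADER)
print(header)
```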
After that comes the PUB header, 8 bytes:
00000010 00 00 84 00 00 00 03 09
Here, I think the dword at 0x12, that is 0x84, is the number of entries in some structure which I for now call (rightly or wrongly) the Document Area Position Info Chunk. I think the 0x09 is probably the number of entries in the index data structure that immediately follows at offset 0x18 (the Chunk Information table).
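In code, my current reading of those 8 bytes looks roughly like this. The field names are guesses that follow the description above; the two bytes at 0x10 and the 03 at 0x16 are simply carried along as unknowns.

```python
import struct

# The 8 PUB header bytes found at file offset 0x10.
PUB_HEADER = bytes.fromhex("0000840000000309")


def parse_pub_header(data):
    """Unpack the PUB header; all names here are guesses, not known facts.

    Layout assumed: 2 unknown bytes, a little-endian dword (0x12),
    1 unknown byte (0x16), and a 1-byte entry count (0x17).
    """
    unknown_word, position_entries, unknown_byte, chunk_entries = struct.unpack(
        "<HIBB", data[:8]
    )
    return {
        "unknown_word": unknown_word,
        "position_entries": position_entries,  # 0x84 in this file
        "unknown_byte": unknown_byte,          # 0x03 in this file
        "chunk_entries": chunk_entries,        # 0x09 in this file
    }


pub = parse_pub_header(PUB_HEADER)
print(hex(pub["position_entries"]), pub["chunk_entries"])
```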
At offset 0x18 is the Chunk Information table. It consists of 0x09 (from the previous header) 9-byte entries, each of which contains a one-byte tag and two dwords. The latter of the two dwords seems to (usually?) be an index into the document, and I guess it might make sense for the other to be the size of the element pointed to.
00000018: 00 0e 02 00 00 e6 55 00 00 02 25 00 00 00 59 00 00 00...
That is, tag 0, dwords 0x20e and 0x55e6. One of the other entries is {4, 0x261cc, 0x5ec4}. At 0x5ec4 + 0x10 (for the URE header) there seems to be some kind of structure that lists the offsets of the chunks of the article in question; in any case, it has further pointers into the file. Inside that structure, at offset 0x6029, for example, we have the value 0x193a6, which I think points to the header of the chunk structure at something like 0x10 (for the URE header) + 0x5ec4 (for the above structure) + 0x193a6 = 0x1f27a. Then at 0x1f27a + 4*0x84 + 5 (I don't know yet...) = 0x1f48f there is some kind of header for the text chunk, with some BTEC1 (whatever) encoded information. After that comes what looks like a possible file name for the original text, 0902_K79.LOV. After that, at 0x1f4a2, is what I think is a 5-byte header describing the codec used for the text, details yet unknown.
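To make the entry layout and the offset arithmetic above easier to check, here is a small sketch. It assumes little-endian dwords and uses only the 18 table bytes quoted in the dump (the first two entries); the names are mine, and the meaning of each field is still a guess.

```python
import struct

URE_HEADER_SIZE = 0x10

# One Chunk Information entry: 1-byte tag plus two little-endian dwords = 9 bytes.
ENTRY = struct.Struct("<BII")

# The first 18 bytes of the table at 0x18, as quoted in the hex dump.
TABLE_START = bytes.fromhex("000e020000e6550000" "022500000059000000")


def parse_chunk_table(data, count):
    """Return `count` (tag, first_dword, second_dword) tuples from `data`."""
    return [ENTRY.unpack_from(data, i * ENTRY.size) for i in range(count)]


for tag, first, second in parse_chunk_table(TABLE_START, 2):
    print(tag, hex(first), hex(second))

# Offset arithmetic from the post, for the entry {4, 0x261cc, 0x5ec4}.
structure_at = URE_HEADER_SIZE + 0x5EC4        # structure listing chunk offsets
chunk_header_at = structure_at + 0x193A6       # value read at offset 0x6029
text_header_at = chunk_header_at + 4 * 0x84 + 5  # the "+ 5" is unexplained

print(hex(structure_at), hex(chunk_header_at), hex(text_header_at))
```

The last line prints 0x5ed4 0x1f27a 0x1f48f, which agrees with the offsets worked out by hand above.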
Immediately after that starts an MTEC3 (something Huffman-like) compressed text stream. I'm not quite ready to say much about the compression yet, although I have code that does the decoding given the lookup tables (I just don't understand it yet ;). This stream seems to contain only the title, in MEPS encoding, of course.
At 0x1f760 (bytes 1b c0 ad 42) starts an MTEC3 encoded text block for which I have the lookup tables and which I thus can decompress with my code. It's a piece of text from "What has happened to love?":
". However, use your good judgment for we want the householders to know we are there because we love them and want to help them—not just to place literature. If no one is at home, leave the tract out of sight. (It is illegal to put items in the mailbox.—See Our Kingdom Service, April 1976, Announcements, page 2.) A territory can be reported as worked when we cover it with tracts." (and so on, until the end of the article.)
MEPS-encoded, the beginning of this is
. (43 08) SPACE (61 fb) H (07 08) o (28 08) w (30 08) e (1e 08) v (2f 08) e (1e 08) r (2b 08) , (44 08) SPACE (61 fb) u (2e 08) s (2c 08) e (1e 08) SPACE (61 fb) y (32 08) ...
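A tiny lookup-table decoder built only from the byte pairs listed above shows how that maps back to text. It assumes each character is two bytes stored in the order shown; it is not a real MEPS table, and anything not seen in this sample comes out as "?".

```python
# Byte pairs observed in the sample above; nothing here is extrapolated.
OBSERVED_MEPS = {
    (0x43, 0x08): ".",
    (0x44, 0x08): ",",
    (0x61, 0xFB): " ",
    (0x07, 0x08): "H",
    (0x1E, 0x08): "e",
    (0x28, 0x08): "o",
    (0x2B, 0x08): "r",
    (0x2C, 0x08): "s",
    (0x2E, 0x08): "u",
    (0x2F, 0x08): "v",
    (0x30, 0x08): "w",
    (0x32, 0x08): "y",
}


def decode_meps(data):
    """Decode two-byte units using only the observed pairs; unknowns become '?'."""
    chars = []
    for i in range(0, len(data) - 1, 2):
        pair = (data[i], data[i + 1])
        chars.append(OBSERVED_MEPS.get(pair, "?"))
    return "".join(chars)


SAMPLE = bytes.fromhex(
    "4308 61fb 0708 2808 3008 1e08 2f08 1e08"
    "2b08 4408 61fb 2e08 2c08 1e08 61fb 3208"
)
print(repr(decode_meps(SAMPLE)))
```

In this sample the first byte of the lowercase letters follows alphabetical order (e 1e, o 28, r 2b, s 2c, u 2e, v 2f, w 30, y 32), but the table deliberately does not guess at the letters that were not seen.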
I have to hand it to you - you have a sense of persistence. It would be cool to know the format of the PUB files. Ultimately I think they will do away with the WTLIB CD and publish the whole darn thing online from now on. As mentioned in another thread, this will allow them to electronically publish their new light without leaving any trace of what was there before - except for those sites that may log internet history, or if curious developers want to database it for fun. :)
Go for it! When I get ready, I am going to dump all of my code and stats here on this site for others to look at and learn from.