@slii, Hi there! This looks like your one and only post. I've let this thread slide, and I missed your post. Sorry about that.
I don't think the .PUB files in the Watchtower Library are in this same format. I've lately done some reverse engineering of the 2006 version, and I now understand a bit about the format of the publication files.
Good for you! I don't have the stomach to get into those PUB files. I didn't really think the PUB files were encoded the way the mobile app encodes its data.
First of all, I have so far seen nothing to indicate that any portion of the files is encrypted. Large parts of them are compressed, though, using a compression algorithm resembling Huffman coding, which I'm starting to understand (I still don't fully understand the construction of some lookup tables used for decompression).
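Just to illustrate what I mean by "resembling Huffman", here's a generic Python sketch of table-driven prefix-code decoding. To be clear, this is not the actual .pub scheme (I don't understand its tables yet); the bit order, code table, and symbols below are invented purely to show the general shape.

```python
def bits_msb_first(data: bytes):
    """Yield the bits of `data`, most significant bit of each byte first."""
    for byte in data:
        for shift in range(7, -1, -1):
            yield (byte >> shift) & 1

def prefix_decode(data: bytes, code_table: dict) -> list:
    """Walk the bit stream and emit a symbol whenever the accumulated
    bits match an entry in `code_table` (a map of bit-tuples to symbols)."""
    out, current = [], []
    for bit in bits_msb_first(data):
        current.append(bit)
        if tuple(current) in code_table:
            out.append(code_table[tuple(current)])
            current = []
    return out

# Invented toy table: three symbols, shorter codes for more frequent ones.
toy_table = {(0,): "e", (1, 0): "t", (1, 1): " "}
print(prefix_decode(bytes([0b01011000]), toy_table))  # ['e', 't', ' ', 'e', 'e']
```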
You may be correct here.
Some pieces of the textual data (mainly titles) are in uncompressed form, yet this is not immediately obvious from inspecting the files. This is because the system internally uses a 16-bit, MEPS-specific character set, which I believe is able to represent multiple scripts but predates Unicode. Overall, I get the impression that whoever designed this knew quite well what they were doing, but there's obviously lots of arcane legacy baggage involved. As to why WTL, or at least the 2006 version, still uses MEPS-coded documents internally, I do not know; perhaps they consider it a useful obfuscation to throw at possible reverse engineers (it did make me scratch my head for a while), or maybe it's just a legacy thing that hasn't been enough of a problem to touch in code that could be in maintenance-only mode.
Agreed. At this point I think MEPS is dead. Unicode can take its place quite easily and, in fact, be a lot easier to work with. But I think there are a lot of things about the WT Lib that are arcane (more on that below).
For example, if you look at wte.lib, it contains lots of uncompressed strings (publication names, if I remember correctly). Most (all?) .pub files also contain some uncompressed strings. In English-language files, look for places where every other byte is 08 (hex); most of those will likely be strings. At least uncompressed MEPS strings tend to be stored Pascal-string style, i.e. the first 16 bits are the length of the string in bytes (so it's always an even number), and the string is not null-terminated.
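If you want to hunt for those strings programmatically, here's a rough Python sketch. I'm assuming the 16-bit length prefix is little-endian, like the character values; treat that as an assumption until verified.

```python
import struct

def find_meps_strings(data: bytes, min_chars: int = 4):
    """Heuristically scan a .pub/wte.lib blob for uncompressed MEPS strings:
    a 16-bit byte count (assumed little-endian) followed by that many bytes
    of 16-bit characters, most of which have a 0x08 high byte (Latin text)."""
    hits = []
    for offset in range(len(data) - 2):
        (length,) = struct.unpack_from("<H", data, offset)
        if length < min_chars * 2 or length % 2 or offset + 2 + length > len(data):
            continue
        body = data[offset + 2 : offset + 2 + length]
        high_bytes = body[1::2]  # second byte of each little-endian pair
        if high_bytes.count(0x08) >= 0.8 * len(high_bytes):
            hits.append((offset, body))
    return hits

# Usage sketch (file name is just an example):
# hits = find_meps_strings(open("wte.lib", "rb").read())
```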
Interesting.
Some 16-bit values seem to be mapped to some kind of control codes that most probably specify things like italics. Most of the Latin alphabet seems to be in the 08xx range. Being a typesetting-oriented coding system, it also seems to contain codes for ligatures; for example, the "ff" ligature is apparently represented as 0851.
Specifically, the English alphabet is mapped to 16-bit values, which are stored in little-endian format, as follows (numbers in hex):
0800..0819 A-Z; 081a..0833 a-z; 0834..083d numbers, "1234567890" (BTW this is the only character set I know of where 0 comes after 9, not before 1)
0841..0844 ":;.,"; 0845, 0846 left and right single quotation marks; 0847..0848 "?!"; 084b..084e "()/-"; 084f em-dash; 0850 en-dash; fb61 <SPACE>
0851 ff ligature; 0865 é (small e with acute); 08fb hyphenation point (used a lot, e.g. in the NWT, to show proper hyphenation of names).
fb57 and fb58 are often seen around dashes, as in <fb57>--<fb58>. Perhaps they prevent breaking the line between them?
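For reference, here's a minimal Python decoder built only from the codes listed above. Everything outside these ranges (control codes, other scripts, fb57/fb58) is left as a placeholder, and some of the Unicode equivalents (the ff ligature and the hyphenation point) are my best guesses.

```python
# Partial MEPS -> Unicode map, built only from the codes listed above.
MEPS_TO_UNICODE = {}
MEPS_TO_UNICODE.update({0x0800 + i: chr(ord("A") + i) for i in range(26)})  # A-Z
MEPS_TO_UNICODE.update({0x081A + i: chr(ord("a") + i) for i in range(26)})  # a-z
MEPS_TO_UNICODE.update({0x0834 + i: "1234567890"[i] for i in range(10)})    # digits, 0 last
MEPS_TO_UNICODE.update(dict(zip(range(0x0841, 0x0845), ":;.,")))
MEPS_TO_UNICODE.update({0x0845: "\u2018", 0x0846: "\u2019",   # single quotation marks
                        0x0847: "?", 0x0848: "!"})
MEPS_TO_UNICODE.update(dict(zip(range(0x084B, 0x084F), "()/-")))
MEPS_TO_UNICODE.update({0x084F: "\u2014", 0x0850: "\u2013",   # em dash, en dash
                        0x0851: "\ufb00",                     # ff ligature (guess: U+FB00)
                        0x0865: "\u00e9",                     # e with acute
                        0x08FB: "\u2027",                     # hyphenation point (guess: U+2027)
                        0xFB61: " "})                         # space

def decode_meps(raw: bytes) -> str:
    """Decode a run of little-endian 16-bit MEPS characters; unknown codes
    (control codes, other scripts, fb57/fb58) become <xxxx> placeholders."""
    out = []
    for i in range(0, len(raw) - 1, 2):
        code = raw[i] | (raw[i + 1] << 8)
        out.append(MEPS_TO_UNICODE.get(code, f"<{code:04x}>"))
    return "".join(out)

print(decode_meps(bytes.fromhex("07 08 22 08 61 fb 2d 08 21 08 1e 08 2b 08 1e 08")))  # "Hi there"
```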
I guess the overall format of the .pub files and the compression are a topic for another post.
Very interesting! You have definitely gotten into the weeds. Ultimately, however, I think there might be an easier way. To date I have successfully extracted every piece of text from the 2011 and 2012 WTLibs. I did this without looking at the PUB files and without any manual clicking. I planned on mentioning it in a future thread, along with some statistics calculated from the text.
I thought it would be cool to see if there are any meaningful differences between the 2011 and 2012 versions of the text. That is, we expect some differences: new content for the 2012 version, and probably some new entries in the publication index. But aside from that, I am just wondering if there are any textual changes they snuck in. The NWT project turned out to be difficult because there were so many changes involved, and it was meant to be that way. But if you take, say, the January 1, 1980 Watchtower as it appears in the 2011 library and in the 2012 library, you would expect no changes between those two documents.
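Checking that should be as simple as a unified diff over the extracted text. Here's a rough Python sketch; the file names and directory layout are hypothetical, just stand-ins for wherever the extracted text ends up.

```python
import difflib
from pathlib import Path

def diff_article(path_2011: Path, path_2012: Path) -> list:
    """Unified diff of the same extracted article from the two libraries."""
    old = path_2011.read_text(encoding="utf-8").splitlines()
    new = path_2012.read_text(encoding="utf-8").splitlines()
    return list(difflib.unified_diff(old, new, fromfile="WTLib 2011",
                                     tofile="WTLib 2012", lineterm=""))

# Hypothetical layout: one text file per extracted article.
diff = diff_article(Path("extracted/2011/w1980_01_01.txt"),
                    Path("extracted/2012/w1980_01_01.txt"))
print("\n".join(diff) if diff else "No textual changes.")
```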
Anyhow, the code base is the same between the 2011 and 2012 versions, and probably all the previous versions; they just add new content. I can tell because the 2011 version has a serious memory leak, and the same issue is carried over to the 2012 version.
MMM