Does anyone know whether someone is already working on this project? If not, perhaps we can brainstorm how to go about it, now that the PDF of the revised NWT is out. A few initial impressions:
- I'm not aware of a decent program for extracting text from these PDFs. The one I tried reads straight across the page, from the left text column, through the cross-reference column in the middle, to the right column, so the columns come out interleaved and the output is not usable.
- One can copy and paste the text from a PDF viewer into a rich-text editor and then save the content as RTF. The RTF will contain markup like:
\fs18 24 And so he drove the man out and posted at the east of the gar- den of E
\f1 \uc0\u56319 \u56329
\f0 den
\fs10 \up4 n
\fs18 \up0 the cherubs
\fs10 \up4 o
\fs18 \up0 and the flaming blade of a sword that was turning itself continually to guard the way to the tree of life.
This markup can be processed in three steps: remove the lines for cross-reference letters (the \fs10 \up4 runs above), rejoin words that were hyphenated at line breaks (like "gar- den"), and replace the special character sequences that stand for quotes and apostrophes with the actual characters.
- Many changes will be repetitive substitutions, such as "has declared" becoming "said" or "proceeded to assault" becoming "assaulted". We'll probably want a way to group changes so that we can sift the more interesting ones from the chaff, since the majority of changes will simply reflect simplified grammar.
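To make the grouping idea concrete, here is a rough sketch in Python using only the standard library. It assumes we already have the old and new text of each verse as plain strings; the function names `substitutions` and `tally` and the example sentences are made up for illustration and are not actual verse text. It does a word-level diff with `difflib.SequenceMatcher` and counts how often each (old phrase, new phrase) pair occurs, so that frequent pairs can be set aside as routine and rare ones reviewed by hand.

```python
from collections import Counter
from difflib import SequenceMatcher


def substitutions(old, new):
    """Yield (old_phrase, new_phrase) for every span where the word lists differ."""
    a, b = old.split(), new.split()
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            # For pure insertions or deletions one side is the empty string.
            yield " ".join(a[i1:i2]), " ".join(b[j1:j2])


def tally(verse_pairs):
    """Count how often each substitution occurs across (old, new) verse pairs."""
    counts = Counter()
    for old, new in verse_pairs:
        counts.update(substitutions(old, new))
    return counts


# Invented sentences, only to exercise the code.
pairs = [
    ("he proceeded to assault the city", "he assaulted the city"),
    ("they proceeded to assault the camp", "they assaulted the camp"),
    ("the king has declared peace", "the king said peace"),
]
for (old, new), n in tally(pairs).most_common():
    print(n, repr(old), "->", repr(new))
```

Sorting by count puts the repetitive substitutions at the top; anything that occurs only once or twice is a candidate for the "interesting" pile. A real run would probably also need to normalize punctuation and capitalization before diffing.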
Before I try this, I wanted to see whether anyone has better ideas or experience with this sort of thing.
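For reference, the RTF cleanup described above could start out something like the following Python sketch. It rests on several assumptions drawn only from the sample: that cross-reference letters always sit on their own line starting with `\fs10 \up4`, that each such letter can be replaced by a single space, and that a hyphen followed by a space between lowercase letters is always a line-break hyphen. The function name `clean_rtf` and the `SPECIAL` mapping are hypothetical; the mapping would have to be filled in by hand once we know which character each sequence stands for.

```python
import re

# Assumption: cross-reference letters are the lines set in the small raised font.
XREF_LINE = re.compile(r"^\\fs10 \\up4 ")
# An RTF control word: backslash, letters, optional number, optional delimiter space.
CONTROL_WORD = re.compile(r"\\[a-zA-Z]+-?\d* ?")
# A word split at a line break in the printed column, e.g. "gar- den".
LINE_HYPHEN = re.compile(r"(?<=[a-z])- (?=[a-z])")

# Hypothetical mapping from special sequences to replacement text.
# Here the sequence inside "E...den" is simply dropped.
SPECIAL = {r"\uc0\u56319 \u56329": ""}


def clean_rtf(rtf, special):
    pieces = []
    for line in rtf.splitlines():
        if XREF_LINE.match(line):
            pieces.append(" ")  # the letter sat between two words
            continue
        for sequence, replacement in special.items():
            line = line.replace(sequence, replacement)
        # Any sequence not listed in `special` is silently dropped here.
        pieces.append(CONTROL_WORD.sub("", line))
    text = "".join(pieces)  # newlines in the RTF source are not significant
    text = re.sub(r"\s+", " ", text).strip()
    text = re.sub(r" ([,.;:?!])", r"\1", text)  # space left where a letter preceded punctuation
    return LINE_HYPHEN.sub("", text)


SAMPLE = "\n".join([
    r"\fs18 24 And so he drove the man out and posted at the east of the gar- den of E",
    r"\f1 \uc0\u56319 \u56329",
    r"\f0 den",
    r"\fs10 \up4 n",
    r"\fs18 \up0 the cherubs",
    r"\fs10 \up4 o",
    r"\fs18 \up0 and the flaming blade of a sword that was turning itself continually to guard the way to the tree of life.",
])
print(clean_rtf(SAMPLE, SPECIAL))
```

The hyphen rule is the weakest part: a genuinely hyphenated compound that happens to break at the end of a line would be joined incorrectly, so those cases would probably need a word list or a manual pass.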