looking for fun bilingual texts to use as tests of my open-source OCR software

I’ve posted previously about my goal of coming up with innovative ways of presenting Greek texts for people who aren’t fluent Greek readers:

http://discourse.textkit.com/t/open-source-machine-readable-translating-dictionaries/18385/1

As described in that thread, this led me to look for machine-readable data giving statistical correlations between Greek words and the English words that would tend to occur in an English translation. For this purpose, I located a really nice resource: a version of the Odyssey by someone named Giles, in which he twisted the Greek syntax around to be more like English and then provided English words interspersed with the Greek:

http://discourse.textkit.com/t/describing-the-style-of-this-greek-font/18439/1

That led me to want to OCR the book, which it turns out was not really possible with the most widely used open-source OCR software, called Tesseract. Although Tesseract does theoretically support multilingual texts, in reality it doesn’t seem to do this well enough to be useful. Since I’m recently retired and have lots of free time now, I decided to write my own OCR software that would be better adapted to this sort of thing, using a qualitatively different design for the technical innards. This software, which I call Dorcas, is now working pretty well:

https://github.com/bcrowell/dorcas

“Working pretty well” means that it’s a pretty decent pre-alpha version, i.e., it’s probably not yet practical for other people to download it, read my documentation, and give it a test drive, but it does produce decent output. Here’s a sample from Giles:

(Giles uses a weird accentuation system in which he includes breathing marks and those little iota dinguses underneath letters, but he omits all other accents.) Here’s the OCR output from my software:

===============================
appearance ξεινω Μεντη to thoetranger Mentes
ἡγητορι leader Υαφιων of the Taphians Εὑρε
δ αρα und she found then μνηστηρας αγηνορας
the haughty suitors οἱ μευ they επειτα then
ετερπον were delighting θυμον their mind πεσ
σοις with draughts προπαροιθεν in front θυραων
of the doors ἡμενοι sitting ευῥινοισιν un skins
βοων of bulls οὑς which αυτοι themselves
εκτανον slew Κηρυκες δε but heralds και and
οτοηροι θεραποντες busy attendants αυτοισιν
upon them οἱ μευ ορα some indeed εμισγον
were mixing ρινον wine και ὑδωο und water ενι
κρητηρσιν in bowls ο δε but others αυτε again
νιζον were wiping τοαπεζας thotablezo πογγοισι
πολυτοητοισι with their porons sponges και
and προτιθεντο were putting them forwards
και and δατευντο were dealing forth πολλα
κρεα many meats

So you can see that it’s certainly not perfect, but is for the most part reasonably accurate and intelligible. The quality of the OCR for the Greek is actually quite a bit better than for the English.

I’ve spent a long time using this particular book by Giles, and its fonts, as a test case while developing the software. At this point, I think it would be a really good exercise for me to go through the whole process with a new text that uses a different font. This would help me to test the software, and get it to a state where it would be more likely to be usable by other people.

Does anyone have a bilingual text that I could use for this purpose? Ideally this would be something fun to work with for its own sake, because you and I would not be doing any of this unless it was fun. Maybe there’s some old multilingual book that you really like, but it was always inconvenient to use because it was only page scans on archive.org. It doesn’t have to be Greek and English interspersed on the same line, as in Giles. If you have something that’s an interlinear translation, that could be fine. However, my software does require that it have regular, fairly even line spacing, all in a consistent font size. Ideally this would be something that has no OCR’d version available, so that there would actually be some benefit to OCRing it. If there’s a proprietary version of an old bilingual book, and you’d like to see it scanned and made freely available without copyright restrictions, that could be a lot of fun.

I can’t (and couldn’t) address the technological issues you raise, but the above is pretty horrible. He twisted the Greek syntax to be more like English and then essentially provided an interlinear? No, that’s a terrible way to learn the language, and it should be avoided at all costs. If that’s what you want to do, please stop.

If you go back and look again at what I posted, I think you’ll see that I explained the reason for my interest in the text, and it isn’t because I intend to read Homer in that format.

My purpose in starting this thread was also not to ask for technical help. It was to ask people if they knew of multilingual texts that would be fun to work with using the OCR software I’ve written, in order to test the software.

Contra Barry, the word order there appears to me to be mostly changed to a more natural prose order. Nothing really bad that I see (in that section).

There is a Russian guy who has explored that interspaced interlinear style for modern language learning, and some people swear by it. The red and green text guy.

For testing your OCR (which looks good, though Tesseract is quite adaptable for dual-language work), why not try old Loeb scans? Combine the facing pages into one image, and your software should treat it like a mixed text. That would have more normal accents at least.
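If you wanted to script the page-joining, something like this would do it (an untested Pillow sketch; the filenames are made up):

```python
# Untested sketch: paste facing Loeb pages side by side into one image.
# Requires Pillow; the filenames are hypothetical.
from PIL import Image

left = Image.open("loeb_greek_page.png").convert("L")     # Greek page
right = Image.open("loeb_english_page.png").convert("L")  # facing translation

h = max(left.height, right.height)
spread = Image.new("L", (left.width + right.width, h), color=255)
spread.paste(left, (0, 0))
spread.paste(right, (left.width, 0))
spread.save("loeb_spread.png")
```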

Thanks, joel, for the suggestions and feedback.

I did initially try to use Tesseract for this purpose, but it really didn’t work well enough, at least in the case of the book I was working on. I spent about five days, on and off, trying to coax it into giving acceptable results: training it, doing everything suggested in the relevant docs, and giving it customized dictionaries. I posted about my efforts on the Tesseract group last month, and although I did get some helpful suggestions, it never worked at even a minimal level. Basically it could never reliably tell what was Greek and what was English, so roughly half the Greek words were interpreted as English gibberish instead. It was only after that failed effort that I decided it might be worth the time to write my own software.

The Russian green and red guy sounds interesting, although my interest would be more in stealing ideas for innovative methods of presentation. I tried googling but didn’t come up with any search terms that pointed me to him. Can you recall anything more? Was he doing English-Russian bilingual texts? Any recollection of a specific text that he used, whether he was at a certain university, anything like that?

It wouldn’t be hard to come up with artificial test files (your Loeb suggestion sounds reasonable), but I would prefer to work on a text that might actually be useful and/or fun.

Ilya Frank - http://english.franklang.ru/index.php

I played with Tesseract for a bit a few years back with a view to doing something similar to what you are trying now. I was using the gImageReader GUI as a front end to it. I was looking at the bilingual problem too. It was clear that training was going to be required. I was interested in scanning stuff like North and Hillard, the Greek composition book. Grammars would also be a good thing to start with because they have horrid sections of intermixed Greek, Latin, sometimes Hebrew, transliteration of one kind or another, and English. If you can get all that going properly, I know there is interest.

I also spoke to ABBYY about this issue - but there isn’t a whole lot of incentive for them to get into the dead language arena it seems.

As to the unaccented texts, you’ll probably want to train on both those and fully accented ones. And uncials, of course, all in prepared Unicode rather than handwritten. So a bunch of those texts as well. Depending upon how capable a final product you want, you’ll need a bunch of training data, or multiple sets. Any dictionaries for inferring the right word will probably need to handle spelling errors, if you want to keep them, as well as dialects. Not trying to boil the ocean here, just some thoughts I recall from when I looked at it last.

Thx
D

Nice. I would be interested in the innards, but regardless, that’s getting nice results, though a proofreading follow-up and correction pass would still be required. I assume you plan to fix the iota-subscript handling and expand to handling accents? I would definitely pick up a text that has fully accented text to try. You can produce your own by printing a section of the GNT to PDF or JPG in various typefaces.

Something of an aside, but it occurs to me to ask simply out of curiosity: how affected by scan quality is your algorithm?

Thx
D

Thanks :slight_smile: I have a brief overview of the technical methods near the bottom of the README file, which is what is displayed when you visit the github page.

Yes, that’s one of the things I’m looking for in a text to work with next.

I would prefer to work with a document for which the OCR output would actually be of use to someone. The software would probably do quite well on artificially generated computer-typeset text, but that would be kind of pointless.

So far the best candidate I’ve found is this: Souter, A pocket lexicon to the Greek New Testament, https://archive.org/details/pocketlexicontog0000sout/page/86/mode/2up . It has a mixture of Greek and English on the same line, and I imagine that an OCR’d result might be of some use to other people. The only disadvantage is that it has Greek in several different font sizes, which makes it a harder problem for my software.

It’s working pretty well on the Giles book, which, although a high-resolution scan, is actually pretty messy and difficult for software to work with. There is broken type, characters where the metal didn’t quite make full contact with the page, smudges, and so on. For example, on one of the early pages that I was using for testing there was a word that objectively looks like “wcrc,” which was actually “were” with the cross-bars on the e’s not present. The software makes at least some kind of attempt to fix such things, and will often succeed if there are only one or two flaws in the word and the word is in its dictionary.
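To illustrate the flavor of that dictionary-based fixing (this is not Dorcas’s actual code, just a toy of the general idea):

```python
# Toy of dictionary-based fixing: accept the closest dictionary word if it
# is at most a couple of edits away from the OCR's guess.
def edit_distance(a, b):
    # Standard dynamic-programming Levenshtein distance.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # delete from a
                           cur[j - 1] + 1,              # insert into a
                           prev[j - 1] + (ca != cb)))   # substitute
        prev = cur
    return prev[-1]

def correct(word, dictionary, max_flaws=2):
    best = min(dictionary, key=lambda w: edit_distance(word, w))
    return best if edit_distance(word, best) <= max_flaws else word

# "wcrc" is "were" with the cross-bars of the e's missing:
print(correct("wcrc", ["the", "of", "and", "were"]))  # -> were
# A real corrector would also weight by which letters look alike (c vs. e).
```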

Interesting to know that others have tried to do this kind of thing. Looking at North and Hillard, https://archive.org/details/greekprosecompos0000nort/page/148/mode/2up , I think that would be extremely difficult. They have things like curly braces with a list of options presented vertically inside (p. 148). The English-language pages would of course be doable with Tesseract.

What I’m doing is fundamentally unlike how Tesseract works, in that there is no notion of training on a bunch of stuff and then hoping that the software will be competent to handle anything at all that you throw at it. You just show Dorcas the characters of one or two very specific fonts (e.g., the Latin and Greek fonts used in Giles), at a certain size, and that’s it. That’s all it knows and all it needs to know. So the good news is that you don’t have to send it to graduate school and hope it comes back with a broad education, but the bad news is that it has to be taught the fonts for each book it’s going to OCR. This is why I’m not so interested in just making up artificial examples as tests for the software. Training Tesseract on such a text would produce some benefit that would carry over to other texts, but using Dorcas on it would produce no benefit whatsoever aside from the testing.
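To make the contrast concrete, here is a toy of the same general family of approach, per-book template matching via normalized cross-correlation. To be clear, this is only an illustration of the idea, not Dorcas’s actual algorithm, and the filenames are invented:

```python
# Toy illustration of per-book template matching, NOT Dorcas's actual
# algorithm (that's described in its README). Requires OpenCV and numpy;
# the filenames are invented.
import cv2
import numpy as np

page = cv2.imread("page.png", cv2.IMREAD_GRAYSCALE)

# One small image per glyph of this book's specific font, e.g. clipped by
# hand from a clean page; the keys are the characters they stand for.
glyph_files = {"α": "alpha.png", "β": "beta.png", "e": "e.png", "t": "t.png"}
templates = {ch: cv2.imread("glyphs/" + f, cv2.IMREAD_GRAYSCALE)
             for ch, f in glyph_files.items()}

hits = []
for ch, tmpl in templates.items():
    # Normalized cross-correlation: peaks mark likely occurrences.
    scores = cv2.matchTemplate(page, tmpl, cv2.TM_CCOEFF_NORMED)
    ys, xs = np.where(scores > 0.8)
    hits += [(int(x), int(y), ch) for x, y in zip(xs, ys)]

# Crude reading order: bucket y into 20-pixel "lines", then sort by x.
# Real code would find the lines properly and suppress overlapping hits.
hits.sort(key=lambda h: (h[1] // 20, h[0]))
print(hits[:10])
```

The point is that all the “knowledge” lives in the glyph images clipped from this one book, not in a trained model.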

I see, thanks. That’s actually very cool. I find it a little mentally distracting to see the two languages interspersed like that, but I can see how it could work well for a total beginner at a given language.

When I was a physics grad student, a friend who was a history grad student told me about his adviser who would learn each new language by the following method. He would get a copy of the bible in that language (because the bible is available in almost every language) and just start trying to read it, starting from Genesis 1. By the time he got to Revelation, he either knew the language pretty well or, if not, then he would start over and read the bible a second time.

My main idea that I want to try carrying out with Homer and/or the gospels is a presentation in which the Greek is on the left page, and then on the right you have only the translations of the most uncommon words, laid out like a ransom note so that each translation is in a similar place on the page. For instance, if you have the uncommon word βλαστάνω on the third line of Greek, near the left side of the page, then on the mostly-blank right-hand page you would have “bud, sprout,” lying at the same horizontal and vertical position so that your eye can easily find it. But if there’s some common word like καὶ, then no translation is provided.
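As a toy sketch of that layout idea (the common-word list, the glosses, and the sample lines below are all made-up stand-ins, and real code would lemmatize inflected forms before looking them up):

```python
# Toy sketch of the "ransom note" facing page: put a gloss for each
# uncommon Greek word at the same line and column on a mostly blank page.
COMMON = {"καὶ", "ὁ", "δὲ", "τε"}                      # too common to gloss
GLOSS = {"βλαστάνω": "bud, sprout", "μῆνις": "wrath"}  # hypothetical lexicon

greek_lines = ["μῆνις ἄειδε θεὰ καὶ ὁ δὲ",
               "καὶ βλαστάνω τε"]

facing = []
for line in greek_lines:
    out = [" "] * (len(line) + 20)   # room for a gloss near the right edge
    col = 0
    for word in line.split():
        start = line.find(word, col)
        col = start + len(word)
        w = word.strip(".,·;")
        if w in GLOSS and w not in COMMON:
            gloss = GLOSS[w]
            # Same horizontal position as the Greek word it translates.
            out[start:start + len(gloss)] = list(gloss)
    facing.append("".join(out).rstrip())

for left, right in zip(greek_lines, facing):
    print(f"{left:<30}| {right}")
```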

The whole long detour into Giles and OCR was just motivated by my desire to build up some infrastructure for that project.

I assume that Frank can get away with doing stuff like The Little Prince because he’s in Russia. Much as I disapprove of Putin as part of the menacing resurfacing of autocracy, it’s great that Russia has effectively opted out of the awful modern copyright regime, which has gotten way out of control and become a deeply antisocial monster. When I was teaching, I would make sure my students knew about Library Genesis, an awesome Russian web site that has pretty much every science and math textbook for free.

Not wrong there, but useful. I was mostly interested in the exercises personally. But the Accordance guys have remarked that the hardest modules to create are those from grammars. A lot of variation in layout. Tables are a pain too. BTW, I assume you are sticking with LTR for now, or could you do Hebrew intermixed? I don’t have a specific use case in mind, though perhaps John Wevers’s notes on the Septuagint, or Rahlfs, though some of his material is already available; there are PDFs of, say, his notes on Psalms: https://archive.org/details/PsalmiCumOdis/page/n225/mode/2up. This is pretty horrible too because of lots of symbols and multiple fonts.

Ah I see now. Ok, so then you’ll want a bunch of samples of texts in different fonts. And potentially you’ll want accented and unaccented variants. But you will then have the problem of identifying the font for the text. I guess some kind of initial test scan and measure would work.

Thx
D

One of my goals was to handle RTL properly, and there is code in Dorcas (totally untested code) to handle things like recognizing that a word composed of Hebrew letters must be RTL and trying to handle it correctly as part of a bidi text. It would indeed be fun to try a text with Hebrew in it. But yeah, Dorcas is a specialized tool and very unsuited by its design to handling anything like that page from Rahlfs (horror!). To give you an idea of how unadaptable it is to changes of font or font size, I’m now OCRing the Giles book, and at the top of every two-page spread is a header that says HOMER’s – ODYSSEY, I. This is in a font about 20% larger than the font used for the main text, and therefore Dorcas renders it as total gibberish.
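Just to show the flavor of the detection step, here’s a toy version (not the untested code that’s actually in Dorcas, and the real bidi algorithm, UAX #9, has many more rules):

```python
# Toy of the detection step only: classify a word as RTL if its letters
# fall in the Hebrew Unicode block, then reverse RTL runs for display.
import unicodedata

def is_hebrew_word(word):
    letters = [c for c in word if unicodedata.category(c).startswith("L")]
    # Hebrew letters live in the U+0590-U+05FF block; vowel points are
    # combining marks and are ignored here.
    return bool(letters) and all("\u0590" <= c <= "\u05FF" for c in letters)

def display_order(words):
    # Words arrive in logical order; reverse each maximal run of Hebrew
    # words for left-to-right display. (UAX #9 has many more rules.)
    out, run = [], []
    for w in words:
        if is_hebrew_word(w):
            run.append(w)
        else:
            out.extend(reversed(run))
            run = []
            out.append(w)
    out.extend(reversed(run))
    return out

print(display_order(["In", "the", "beginning:", "בְּרֵאשִׁית", "בָּרָא"]))
```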


Not even that, really. The idea is not to identify the font family, like Times or Porson, once and for all and hope it carries over to other books. The software is designed on the expectation that you’re going to teach it the specific hot-metal font for a particular book, scan that book, and then throw away the font. At best a particular font might serve as a starting point for learning a similar font in another book, e.g., one printer’s Porson-style Greek font might look similar to another’s. (I guess it is possible that you could come across another book that had been typeset from exactly the same font, but with these old books that come to us from across all that time and space, I imagine it would take some luck.)

In case anyone is interested, I’ve completed a semi-decent OCR of the Giles book using Dorcas: https://github.com/bcrowell/giles If you scroll around through the different files, you’ll see that the quality of the OCR varies quite a bit. On some pages it’s quite good, while on others it’s poor or completely unreadable. In some cases this seems to be because of the poor condition of certain pages of the books, which were probably something like 140 years old by the time archive.org borrowed them from UCLA and scanned them. I think the quality of the OCR is not good enough that anyone would sit and read the thing in its present state, but it is good enough for extracting statistical information or for doing computer searches to see how a given Greek word might frequently be rendered in an English translation.
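For what it’s worth, here’s a sketch of how that kind of extraction could work on the interleaved output, assuming the simple Greek-word-then-gloss pattern of the Giles text and ignoring OCR errors (the sample string is a stand-in for a page of output):

```python
# Sketch of mining the interleaved output for Greek -> English statistics.
import re
from collections import Counter

def is_greek(word):
    # Greek and Greek Extended Unicode blocks.
    return bool(re.search(r"[\u0370-\u03FF\u1F00-\u1FFF]", word))

text = "ἡγητορι leader Υαφιων of the Taphians"
words = text.split()

pairs = Counter()
i = 0
while i < len(words):
    if is_greek(words[i]):
        greek = words[i]
        i += 1
        gloss = []
        # Collect the run of English words that follows the Greek word.
        while i < len(words) and not is_greek(words[i]):
            gloss.append(words[i])
            i += 1
        if gloss:
            pairs[(greek, " ".join(gloss))] += 1
    else:
        i += 1

for (g, e), n in pairs.most_common():
    print(f"{g} -> {e} ({n}x)")
```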

Probably not to your purpose, but definitely qualifying as fun bilingual texts, and very interesting in themselves, are the “Hermeneumata pseudodositheana” (!), discussed in a slide lecture by Eleanor Dickey (mentioned on the Learning Latin board by seneca2008):
https://www.youtube.com/watch?v=0909NqMwxXI
cf. https://bmcr.brynmawr.edu/2013/2013.08.34/
and brilliantly investigated by Carlotta Dionisotti (daughter of Carlo D.) in her pioneering “From Ausonius’ schooldays” of 1982.
Greek-Latin (not English).
Enjoy!

One thing to check, if you’re still looking for such things, is Evan Milner’s Latinium list of interlinears and construed texts. I also seem to have ended up compiling a list myself, including both (word-by-word/“concordant”) interlinears and Giles-like “construed texts”. I don’t know when or if it will be reasonably complete and presentable, but if you’re still looking for texts I might be able to suggest some things. (If you’re willing to buy used books or travel to libraries then there are even more options I could suggest.)

Of course, once there are good scans of every printed classics interlinear then they all have to be transcribed! :smiley: So I’m absolutely interested to hear about your progress in OCR with Dorcas. I have no OCR or AI skills but maybe this is of interest, if you haven’t seen it already: a group getting decent results with Tesseract and Kraken on 19th-century commentaries on Ancient Greek texts.