I’ve posted previously about my goal of coming up with innovative ways of presenting Greek texts for people who aren’t fluent Greek readers:
http://discourse.textkit.com/t/open-source-machine-readable-translating-dictionaries/18385/1
As described in that thread, this led me to look for machine-readable data giving statistical correlations between Greek words and the English words that tend to occur in an English translation. I located a really nice resource for this purpose: a version of the Odyssey by someone named Giles, in which he rearranged the Greek syntax to be more like English and then interspersed English words with the Greek:
http://discourse.textkit.com/t/describing-the-style-of-this-greek-font/18439/1
That led me to want to OCR the book, which turned out not to be really possible with Tesseract, the most widely used open-source OCR software. Although Tesseract does theoretically support multilingual texts, in reality it doesn’t seem to do this well enough to be useful. Since I’m recently retired and have lots of free time now, I decided to write my own OCR software that would be better adapted to this sort of thing, using a qualitatively different design for the technical innards. This software, which I call Dorcas, is now working pretty well:
https://github.com/bcrowell/dorcas
“Working pretty well” means that it’s a pretty decent pre-alpha version, i.e., it’s probably not yet practical for other people to download it, read my documentation, and give it a test drive, but it does produce decent output. Here’s a sample from Giles:
(Giles uses a weird accentuation system in which he includes breathing marks and those little iota dinguses underneath letters, but he omits all other accents.) Here’s the OCR output from my software:
===============================
appearance ξεινω Μεντη to thoetranger Mentes
ἡγητορι leader Υαφιων of the Taphians Εὑρε
δ αρα und she found then μνηστηρας αγηνορας
the haughty suitors οἱ μευ they επειτα then
ετερπον were delighting θυμον their mind πεσ
σοις with draughts προπαροιθεν in front θυραων
of the doors ἡμενοι sitting ευῥινοισιν un skins
βοων of bulls οὑς which αυτοι themselves
εκτανον slew Κηρυκες δε but heralds και and
οτοηροι θεραποντες busy attendants αυτοισιν
upon them οἱ μευ ορα some indeed εμισγον
were mixing ρινον wine και ὑδωο und water ενι
κρητηρσιν in bowls ο δε but others αυτε again
νιζον were wiping τοαπεζας thotablezo πογγοισι
πολυτοητοισι with their porons sponges και
and προτιθεντο were putting them forwards
και and δατευντο were dealing forth πολλα
κρεα many meats
So you can see that it’s certainly not perfect, but is for the most part reasonably accurate and intelligible. The quality of the OCR for the Greek is actually quite a bit better than for the English.
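Because Giles puts each Greek word or phrase directly before its English gloss, output like this could in principle be turned into Greek/English pairs just by looking at which script each word is written in. Here is a minimal Python sketch of that idea. It is not part of Dorcas, and the function names `is_greek` and `gloss_pairs` are made up for illustration. It assumes the OCR output is plain Unicode text and that a Greek run is always followed by its own gloss:

```python
import unicodedata
from itertools import groupby


def is_greek(word):
    """Treat a word as Greek if any of its letters has a Greek Unicode name."""
    return any("GREEK" in unicodedata.name(ch, "") for ch in word if ch.isalpha())


def gloss_pairs(text):
    """Pair each run of Greek words with the run of non-Greek words after it."""
    # Collapse the word stream into alternating runs: (is_greek, "joined words").
    runs = [(greek, " ".join(words))
            for greek, words in groupby(text.split(), key=is_greek)]
    pairs = []
    for (first_is_greek, first), (second_is_greek, second) in zip(runs, runs[1:]):
        if first_is_greek and not second_is_greek:
            pairs.append((first, second))
    return pairs


sample = "appearance ξεινω Μεντη to thoetranger Mentes ἡγητορι leader"
for greek, english in gloss_pairs(sample):
    print(greek, "=>", english)
```

This ignores words that are split across line breaks (like πεσ / σοις above), and it passes OCR errors such as “thoetranger” straight through, so the pairs would still need cleaning up before being used for any statistics.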
I’ve spent a long time using this particular book by Giles, and its fonts, as a test case while developing the software. At this point, I think it would be a really good exercise for me to go through the whole process with a new text that uses a different font. This would help me to test the software, and get it to a state where it would be more likely to be usable by other people.
Does anyone have a bilingual text that I could use for this purpose? Ideally this would be something fun to work with for its own sake, because you and I would not be doing any of this unless it were fun. Maybe there’s some old multilingual book that you really like, but it was always inconvenient to use because it was only page scans on archive.org. It doesn’t have to be Greek and English interspersed on the same line, as in Giles. If you have something that’s an interlinear translation, that could be fine. However, my software does require regular, fairly even line spacing and a consistent font size throughout. Ideally this would be something that has no OCR’d version available, so that there would actually be some benefit to OCRing it. If there’s a proprietary version of an old bilingual book, and you’d like to see it scanned and made freely available without copyright restrictions, that could be a lot of fun.