Looking for Classical Greek OCR Software

LSorenson · March 20, 2009, 2:56am

Hello all,

I am looking for the best OCR software for digitizing scans of books set in ancient Greek fonts (which contain accents, breathing marks, diaeresis, etc.). Does anyone have experience with a particular OCR engine? If so, how much “training” of the software was required, and were certain printed publications (i.e. specific fonts) harder to digitize correctly than others?

Louis Sorenson

P.S. Perhaps there should be a Textkit “Technical forum” for issues relating to digital Greek/Latin on the computer, Unicode, fonts, etc. Technology links and software questions really do not have an appropriate forum… I had no idea which forum to post the above question in.

edonnelly · March 20, 2009, 12:37pm

I’ve heard of, but never tried, Anagnostis:

http://www.ideatech-online.com/products.html

It claims full support for polytonic Greek. It looks like they have a demo version that you can try (but not save results).

Last time I looked into this, I discovered some people were using ABBYY FineReader:

http://www.abbyy.com/

It could handle some accents/breathing but not all by default. It could be trained to learn the others, though, and I’m sure you can search and find how successful others have been with this. I’m not sure if they’ve made any progress in that direction, but they also have a trial version which I tried once, but I can’t remember its limitations.

LSorenson · March 22, 2009, 10:45pm

Well, I tried the download of Anagnostis. My McAfee SiteAdvisor warned me that the site was unsafe. The program downloaded fine, but it seems that it must have a different codebase/regional language setting. Many of the dialogs were in gibberish. I switched the Windows regional language to Greek, and then some of the dialog windows were in Modern Greek, but some were still gibberish. I gave up in the end.

I own an old copy of OmniPage Pro, and wonder if version 16 could solve the problem; it has no known support for Greek. I’m trying to get the 1922 Abbott-Smith “A Manual Greek Lexicon of the New Testament” converted into a digital Creative Commons non-commercial XML format so it can be served up on the web. So the book I am converting/scanning is a single text, and the software could be trained, I suppose.

There is also the ReadIris software (http://www.irislink.com/c2-1532-17/Readiris-12-Pro---OCR-Software.aspx). The Pro version supports modern Greek; there is also a Middle-East version which supports Hebrew and Arabic script (Syriac?), but I do not know if ancient Hebrew vowel points are supported. Most OCR software does not offer trials, it seems.

Louis

grdSavant · March 23, 2009, 7:41am

I downloaded Anagnostis from site: http://anagnostis.ideatech-s-a.qarchive.org/ and I got no warnings or errors. Before I installed it I tested it with my Anti-virus software and it seems okay.

I installed it, and ran it and everything is in English, so it seems alright, ready to go.

But, I have no scanner. Anything that could even almost scan polytonic Greek would be great.

So, could someone with a scanner give this thing a test drive and let us know if it decodes the scan?

Thanks – jerry

edonnelly · March 23, 2009, 11:35am

If you could live with public domain instead of Creative Commons, then you might want to have a look at Distributed Proofreaders, or maybe even better, Distributed Proofreaders Europe (apparently better at Unicode and different languages). Anyway, those are the official sites for generating content for Project Gutenberg. You would have a team of volunteers to help you. Also, you should note that the actual digitization and OCR process is really the easiest part. The difficult part is that if you really want to create something of great value, then each page (really each letter) needs to be carefully checked against the original. Anyway, these sites have produced over 15,000 books this way, so you would have a vast body of experience to benefit from. It’s easy to think you can take on a task like this alone, or with a small band of volunteers, but it is a huge amount of very tedious work, and your enthusiasm may diminish over time. (Anyone know how that Latin-Latin dictionary is going?)

grdSavant · March 23, 2009, 12:12pm

Well, I found a .bmp image and tried Anagnostis (free trial, can’t save file after recognition). It did a tolerably (?) decent job:
Without any training it was recognized as:
Which isn’t too bad.

LSorenson · March 24, 2009, 1:21am

Thanks for the links.

Well, I downloaded Anagnostis from the given link a second time. This time the program loaded fine and had no gibberish. I was not impressed with the conversion; it was virtually worthless for Abbott-Smith’s lexicon. I have uploaded a file with some of the images for Abbott-Smith and a selection of about 50 words (with their corresponding images). A pdf of the 1922 lexicon is at Google Books at http://books.google.com/books?id=E-kUAAAAYAAJ&printsec=frontcover&dq=abbott-smith&ei=QzHISZ6oFIToM-vmnPAK. Abbott-Smith’s lexicon also has Hebrew, but not in such quantity that an OCR engine would be needed for it.

The document containing the test words is 8.4 MB and can be found at http://www.letsreadgreek.org/resources/. The file is named AbbottSmith_TestWords.doc.

Louis

grdSavant · March 24, 2009, 7:17am

From what I see in your file, AbbottSmith_TestWords.doc, I don’t understand what you are talking about, Louis. You seem to have had nearly 100% success with the conversion. Apparently your file has been extensively edited, already, because it shows few errors. If you are showing raw results, it’s amazing.

LSorenson · March 24, 2009, 12:31pm

The Word doc is edited: all the Greek and Hebrew was typed by hand. The OCR engine used by Microsoft Document Imaging brought in no Greek or Hebrew.

I’m going to put extracts of pages from the original Abbott-Smith pdf online, along with some scans from my reprint, and see if anyone who has any of the various OCR software packages can get one of them to work better than the other.

Louis

petersig · March 25, 2009, 1:18pm

I, too, am looking for an OCR program for Ancient Greek. Today I’ve been playing around with the crippleware version of Anagnostis, the most expensive of the alternatives. After a number of false starts, I did – I thought – get the program to recognize the first paragraph of Obadiah (AVDIOU). Since “crippleware” means that I can’t save the output, or the intermediary files, I used another OCR program that is free, and pretty neat, to grab the graphic output of Anagnostis, and save it to the Windows clipboard.

The program Anagnostis seems to allow fairly fine control, and reports 100 % knowledge of what is represented by most of the letters at which it “looks.” Analyzing the graphic, though, some parts of the material don’t come out at all well. I thought I had trained the program to make close distinctions between various accents with underlying letters.

That part I’ll have to rethink, because it’s possible that the program separates the accent from the underlying letter and carries them separately. The way I did it, the ultimate result was that many vowels had several accents.
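The symptom described here, vowels ending up with several accents, is something that can be checked for mechanically once the OCR text is in Unicode. What follows is only a minimal Python sketch and has nothing to do with how Anagnostis works internally: it decomposes the text (NFD) so that each letter is followed by its combining marks, then reports any letter carrying more than one of grave, acute, or circumflex. The function names are hypothetical, and the choice of which marks count as "accents" is an assumption.

```python
import unicodedata

# ASSUMPTION: only these three combining marks are treated as accents:
# grave (U+0300), acute (U+0301), Greek perispomeni/circumflex (U+0342).
ACCENTS = {"\u0300", "\u0301", "\u0342"}


def split_clusters(text):
    """Split text into (base_character, [combining_marks]) pairs after NFD."""
    clusters = []
    for ch in unicodedata.normalize("NFD", text):
        if unicodedata.combining(ch) and clusters:
            clusters[-1][1].append(ch)   # attach the mark to the preceding base
        else:
            clusters.append((ch, []))    # start a new base character
    return clusters


def letters_with_multiple_accents(text):
    """Return (position, base_letter, mark_names) for letters with 2+ accents."""
    return [
        (i, base, [unicodedata.name(m) for m in marks])
        for i, (base, marks) in enumerate(split_clusters(text))
        if sum(m in ACCENTS for m in marks) > 1
    ]
```

A breathing plus one accent (as in ἄ) is legitimate and is not reported; only a letter with two or more accents is.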

Tricks in working with the Demo version, that I’ve been calling “crippleware:”

I had to use the WIA program for acquisition, since my combination of software and hardware yielded a fatal error when using TWAIN acquisition.
The help section is very nicely written. BUT, there is no information on “what buttons to push.” That, it turns out, is because the “buttons” are in little popups.

The output looks nice, even if it’s incorrect. That is, the program allows you to specify “Ancient Greek,” and you may use Unicode fonts. A Unicode font that contains Greek, such as Palatino Linotype, looks good as output.

Now that I have some sense of what Anagnostis can do, I will try out some other OCR software, since I really need this stuff. I need to render the variants from the apparatus to a critical edition of the Greek biblical text. Variants are extremely complex in terms of the fonts and symbols they use, and I doubt that any OCR software can deal with them, but I’ll try.

And I’ll keep reporting here – for lack of any other forum that’s turned up, with interest in Ancient Greek and OCR.

All the best,
S Peterson

LSorenson · March 25, 2009, 11:06pm

I downloaded and am trying CVision’s Maestro Recognition Server version 4.0.339. It has a 30-day evaluation. I don’t know if it’s an industrial solution with an industrial price. The trial version allows up to 5000 scans during the evaluation. There is a Greek dictionary that can be added, and you can add a second language. Currently I am stuck with a “Batch error”; I think it may be a Windows permission issue. I’ll keep all posted.

I will try to keep tabs on the various programs out there. The most popular and available business programs are:
OmniPage Pro 16
ABBYY FineReader
ReadIris

Greek: Anagnostis (supposedly there is a Greek-only version for $58, but does it do English also?). Goofing around with the program demo, I was unable to change the scan depth. You really need 600 dpi for polytonic Greek.
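A bit of arithmetic shows why resolution matters so much for the diacritics. This is only a rough sketch: the conversion (one PostScript point is 1/72 inch) is standard, but the 1.5 pt height used for a breathing mark is an assumed figure for illustration, not a measurement from any particular edition.

```python
POINTS_PER_INCH = 72  # one PostScript point is 1/72 inch


def pixels_for_feature(size_in_points, dpi):
    """Pixels spanned by a printed feature of the given size at a scan resolution."""
    return size_in_points * dpi / POINTS_PER_INCH


# ASSUMPTION: a breathing mark in small lexicon type is about 1.5 pt tall.
MARK_HEIGHT_PT = 1.5

if __name__ == "__main__":
    for dpi in (150, 300, 600):
        px = pixels_for_feature(MARK_HEIGHT_PT, dpi)
        print(f"{dpi} dpi: about {px:.1f} px")
```

Under that assumption a breathing mark is only about 6 pixels tall at 300 dpi and about 12 at 600 dpi, which leaves very little to tell a smooth breathing from a rough one at the lower resolution.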

I think I can upgrade my old OmniPage Pro for $199.

Louis

annis · March 25, 2009, 11:58pm

I’m aware of no OCR system that copes with Greek correctly with satisfying regularity. At best it can speed up your first pass, but any text coming out the other end will still have to be carefully edited. For the longest time the Perseus people didn’t even attempt OCR, but instead hired out transcription by hand (double-entry, if I recall correctly). These days they’re adding a clever trick for the second pass — run the text through their morphological and lexical tools. If a word doesn’t come back as a likely real word, it gets special attention.
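The second-pass idea can be sketched in a few lines. This is not the Perseus tooling, which uses real morphological analysis; it is a much cruder stand-in that checks each OCR token against a plain list of known word forms and returns whatever is not found. The function names and the word list are hypothetical.

```python
import re
import unicodedata


def normalize(word):
    """Normalize to NFC and lowercase so equivalent spellings compare equal."""
    return unicodedata.normalize("NFC", word).lower()


def suspicious_words(ocr_text, known_forms):
    """Return the tokens of ocr_text that do not appear in known_forms."""
    known = {normalize(w) for w in known_forms}
    # [^\W\d_]+ matches runs of letters in any script.
    tokens = re.findall(r"[^\W\d_]+", unicodedata.normalize("NFC", ocr_text))
    return [t for t in tokens if normalize(t) not in known]
```

A proofreader would then look only at the returned words instead of rereading every letter. The obvious limitation is that a plain word list rejects any inflected form it does not contain, which is exactly what a morphological analyzer avoids.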

My blood runs cold at the thought of what an OCR system will do with an apparatus criticus.

petersig · April 5, 2009, 8:11pm

This past week I did try out Anagnostis 4.1 on the apparatus criticus for the Goettingen Septuaginta of Obadiah. Anagnostis has Ancient Greek, Modern Greek, and Greek/English modes, so it was worth a try. There are also some tricks of the trade floating around as lore in the UPenn Center for Computer Analysis of Texts. I hoped that some combination of software provisions and lore would allow passable OCR of the AppCrit. Here is what I found:

First of all, I have to qualify my findings by this: I was using the crippleware trial version of Anagnostis 4.1. It does not allow for saving even intermediate files. As it turned out, it did not allow me to save a training file I constructed myself, that included some CCAT tricks, like making substitute readings for the Gothic MT, or the critical apparatus siglum that’s a circle with a dot inside. If I could have saved a first-pass training file, then revised it in subsequent runs, I might have produced a passable reading of the app crit in stages. The stages I envisioned were:

Read the Greek, skip the English words, correct the numbers – SAVE the training file
Read, using Greek/English, the English words in the same file – SAVE the same training file.
Run the resulting training file for OCR and correct the results. – SAVE the results
Possibly darken or correct the underlying text to enhance OCR and SAVE.

When CCAT used one of the original Kurzweil scanners, it was possible to train it, through iterations and cleverness, to read Hebrew and Greek texts as Michigan-Claremont and Beta code. What they did was train the scanner/OCR to misread the Hebrew or Greek as their closest English (Latin) letters in appearance, then run search-and-replace programs to substitute other English (Latin) letters that actually represented the underlying Greek or Hebrew text.

It would be nice to be able to do that more directly in Unicode. I think the combination of Ancient Greek and Latin letters in the apparatus would defeat the old CCAT approach, though.
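For the Greek-only case, the last step of that approach (turning a Latin-letter encoding into real Greek) is straightforward in Unicode. The sketch below converts a small subset of Beta Code to precomposed polytonic Greek by mapping letters and diacritic symbols to Unicode characters and letting NFC normalization combine them. It is an illustration only, not CCAT’s actual tables or programs: the function name is hypothetical, capitals (the `*` prefix) and punctuation are not handled, and input is treated case-insensitively.

```python
import re
import unicodedata

# Beta Code letters mapped to lowercase Greek letters.
LETTERS = dict(zip("abgdezhqiklmncoprstufxyw",
                   "αβγδεζηθικλμνξοπρστυφχψω"))

# Beta Code diacritic symbols mapped to Unicode combining marks.
MARKS = {
    ")": "\u0313",   # smooth breathing
    "(": "\u0314",   # rough breathing
    "/": "\u0301",   # acute
    "\\": "\u0300",  # grave
    "=": "\u0342",   # circumflex (perispomeni)
    "+": "\u0308",   # diaeresis
    "|": "\u0345",   # iota subscript
}


def beta_to_unicode(beta):
    """Convert a lowercase-only subset of Beta Code to polytonic Greek."""
    pieces = [LETTERS.get(ch) or MARKS.get(ch) or ch for ch in beta.lower()]
    # NFC combines each letter with its marks into a precomposed character.
    text = unicodedata.normalize("NFC", "".join(pieces))
    # A sigma not followed by another letter becomes final sigma.
    return re.sub(r"σ(?!\w)", "ς", text)


if __name__ == "__main__":
    print(beta_to_unicode("a)/nqrwpos"))
```

As the post says, an apparatus that mixes Greek and Latin letters would defeat this: the converter has no way to know which Latin letters are meant literally.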

Anagnostis looks like a very sophisticated program. But without being able to save intermediate files, say for two days, there’s no way to completely evaluate the usability of Anagnostis for one’s scholarly purposes. What I could tell was that Anagnostis 4.1 could do a passable job of distinguishing Ancient Greek from Arabic numerals.

Sigrid Peterson
CCAT/CATSS Variants Project
petersig@ccat.sas.upenn.edu

Rothbardian · May 15, 2009, 5:01am

I spent a couple of hours training Tesseract, and I’m getting extremely good accuracy on some horrible scans.

tico · May 17, 2009, 12:07pm

“I spent a couple of hours training Tesseract, and I’m getting extremely good accuracy on some horrible scans.”

Rothbardian, could you please give us more details? Is there a Windows version of Tesseract? Googling didn’t help me to find an installer.

Rothbardian · May 28, 2009, 4:10am

I use Linux, but Windows is listed as a supported platform on http://code.google.com/p/tesseract-ocr/
(http://tesseract-ocr.googlecode.com/files/tesseract-2.01.exe.tar.gz seems to be a Windows executable). You’ll probably also want python and TesseractTrainer.py to train it.
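For anyone scripting this, Tesseract is driven from the command line as `tesseract IMAGE OUTPUTBASE -l LANG`. The Python sketch below wraps that call. It rests on assumptions that go beyond this post: a `tesseract` executable on the PATH, and a release newer than the 2.01 build linked above with Ancient Greek (`grc`) language data installed. It does not reproduce the custom training described here, and the function names are hypothetical.

```python
import shutil
import subprocess


def build_tesseract_command(image_path, output_base, lang="grc"):
    """Build the argument list for: tesseract IMAGE OUTPUTBASE -l LANG"""
    return ["tesseract", image_path, output_base, "-l", lang]


def run_ocr(image_path, output_base, lang="grc"):
    """Run Tesseract on one image and return the path of the text file it writes."""
    # ASSUMPTION: tesseract is installed and the requested language data exists.
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract executable not found on PATH")
    subprocess.run(build_tesseract_command(image_path, output_base, lang),
                   check=True)
    return output_base + ".txt"
```

Calling `run_ocr("page.tif", "page")` would write the recognized text to `page.txt`; passing a different `lang` value selects other language data, including data you have trained yourself.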

LSorenson · July 12, 2009, 6:48pm

I recently purchased ReadIris Pro 12 (Build 5644). I got it for $65, down from the normal price of $129, because of an HP deal from my printer. It took a little while to figure out the interface. (I think OmniPage Pro has a little more depth in regard to saving the layout, etc.) But Abbott-Smith’s lexicon has a very simple format (a title header and the text body).

I am using a pdf version of Abbott-Smith’s 1922 lexicon. I have two pdf versions. One, which I purchased, is clear (no bleeding of the text as is common in some old books). This version, however, has one clear page (white background) and one page which is off-colored (sort of an ivory). Kind of odd for a pdf. The other pdf file is the one from Google Books. The background color for all pages is white, but the text is bleeding.

I used Adobe Acrobat Pro 9 to extract each page as a separate pdf and then tried to open it with Readiris. ReadIris did not like the ivory background pages. I had to do some tweaking with resolution settings to get the text to show - otherwise it opened up the image as a blank page. I never got the ivory pages to open directly with text from the pdf file. I had to convert them to tiff files first and then ReadIris opened them fine.

After configuring a new dictionary to use for scanning, I began to train the software. One of the problems of Abbott-Smith is that it also contains Hebrew characters. ReadIris has a Middle-Eastern version, but the version I am using does not support Hebrew. It did manage to learn some of the characters but continues to have problems with the diacriticals. The bleeding text was too difficult to train the software with, so I gave up on the Google Books copy.

One of the problems with ReadIris is that while the image window for the ‘unknown text’ is very, very large, the space for the character which you can change is extremely small, perhaps font size 8, leaving Greek and Hebrew diacritics minuscule. It is almost impossible to determine what ReadIris thought the character was or what you typed in. ReadIris returns either a single character or a mix of characters, depending on whether it thinks it 1) knows the character, e.g. ‘ρ’ (or, if it is wrong, ‘p’), or 2) does not know it, ‘ ’ (blank). Some double glyphs were returned with the correct suggestions; those old Latin double letters like ff, AE, etc. almost always came back correct.
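Confusions like a Greek ‘ρ’ coming back as a Latin ‘p’ are hard to see on screen but easy to find by script: a word that mixes Greek and Latin letters is almost certainly a misread. The following is a minimal Python sketch of that check, independent of ReadIris. It classifies each letter by its Unicode character name, and the function names are hypothetical.

```python
import re
import unicodedata


def scripts_in(word):
    """Return the set of scripts (Greek and/or Latin) used by the letters in word."""
    scripts = set()
    for ch in word:
        name = unicodedata.name(ch, "")
        if name.startswith("GREEK"):
            scripts.add("Greek")
        elif name.startswith("LATIN"):
            scripts.add("Latin")
    return scripts


def mixed_script_words(text):
    """Return the words in text that contain both Greek and Latin letters."""
    words = re.findall(r"[^\W\d_]+", unicodedata.normalize("NFC", text))
    return [w for w in words if scripts_in(w) == {"Greek", "Latin"}]
```

In a lexicon such as Abbott-Smith, where Greek headwords and English glosses sit side by side, whole words in either script pass untouched and only the mixed ones are reported.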

There were several letters in Abbott-Smith which are composed of two parts (h, w, n). ReadIris never got these correct and treated them as several letters; I did not have the option to merge two glyphs into a single glyph. I had to hope that occasionally ReadIris would catch one and give me both parts of the same letter at once, which it did once in a while. But it still never learned them. This is a feature which should be added to the software.

ReadIris was very accurate with numbers. Abbott-Smith uses a mixture of large numbers for the chapters and raised smaller numbers for the verse. These were almost always correct - although sometimes it would keep them as raised and other times turn them into large numbers. Some of the Greek fonts for alpha and its combinations ἀ ἁ ἂ ἃ ἆ ᾶ etc. would end up being font 22, while others would turn into font size 10 - I could not figure this out - (I used the Save as Word option) and it was in the Word document where this happened.

As far as learning Greek, ReadIris got the simple letters correct, almost always got the vowels with the acute accents correct, but could not get a handle on the different diacriticals that appear over alpha, even after being trained numerous times (I went through about 6 pages of Abbott-Smith character by character).

So I plan on doing a little more training, but am going to take a look at Tesseract when I get a chance.

Louis

hotcajun · July 24, 2010, 5:46pm

Rothbardian,

Could you PLEASE make available to me (or others) the files that have resulted from your training of Tesseract on Greek? I would like to test some OCR on a Google Books pdf that I found for Lysistrata. I am having a hard time training Tesseract on Windows.

Thanks
Mark

katzenjammer · August 22, 2012, 6:39pm

Hey all - anyone ever make any progress with this? Good heavens this would be mighty useful!

gkm · February 11, 2015, 10:30pm

Greetings to all

For OCR
(http://wiki.digitalclassicist.org/How_do_I_scan_Greek_text_as_text_(OCR))

For Handwritten character recognition (by Benjamin Milde)
(http://shapecatcher.com/)

Note:
The Gamera Project is very interesting, because it can be TRAINED to learn ATTACHED (LINKED) letters, such as those that could be found in 1700s printed editions.

Georgios K. MICHALAKIS
My motto:

Oὐ γὰρ πράξην ἀγαθὴν, ἀλλὰ καὶ εὖ ποεῖν αὐτὴν.
It does not suffice to do good (to other) - one must do it well.
Il ne suffit pas de faire le bien, mais le bien faire.

Other variants include:
Oὐ γὰρ ἐν τῷ πράττειν τὸ ἀγαθὸν ἀλλ’ἐν τῷ εὖ ποεῖν αὐτὸ.
Πράξην ἀγαθὴν, εὖ ποεῖν αὐτὴν.
Ἀγαθὴν πράξην εὖ ποίησον ταύτην.

Does anyone have some other suggestion for a better variant?