Stedman Project

EberP · Post by **EberP** » Fri Oct 26, 2018 3:55 pm

After having done the same with Rouse's Grammar, I'm currently working on a project to OCR Stedman's Modern Greek Mastery (https://books.google.com.mx/books?id=Rh ... edir_esc=y) into a complete HTML document. The OCR process is complete, and I'm now working through the garbage in the document and trying to make a presentable HTML file. Stedman is written in good katharevousa and has interesting stories using modern vocabulary, even if I do not agree with his pedagogy.
If anyone would like to participate in this work, what is the correct process for setting up a collaboration?

Peitho · Post by **Peitho** » Fri Nov 02, 2018 7:33 pm

Cool! I’m glad someone is doing this important work.

I’m curious about the current state of the world of Ancient Greek OCR. I tried submitting a request to the Lace project, adding Pharr’s “Homeric Greek” to the spreadsheet they use to track requests: https://docs.google.com/spreadsheets/d/ ... Ui5tjF6DLo …but about a week and a half later, the website for Lace has vanished without a trace: http://heml.mta.ca/lace/ Do you know what happened to that project? And, failing that, what are the best tools to use for Ancient Greek OCR in 2018 (particularly for mixed Greek/English texts)? Thank you!

EberP · Post by **EberP** » Mon Nov 05, 2018 4:47 pm

Hello Peitho,

This is the best tool to work with OCR: https://github.com/tesseract-ocr/tesseract. If you have any more inquiries, I'm pretty sure there are people here in the forum more experienced with OCR than I am, but I will also be glad to help you however I can.

If you would like to assist me with the Stedman Project, let me know. This is a link to the current status of the project:
https://melainatigris8.wordpress.com/fe ... velopment/
Follow the link to the HTML page.

Thank you. Sincerely, Eber.

Peitho · Post by **Peitho** » Tue Nov 06, 2018 5:13 am

Cool, thanks!

Peitho · Post by **Peitho** » Thu Nov 08, 2018 11:02 pm

(Mods: feel free to move my questions to a different thread so I don’t threadjack this one)
I’ve been reading a few articles, here:
Deep Learning based Text Recognition (OCR) using Tesseract and OpenCV | Learn OpenCV
Understanding LSTM Networks -- colah's blog (for some technical background)
Training Tesseract for Ancient Greek OCR (PDF)

I guess what I want to know is… where do I start?!

I’m using Ubuntu MATE 18.04, I know my way fairly well around the command line, I’ve installed the “tesseract” and “tesseract-ocr-script-grek” packages, and I cloned the git repository at https://ancientgreekocr.org/grctraining.git. It’s hard to find any kind of “how to use Tesseract” articles online, though, that don’t assume a lot of knowledge already. I have basic questions, like… which command-line options are relevant to my use-case? Will I need to train the OCR engine (in which case, where does the training data persist? How is it invoked?), or will I be using data from that repository, or some combination of the two? etc. etc. Thanks!

jeidsath · Post by **jeidsath** » Thu Nov 08, 2018 11:10 pm

Once you've installed the polytonic Greek support for tesseract, it's something like this:

tesseract <imagename> output -l grc+eng

Beyond that, you'll need to go to the tesseract documentation.
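
For a scanned book that exists only as a PDF, one extra step is usually needed, because tesseract reads image files (PNG, TIFF and so on) rather than PDFs, and its second argument is an output base name to which it appends `.txt`. The following is a minimal sketch, not a tested workflow: it assumes `pdftoppm` (from poppler-utils) and the `grc` and `eng` language data are installed, and the helper names, the `pages` directory and the 300 dpi setting are all illustrative choices.

```python
import subprocess
from pathlib import Path


def pdftoppm_command(pdf, prefix, dpi=300):
    """Command that renders every PDF page to <prefix>-<n>.png at the given dpi."""
    return ["pdftoppm", "-r", str(dpi), "-png", str(pdf), str(prefix)]


def tesseract_command(image, langs=("grc", "eng")):
    """Command that OCRs one page image; tesseract appends .txt to the output base."""
    image = Path(image)
    output_base = image.with_suffix("")  # pages/page-01.png -> pages/page-01
    return ["tesseract", str(image), str(output_base), "-l", "+".join(langs)]


def ocr_pdf(pdf, workdir="pages"):
    """Rasterize the PDF, then OCR each page image in order (hypothetical helper)."""
    workdir = Path(workdir)
    workdir.mkdir(exist_ok=True)
    subprocess.run(pdftoppm_command(pdf, workdir / "page"), check=True)
    for image in sorted(workdir.glob("page-*.png")):
        subprocess.run(tesseract_command(image), check=True)
```

Calling `ocr_pdf("pharr.pdf")` (file name assumed) would leave one `.txt` file per page next to the page images, which can then be concatenated or proofread page by page.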

Peitho · Post by **Peitho** » Fri Nov 09, 2018 2:11 am

jeidsath wrote:Once you've installed the polytonic Greek support for tesseract

How?

…Unless I did that already, with the package “tesseract-ocr-script-grek”, or “tesseract-ocr-ell” (“ell” i.e. “ελληνικά”), which I just installed. Remember, I only cloned the repository for https://ancientgreekocr.org/grctraining.git; I… don’t actually know what to do with it.

I’ve installed OCRFeeder, though, per the advice here: Ancient Greek OCR on Linux. I’ve gotten it started, creating a project with my PDF of Pharr’s “Homeric Greek.” I don’t really have any idea how to point it to the Ancient Greek training data, though, let alone get it to recognize something as mixed English and polytonic Greek. Anyway, thanks for y’all’s continued help!
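
One way to check whether the polytonic Greek data is already installed is to ask tesseract which languages it can see, using its `--list-langs` option. The sketch below assumes the usual output shape (a "List of available languages" header followed by one name per line); the function names are hypothetical. If `grc` turns out to be missing, it is typically provided by a separate language package or by a `grc.traineddata` file placed in the tessdata directory.

```python
import subprocess


def parse_list_langs(output):
    """Extract language names from the text printed by `tesseract --list-langs`."""
    names = []
    for line in output.splitlines():
        line = line.strip()
        # Skip blank lines and the "List of available languages (N):" header.
        if line and not line.startswith("List of"):
            names.append(line)
    return names


def missing_languages(wanted, output):
    """Return the wanted languages that tesseract did not report, in order."""
    available = set(parse_list_langs(output))
    return [lang for lang in wanted if lang not in available]


def installed_languages():
    """Run tesseract and return the languages it reports (needs tesseract on PATH)."""
    result = subprocess.run(
        ["tesseract", "--list-langs"], capture_output=True, text=True, check=True
    )
    # Older releases print the list to stderr, so both streams are read.
    return parse_list_langs(result.stdout + result.stderr)
```

For the mixed Greek/English case in this thread, `missing_languages(["grc", "eng"], ...)` returning an empty list would mean that `-l grc+eng` has the data it needs.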

Peitho · Post by **Peitho** » Sat Nov 10, 2018 3:34 am

Okay, I think I got it going. OCRFeeder had too many issues, so here I’m using gImageReader as the frontend.

So, as you can see, the main problem I have now is that while recognition for stretches of Ancient Greek text is great, individual Greek words followed by English text (as in the vocabulary lists) are often scanned as English. I’m curious if there’s a way to ameliorate this.
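
This does not fix the recognition itself, but a small post-processing check can at least narrow down which vocabulary entries need manual correction. The sketch below assumes that every entry in a vocabulary list begins with a Greek headword, so any non-empty line whose first token contains no Greek letters is flagged as probably misread; the function names are illustrative, and it should only be run on the vocabulary sections, not on running English prose.

```python
# Basic Greek and Coptic block, plus Greek Extended (polytonic precomposed forms).
GREEK_RANGES = ((0x0370, 0x03FF), (0x1F00, 0x1FFF))


def has_greek(token):
    """True if any character of the token falls in a Greek Unicode block."""
    return any(lo <= ord(ch) <= hi for ch in token for lo, hi in GREEK_RANGES)


def flag_vocab_lines(lines):
    """Return 1-based numbers of non-empty lines whose first token has no Greek."""
    flagged = []
    for number, line in enumerate(lines, start=1):
        tokens = line.split()
        if tokens and not has_greek(tokens[0]):
            flagged.append(number)
    return flagged


if __name__ == "__main__":
    sample = ["λόγος, word", "Aoyos, word", "", "θεά, goddess"]
    print(flag_vocab_lines(sample))  # the second line starts with a Latin-script token
```

The flagged line numbers can then be checked against the page image, or those regions can be re-run with Greek-only recognition.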

hotcajun · Post by **hotcajun** » Mon Nov 12, 2018 4:15 pm

From the documentation in the front-end you are using:

If multiple pages are selected for recognition, the program allows the user to choose between either recognizing the full resp. manually selected area for each individual page, or performing a page-layout analysis on each page to automatically detect appropriate recognition areas.

This leads me to believe that you could perhaps mark the initial words of the vocab sections for Greek-only recognition. If you dig down into that functionality (and share what you find), that might be a way to get what you need.

Textkit Greek and Latin Forums
