Stedman Project

EberP
Textkit Neophyte
Posts: 10
Joined: Thu Sep 27, 2018 7:02 pm

Stedman Project

Post by EberP » Fri Oct 26, 2018 3:55 pm

After having done the same with Rouse's Grammar, I'm currently working on a project to OCR Stedman's Modern Greek Mastery (https://books.google.com.mx/books?id=Rh ... edir_esc=y) into a complete HTML document. The OCR process is complete, and I'm now working through the garbage in the output and trying to make a presentable HTML file. Stedman is written in good katharevousa and has interesting stories using modern vocabulary, even if I do not agree with his pedagogy.
If anyone would like to participate in this work, what is the correct process for setting up a collaboration?

Peitho
Textkit Neophyte
Posts: 16
Joined: Fri Jun 03, 2016 12:36 am

Re: Stedman Project

Post by Peitho » Fri Nov 02, 2018 7:33 pm

Cool! I’m glad someone is doing this important work.

I’m curious about the current state of the world of Ancient Greek OCR. I tried submitting a request to the Lace project, adding Pharr’s “Homeric Greek” to the spreadsheet they use to track requests: https://docs.google.com/spreadsheets/d/ ... Ui5tjF6DLo …but about a week and a half later, the website for Lace has vanished without a trace: http://heml.mta.ca/lace/ Do you know what happened to that project? And, failing that, what are the best tools to use for Ancient Greek OCR in 2018 (particularly for mixed Greek/English texts)? Thank you!

EberP
Textkit Neophyte
Posts: 10
Joined: Thu Sep 27, 2018 7:02 pm

Re: Stedman Project

Post by EberP » Mon Nov 05, 2018 4:47 pm

Hello Peitho,

This is the best tool to work with for OCR: https://github.com/tesseract-ocr/tesseract. If you have any more inquiries, I'm pretty sure there are people here in the forum more experienced with OCR than I am, but I will also be glad to help you however I can.

If you would like to assist me with the Stedman Project, let me know. This is a link to the current status of the project:
https://melainatigris8.wordpress.com/fe ... velopment/ (follow the link to the HTML page).

Thank you. Sincerely, Eber.

Peitho
Textkit Neophyte
Posts: 16
Joined: Fri Jun 03, 2016 12:36 am

Re: Stedman Project

Post by Peitho » Tue Nov 06, 2018 5:13 am

Cool, thanks!

Peitho
Textkit Neophyte
Posts: 16
Joined: Fri Jun 03, 2016 12:36 am

Re: Stedman Project

Post by Peitho » Thu Nov 08, 2018 11:02 pm

(Mods: feel free to move my questions to a different thread so I don’t threadjack this one)
I’ve been reading a few articles, here:
Deep Learning based Text Recognition (OCR) using Tesseract and OpenCV | Learn OpenCV
Understanding LSTM Networks -- colah's blog (for some technical background)
Training Tesseract for Ancient Greek OCR (PDF)

I guess what I want to know is… where do I start?! :? I’m using Ubuntu MATE 18.04, I know my way fairly well around the command line, I’ve installed the “tesseract” and “tesseract-ocr-script-grek” packages, and I cloned the git repository at https://ancientgreekocr.org/grctraining.git. It’s hard to find any kind of “how to use Tesseract” articles online, though, that don’t assume a lot of knowledge already. I have basic questions, like… which command-line options are relevant to my use-case? Will I need to train the OCR engine (in which case, where does the training data persist? How is it invoked?), or will I be using data from that repository, or some combination of the two? etc. etc. Thanks!

jeidsath
Administrator
Posts: 2556
Joined: Mon Dec 30, 2013 2:42 pm
Location: Γαλεήπολις, Οὐισκόνσιν

Re: Stedman Project

Post by jeidsath » Thu Nov 08, 2018 11:10 pm

Once you've installed the polytonic Greek support for tesseract, it's something like this:

tesseract <imagename> output -l grc+eng

Beyond that, you'll need to go to the tesseract documentation.
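
For anyone who wants a slightly fuller picture, here is a rough Python sketch of that workflow. As far as I know, Tesseract reads page images rather than PDFs, and it appends ".txt" to the output base name on its own, so the sketch assumes the PDF is first split into PNG pages with pdftoppm (from poppler-utils) and that the grc and eng language data are installed. The helper names, the "pages" directory, and the file name pharr.pdf are made up for illustration.

```python
import shutil
import subprocess
from pathlib import Path


def pdftoppm_command(pdf, prefix, dpi=300):
    """Command that renders every PDF page to <prefix>-<n>.png."""
    return ["pdftoppm", "-r", str(dpi), "-png", str(pdf), str(prefix)]


def tesseract_command(image, langs=("grc", "eng")):
    """Command that OCRs one image; text lands in <image stem>.txt."""
    image = Path(image)
    out_base = image.with_suffix("")  # tesseract adds ".txt" itself
    return ["tesseract", str(image), str(out_base), "-l", "+".join(langs)]


def ocr_pdf(pdf, workdir="pages"):
    """Split a PDF into page images, then OCR each page in turn."""
    workdir = Path(workdir)
    workdir.mkdir(exist_ok=True)
    subprocess.run(pdftoppm_command(pdf, workdir / "page"), check=True)
    for image in sorted(workdir.glob("page-*.png")):
        subprocess.run(tesseract_command(image), check=True)


if __name__ == "__main__":
    # Hypothetical input file; only run when both tools are on the PATH.
    tools_present = shutil.which("pdftoppm") and shutil.which("tesseract")
    if tools_present and Path("pharr.pdf").exists():
        ocr_pdf("pharr.pdf")
```

Listing the languages as grc+eng asks Tesseract to consider both sets of data on every page, which is what a mixed Greek/English textbook needs.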
Joel Eidsath -- jeidsath@gmail.com

μὴ δ’ οὕτως ἀγαθός περ ἐὼν θεοείκελ’ Ἀχιλλεῦ
κλέπτε νόῳ, ἐπεὶ οὐ παρελεύσεαι οὐδέ με πείσεις.

Peitho
Textkit Neophyte
Posts: 16
Joined: Fri Jun 03, 2016 12:36 am

Re: Stedman Project

Post by Peitho » Fri Nov 09, 2018 2:11 am

jeidsath wrote:Once you've installed the polytonic Greek support for tesseract
How? :sad: …Unless I did that already, with the package “tesseract-ocr-script-grek”, or “tesseract-ocr-ell” (“ell” i.e. “ελληνικά”), which I just installed. Remember, I only cloned the repository for https://ancientgreekocr.org/grctraining.git, I… don’t actually know what to do with it :oops:

I’ve installed OCRFeeder, though, per the advice here: Ancient Greek OCR on Linux. I’ve gotten it started, creating a project with my PDF of Pharr’s “Homeric Greek.” I don’t really have any idea how to point it to the Ancient Greek training data, though, let alone get it to recognize something as mixed English and polytonic Greek. Anyway, thanks for y’all’s continued help! :)
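
One way to check whether the polytonic Greek data is actually installed is to ask Tesseract which languages it can see, with "tesseract --list-langs". Below is a minimal Python sketch of that check. It assumes the listing is a header line followed by one language code per line (the format I have seen from Tesseract 4), and the function names are only for illustration. The code grc is the one used in the command earlier in the thread; on Ubuntu I would expect it to come from a package called tesseract-ocr-grc, but treat that package name as an assumption.

```python
import shutil
import subprocess


def parse_langs(listing):
    """Return the language codes from `tesseract --list-langs` output."""
    lines = [ln.strip() for ln in listing.splitlines() if ln.strip()]
    # Skip the header line, e.g. "List of available languages (3):".
    return {ln for ln in lines if not ln.lower().startswith("list of")}


def missing_langs(wanted, listing):
    """Languages we want that do not appear in the listing."""
    return sorted(set(wanted) - parse_langs(listing))


if __name__ == "__main__" and shutil.which("tesseract"):
    # Older versions print the listing to stderr, so capture both streams.
    result = subprocess.run(
        ["tesseract", "--list-langs"],
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        universal_newlines=True,
    )
    print("missing:", missing_langs(["grc", "eng"], result.stdout))
```

If grc shows up as missing, the installed packages only cover Modern Greek (ell) or the generic Greek script data, which would explain why "-l grc" fails.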

Peitho
Textkit Neophyte
Posts: 16
Joined: Fri Jun 03, 2016 12:36 am

Re: Stedman Project

Post by Peitho » Sat Nov 10, 2018 3:34 am

Okay, I think I got it going. OCRFeeder had too many issues, so here I’m using gImageReader as the frontend.

Image

So, as you can see, the main problem I have now is that while recognition for stretches of Ancient Greek text is great, individual Greek words followed by English text (as in the vocabulary lists) are often scanned as English. I’m curious if there’s a way to ameliorate this.

hotcajun
Textkit Neophyte
Posts: 2
Joined: Sat Jul 24, 2010 5:35 pm

Re: Stedman Project

Post by hotcajun » Mon Nov 12, 2018 4:15 pm

From the documentation in the front-end you are using:
If multiple pages are selected for recognition, the program allows the user to choose between either recognizing the full resp. manually selected area for each individual page, or performing a page-layout analysis on each page to automatically detect appropriate recognition areas.
This leads me to believe that you could perhaps mark the initial words of the vocab sections for Greek-only recognition. If you dig into that functionality (and share what you find), that might be a way to get what you need.
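
To find which vocabulary entries need that treatment, a small script could flag the lines in the OCR output whose first word came out in Latin letters. The Python sketch below assumes every vocabulary line is supposed to start with a Greek headword; the function names and the garbled sample line are invented for illustration, not taken from a real scan.

```python
import unicodedata


def is_greek_word(word):
    """True if the word has letters and all of them are Greek."""
    letters = [ch for ch in word if ch.isalpha()]
    return bool(letters) and all(
        unicodedata.name(ch, "").startswith("GREEK") for ch in letters
    )


def suspect_lines(text):
    """Yield (line number, line) where the first word is not Greek."""
    for number, line in enumerate(text.splitlines(), start=1):
        words = line.split()
        if words and not is_greek_word(words[0]):
            yield number, line


if __name__ == "__main__":
    # Invented sample: the second headword was misread as Latin letters.
    sample = "μῆνις, ιος, ἡ  wrath\nBea, Gs, 1  goddess\n"
    for number, line in suspect_lines(sample):
        print(number, line)
```

The flagged line numbers give a list of spots to re-select by hand in the front-end and recognize again with Greek only.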
