computer searching files in ancient greek, how to?

hlawson38 · December 8, 2022, 12:36am

Does anybody know an introductory level how-to on computer searching text files in ancient Greek?

I am thinking of this: given a text file of principal parts of verbs, find a random principal part, and when requested, print on the screen the first principal part.

I am seeing this as helpful in learning the principal parts of verbs.

NolanusTrismegistus · December 31, 2022, 4:19am

I’m not sure if this is quite what you’re looking for, but you might find this useful.

Logeion ( https://logeion.uchicago.edu/ ) is an online word database for Greek and Latin that does some pretty reasonable lemmatizing of user input. Do a word search (try a verb like λέγω). The results show its principal parts and give you entries from several dictionaries, plus quotes straight from Greek authors that use the other principal parts (navigate using the arrows and select “Examples from the Corpus”). Sometimes they’ll even show examples in other dialects.

Try again, this time searching for λέξω (future active indicative of λέγω), and you’ll get https://logeion.uchicago.edu/morpho/λέξω with a bunch of charts. It looks like the backend maintains some kind of relationship between principal parts, conjugations, moods, tenses, etc., and the lemmatized form of the word: https://logeion.uchicago.edu/about/morpho

I’m no expert in computational linguistics or Greek, but principal parts don’t seem to be predictable, and although there are a lot of heuristics we learn to follow when studying Greek verbs to help us recognize them, we still have to learn all six principal parts for every verb; indeed, this is precisely what makes learning Greek so challenging (see Mastronarde 2E p. 51 and p. 386). So I imagine the system you’re describing would involve maintaining some kind of morphological correspondence similar to Logeion’s “morpho” system, lemmatizing each word in the text, and then seeing if your word matches.

But maybe a machine learning system could tease out some pattern linguists haven’t hitherto recognized? This might actually be an interesting experiment to run: see how many generations it takes for your ML system to figure out the correct principal parts for a given verb, whether it can generalize that across more verbs, and how it does when given different principal parts, etc.

hlawson38 · December 31, 2022, 5:23am

Thanks to NolanusTrismegistus.

I was thinking of something more primitive, to aid brute-force memorizing of the principal parts in a list like the one in Mastronarde.

NolanusTrismegistus · December 31, 2022, 5:59am

Oh okay. Let me see if I’m understanding you right. Are you thinking the application would work like this?

You’re given “λέξω”, and the answer would be “λέγω”.

Next turn, you’re given “ἐπέτρεψα”, and the answer would be “ἐπιτρέπω”

And so on.

If so, I could probably write a little program that does this in about an hour and would happily share my code with you. I think the biggest challenge would be sourcing lists of principal parts, but I could start with a subset of what’s in the Mastronarde appendix.

Let me know what you think before I decide to code anything.

hlawson38 · December 31, 2022, 6:24pm

That’s what I was imagining. Could I help by making a text file of Mastronarde’s principal-parts list, or a part of it for preliminary coding? I have no idea how much work this would be, and whether an amateur could produce something useful to you. I am a Linux user and have written two or three Bash scripts, mostly for fun.

Before posting my query, I made a sort of reconnaissance to see what might be required. I imagine starting with this: http://atticgreek.org/downloads/allPPalphabetic.pdf

First I might pull the words from the pdf file.

Then I edit the product to make a text file, one row for each verb, with its principal parts delimited, and a symbol when a part is missing in Mastronarde’s list.

I would need to do some learning, and the manual editing would take time. I’m slow at this kind of work. And I’d need to be sure to produce a suitable list. Is this the kind of thing an amateur could do? Maybe it would be best to start with the first ten verbs or so to make sure I can do this correctly, and fast enough.

hlawson38 · December 31, 2022, 7:34pm

Before investing a lot of work, this should be checked: https://ankiweb.net/shared/info/1182673372

NolanusTrismegistus · December 31, 2022, 10:04pm

I’m a Linux enthusiast myself (well met!) and I know a bit about coding and processing data.

After reading your recent posts and understanding your needs, I don’t think this is too difficult for an amateur (maybe with a bit of help, which I don’t mind providing).

Firstly, I strongly advise against using Bash for this. Bash is really designed for working in the Unix shell, and it’s not appropriate for this task. You’ll end up making more work for yourself, certainly much more than an amateur can manage. Most modern programming languages already have libraries for doing a lot of the things I’m about to describe. Consider a language like Python, Ruby, Node JS, etc.

Here’s a rough outline of how I think you can approach this project:

1. It doesn’t seem like there’s a need for a human to manually transcribe principal parts, as this work is already done in that PDF file http://atticgreek.org/downloads/allPPalphabetic.pdf . I almost wonder if someone’s already serialized this table into a JSON or CSV file floating around somewhere on the Internet.
2. Consider writing a script that processes the table in that PDF file by iterating over each row and serializing the verb and its principal parts into a CSV file, making sure to preserve the column order.
3. You could also prepare the CSV file manually by copy-pasting each line into a text file, adding commas between principal parts and a newline between verbs.
4. Program flow could work something like this:
   A. Load and parse the CSV into memory using your language’s library facilities for CSV. It’ll be represented as some kind of array-of-arrays data type in your language.
      - Each array element will be the whole list of principal parts for one verb (think of one row in that Mastronarde table), and this element can be indexed further to get the individual principal parts (a specific cell of that row).
   B. A random function selects one of these arrays (one verb). We also cache the first element of this array, which is the first person present indicative form of the verb and will be used as the solution for this round.
   C. From the same array chosen in the previous step, we randomly select one element to get an individual principal part to quiz the user on.
   D. This principal part is displayed to the user. The user can either type in the answer, and the program does a string match against the solution, or just hit enter to show the solution. The solution was already cached in step B, so we will use that to check against.
      - String matching for ancient Greek can be tricky, as it involves Unicode/UTF-8, accents and breathing marks, and whether the user has access to a Greek keyboard. So I’d recommend initially not processing any user input and just displaying the answer when the user hits enter.
   E. Steps B to D are repeated in a runtime loop. Have some easy way to exit the program (Ctrl-C or entering ‘q’).
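
To make the program flow above more concrete, here is a minimal Python sketch of it, not a finished program. It assumes a CSV with one verb per row, the first principal part in the first column, and an empty cell wherever a part is missing. The three sample rows and all function names (`load_rows`, `pick_question`, `strip_marks`, `is_correct`, `quiz`) are made up for illustration, so check the forms against your own list. One deliberate deviation from step C: the prompt is drawn from parts two to six when any exist, so you are not quizzed on the answer itself.

```python
import csv
import io
import random
import unicodedata

# Illustrative sample data (an assumption, not taken from any particular list):
# one verb per row, first principal part first, empty cell = missing part.
SAMPLE_CSV = """\
λέγω,λέξω,ἔλεξα,εἴρηκα,λέλεγμαι,ἐλέχθην
ἐπιτρέπω,ἐπιτρέψω,ἐπέτρεψα,ἐπιτέτροφα,ἐπιτέτραμμαι,ἐπετράπην
ἔρχομαι,ἐλεύσομαι,ἦλθον,ἐλήλυθα,,
"""


def load_rows(text):
    """Step A: parse CSV text into a list of rows, skipping blank lines."""
    return [row for row in csv.reader(io.StringIO(text)) if row]


def pick_question(rows, rng=random):
    """Steps B and C: pick a verb, cache its first part, pick a form to show."""
    row = rng.choice(rows)
    answer = row[0]
    # Prefer parts two to six; fall back to the first part if none are listed.
    candidates = [part for part in row[1:] if part] or [answer]
    return rng.choice(candidates), answer


def strip_marks(word):
    """Drop accents, breathings and iota subscripts so plain typing can match."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def is_correct(guess, answer):
    """Step D: lenient comparison that ignores all diacritics."""
    return strip_marks(guess.strip()) == strip_marks(answer)


def quiz(rows):
    """Step E: repeat until the user enters 'q' (or presses Ctrl-C)."""
    while True:
        prompt, answer = pick_question(rows)
        guess = input(f"{prompt}  (Enter = show answer, q = quit): ")
        if guess.strip() == "q":
            break
        if guess.strip() and is_correct(guess, answer):
            print("Correct:", answer)
        else:
            print("Answer:", answer)


# To try it interactively: quiz(load_rows(SAMPLE_CSV))
```

If you start with the simplest version suggested in step D, just press Enter at every prompt and the answer is shown. To use a real file, read it with `open(path, encoding="utf-8")` and pass its contents to `load_rows`. The lenient comparison in `is_correct` is only one possible choice; a stricter drill would compare the strings after NFC normalization instead of stripping the marks.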

Feel free to reach out over PM if you have any questions.

hlawson38 · January 1, 2023, 2:09am

Thanks Nolanus for the advice!

NolanusTrismegistus · January 1, 2023, 7:51pm

Textkit user UnsounderFiddle has prepared a table of 80% of the most common principal parts in Greek in XLSX format, so you don’t have to worry about that PDF wrangling step.

http://discourse.textkit.com/t/greek-principal-parts-spreadsheet/19470/4

You can either convert the XLSX to CSV, or find a library in your programming language to work with XLSX.

hlawson38 · January 1, 2023, 9:41pm

Many thanks for the tip!