open-source, machine-readable translating dictionaries

bcrowell · April 26, 2021, 3:22pm

I’m working on an open-source hobby project to try to present ancient Greek texts in innovative ways to help readers. As part of this project, it would be helpful if I could find machine-readable files, in the public domain or under an open-source license, that give possible English translations of Greek words. Failing that, it would help if I could find machine-readable files containing dictionary definitions. When I say machine-readable, I mean something formatted for use by a machine, not just a web page such as the markup of LSJ et al. at logeion.uchicago.edu. Does anyone know of any resource of this type? I’m especially interested in doing the Homeric vocabulary, but, e.g., koine would be helpful too.

As an example, for a word like ξίφος, I would like to have something like this:

ξίφος sword

Obviously translations are not always one-to-one, and this will not be possible for all words. So for example, we might have:

χιτών tunic coat frock covering

And many words simply won’t have any one-word English equivalents, so, e.g., a word like ἐγγίζω, “bring near,” could be left out completely, given an explanation-style definition, given a multi-word translation, or just marked to show that it has no one-word translation.
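
A minimal sketch of how a file in this format could be read, assuming one entry per line with whitespace-separated fields and an empty list marking a word that has no one-word translation. The function name parse_lexicon is made up for illustration.

```python
def parse_lexicon(text):
    """Map each Greek headword to its list of one-word English correlates."""
    lexicon = {}
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        headword, correlates = fields[0], fields[1:]
        # Assumption: a headword with nothing after it has no one-word translation.
        lexicon[headword] = correlates
    return lexicon


sample = "ξίφος sword\nχιτών tunic coat frock covering\nἐγγίζω\n"
lexicon = parse_lexicon(sample)
for headword, correlates in lexicon.items():
    print(headword, correlates)
```

A plain dictionary of lists keeps the one-to-many case and the no-translation case in the same structure.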

I know this all sounds sort of picky, like I’m walking onto a used car lot and demanding a purple 1967 Mustang. Actually I would be happy with any sort of approximation to these requirements if it could save me the work of constructing such a lexicon myself, e.g., by reading LSJ and constructing entries by hand. There are also open-source software libraries designed to parse wiktionary entries, so I could try that in order to save myself at least some labor. However, I really am not interested at all in anything that is not free and open-source. I want to keep my project completely free, and in particular I don’t want to contaminate it with anything that has a license such as CC-BY-NC, with a noncommercial clause.

Currently my main use for these data would not be to present them to humans but rather as a behind-the-scenes tool for me to do statistical analysis. That means that I don’t care too much if the data contain errors or are not authoritative. Even if as many as 50% of the entries were entirely wrong, it would still be usable for my purposes, although obviously not ideal. The current statistical task that I’m trying to accomplish is getting software to automatically associate sentences in Homer with sentences in the various public-domain English translations. I already have this working well for English-to-English, which is obviously easier. That is, my software can figure out pretty reliably that a certain sentence in Pope’s translation of the Iliad corresponds to a certain sentence in the Lang translation.

The best approximation I’ve found so far to what I’m talking about seems to be Ancient Greek Wordnet. There is also something called Open Ancient Greek Wordnet, which seems to be different. However, the original Wordnet and its spin-offs and copycats are not really translating dictionaries. They’re more like maps of conceptual categories and synonyms.

Another pretty interesting resource is Giles, The Odyssey of Homer : construed literally, and word for word, 1800. This is available on archive.org and consists of a Greek paraphrase of Homer (with the word-order mangled to be more like English) and English words interspersed. As an example, the opening is this:

Εννεπε declare μοι to me, Μουσα Muse, ανδρα the man …

I wouldn’t want to experience Homer by reading it in this format, but it does seem to provide a large number of word-to-word translations in Homeric Greek and in machine-readable format. However, trying to put this bilingual text through OCR software to get reasonable results seems like it would be somewhat of a project. (I’ve used the open-source Tesseract OCR system before, and it is supposed to be possible to make it work with polytonic Greek, but it probably isn’t smart enough to tell which words in Giles are Greek and which are English.)
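
Once the text has been through OCR, telling the Greek words from the English ones could be done after the fact by looking at the Unicode script of each word. The following is only a sketch under that assumption. The helper names is_greek and pair_interlinear are hypothetical, and it assumes precomposed characters, since combining accents would split a word under the \w pattern.

```python
import re
import unicodedata


def is_greek(word):
    """True if every letter in the word belongs to the Greek script."""
    letters = [ch for ch in word if ch.isalpha()]
    return bool(letters) and all(
        "GREEK" in unicodedata.name(ch, "") for ch in letters
    )


def pair_interlinear(text):
    """Group each Greek word with the run of English words that follows it."""
    pairs = []
    for token in re.findall(r"\w+", text):
        if is_greek(token):
            pairs.append((token, []))
        elif pairs:
            # Non-Greek words are attached to the most recent Greek word.
            pairs[-1][1].append(token.lower())
    return pairs


line = "Εννεπε declare μοι to me, Μουσα Muse, ανδρα the man"
pairs = pair_interlinear(line)
for greek, english in pairs:
    print(greek, english)
```

This relies on Giles always putting the Greek word before its English gloss, which holds for the opening quoted above but is not checked for the rest of the book.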

will.dawe · April 26, 2021, 11:28pm

So, you seek for an Attic-English dictionary with very short, compact definitions? It may be difficult to build such a vocabulary because many words are polysemous. Try these resources:

It seems that the authors of Eulexis took a dictionary and picked the first word marked up as a translation, so sometimes their translations fail. For example, λαμβάνω — [en] take, seize, receive; [fr] une; [de] ein. Anyway, their approach is reasonable.

Also, Middle Liddell (Liddell & Scott, 1889) is more compact than LSJ (example).

Just for information, almost all digitized Ancient Greek dictionaries are collected on LSJ.gr.

bcrowell · April 27, 2021, 6:34pm

Thanks, will.dawe, that’s very helpful. Steadman and LSJ.gr don’t have licenses that are compatible with what I’m doing, but Dickinson is CC-BY-SA, and Eulexis is GFDL, both of which work for me, and they’re both also formatted in a machine-readable way. (For human-formatted dictionary entries, wiktionary seems to have pretty solid coverage and has an unproblematic license. But human-formatted dictionary entries are not directly usable for what I’m doing.)

So, you seek for an Attic-English dictionary with very short, compact definitions?

My application is computer correlation of the Greek text with English translations, so what I need is not definitions at all but lists of English words that would be likely to occur in a translation. So for example, Dickinson’s definition of οὕτως as “in this way,” is not useful to me, because an English word like “in” is not going to have a significant correlation with the occurrence of οὕτως in the Greek. But I can fairly efficiently go through a list like Dickinson’s and find entries like

πόλις πόλεως, ἡ city, city-state

that can be converted to

πόλις city,

which is useful to me, because there is likely to be a statistical correlation between πόλις and city. (I’m using a lemmatizer that would convert inflected forms like πόλει and cities to πόλις and city.)
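
The conversion could be partly automated along these lines. This is a rough sketch, assuming each entry has the layout shown above (headword first, then other Greek forms, then comma-separated English glosses). The function name one_word_correlates is hypothetical, and the actual layout of the Dickinson file may differ.

```python
import re


def one_word_correlates(entry):
    """Reduce 'headword forms... gloss, gloss' to (headword, one-word glosses)."""
    headword = entry.split()[0]
    # Assumption: the first Latin letter marks the start of the English gloss.
    match = re.search(r"[A-Za-z]", entry)
    if match is None:
        return headword, []
    glosses = [g.strip() for g in re.split(r"[,;]", entry[match.start():])]
    # Keep only single unhyphenated words; phrases such as "in this way"
    # are dropped because words like "in" correlate poorly.
    return headword, [g for g in glosses if re.fullmatch(r"[A-Za-z]+", g)]


print(one_word_correlates("πόλις πόλεως, ἡ city, city-state"))
print(one_word_correlates("οὕτως in this way"))
```

Entries that come back with an empty list would still need to be looked at by hand.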

The Eulexis data have the Greek encoded using ascii characters, e.g., ἀνίστημι as a)ni/sthmi. I wonder if this encoding is standardized, has a standard name, or is documented anywhere.

will.dawe · April 29, 2021, 4:32am

Eulexis is GFDL

It’s licensed as GPL 3.0, isn’t it?

The Eulexis data have the Greek encoded using ascii characters, e.g., ἀνίστημι as a)ni/sthmi. I wonder if this encoding is standardized, has a standard name, or is documented anywhere.

See file betunicode_gr.csv. Be careful with “GREEK beta intérieur”.
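
The encoding looks like Beta Code, the ASCII transliteration of Greek used by the TLG and Perseus. Below is a minimal sketch of a decoder for lowercase words. It hard-codes the usual letter and diacritic table rather than reading betunicode_gr.csv, it ignores capitals (written with * in Beta Code), and it always writes β, so it does nothing about the medial beta mentioned above. The function name beta_to_unicode is made up for illustration.

```python
import unicodedata

# Standard Beta Code letters and their Greek equivalents.
LETTERS = dict(zip("abgdezhqiklmncoprstufxyw", "αβγδεζηθικλμνξοπρστυφχψω"))

# Diacritics follow the letter in Beta Code; map them to combining marks.
MARKS = {
    ")": "\u0313",   # smooth breathing
    "(": "\u0314",   # rough breathing
    "/": "\u0301",   # acute
    "\\": "\u0300",  # grave
    "=": "\u0342",   # circumflex
    "|": "\u0345",   # iota subscript
    "+": "\u0308",   # diaeresis
}


def beta_to_unicode(beta):
    """Convert lowercase Beta Code to precomposed polytonic Greek."""
    text = beta.lower()
    out = []
    for i, ch in enumerate(text):
        if ch == "s":
            # Sigma is final unless another letter follows it in the same word.
            rest = text[i + 1:].lstrip("".join(MARKS))
            out.append("σ" if rest[:1].isalpha() else "ς")
        elif ch in LETTERS:
            out.append(LETTERS[ch])
        elif ch in MARKS:
            out.append(MARKS[ch])
        else:
            out.append(ch)  # spaces, punctuation, anything unrecognized
    # NFC folds each letter plus its combining marks into one character.
    return unicodedata.normalize("NFC", "".join(out))


print(beta_to_unicode("a)ni/sthmi"))
print(beta_to_unicode("po/lis"))
```

Emitting combining marks and normalizing afterwards avoids having to list every precomposed accented letter by hand.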

will.dawe · April 29, 2021, 1:35pm

Ancient Greek WordNet (22420 lemmas; unsupervised, precision 56%; CC BY-SA)

Online demo, choose “Greek” (it’s actually Ancient Greek) and type in words without diacritics. For example, “κορη”:

a young unmarried woman
a female human offspring
a youthful female person
a young woman
a girl or young woman with whom a man is romantically involved
an unmarried girl (especially a virgin)
a female domestic
(cricket) an over in which no runs are scored
serving to set in motion
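
To look up words from a polytonic text in the demo, the diacritics have to be removed first. A small sketch of one way to do that with Unicode normalization (the function name strip_diacritics is hypothetical):

```python
import unicodedata


def strip_diacritics(word):
    """Remove accents, breathings and other combining marks from a word."""
    # NFD splits each accented letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", word)
    bare = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return unicodedata.normalize("NFC", bare)


print(strip_diacritics("κόρη"))
```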

will.dawe · May 1, 2021, 8:51pm

Parallel Translations from the English Wiktionary (Tiago, 2018)
4,308 pairs English - Ancient Greek; CSV format:

173875	4643	girl/young female person	anci1242	Ancient Greek	κόρη	f
91898	2467	love/strong affection	anci1242	Ancient Greek	ἀγάπη	f
92309	2474	love/have a strong affection for	anci1242	Ancient Greek	ἀγαπάω	
986236	36705	dance/move rhythmically to music	anci1242	Ancient Greek	χορεύω	
986237	36705	dance/move rhythmically to music	anci1242	Ancient Greek	βαλλίζω	
225948	6050	rose/flower	anci1242	Ancient Greek	ῥόδον	n
...
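
The rows above are tab-separated, so they can be turned into a Greek-to-English table with very little code. This is a sketch only. It assumes that the third column is "English word/sense description" and that the sixth column is the Greek form, which is what the sample rows suggest but is not documented here. The function name load_translations is hypothetical.

```python
import csv
import io
from collections import defaultdict


def load_translations(tsv_text):
    """Build {Greek form: set of English words} from tab-separated rows."""
    greek_to_english = defaultdict(set)
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) < 6:
            continue  # skip truncated or malformed rows
        gloss, form = row[2], row[5]
        # Keep the English word before the slash and drop the sense description.
        english = gloss.split("/", 1)[0].strip()
        greek_to_english[form].add(english)
    return greek_to_english


sample = (
    "173875\t4643\tgirl/young female person\tanci1242\tAncient Greek\tκόρη\tf\n"
    "91898\t2467\tlove/strong affection\tanci1242\tAncient Greek\tἀγάπη\tf\n"
    "92309\t2474\tlove/have a strong affection for\tanci1242\tAncient Greek\tἀγαπάω\t\n"
)
table = load_translations(sample)
for greek, english in table.items():
    print(greek, sorted(english))
```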

bcrowell · May 2, 2021, 2:10pm

Thanks, will.dawe, that’s very helpful!

It turns out that perseids.org has a machine-formatted version of LSJ that is not encumbered by a restrictive license.

I guess I misunderstood what the Ancient Greek Wordnet included. Strange, though, that they have the definition involving cricket for κόρη. I wonder if they intentionally mix modern and ancient Greek, or if that was just an error.

At this point what I’ve done is to put together several files of possible one-word correlates in which I prepared a list of English words by hand. As a new user, I’m not allowed to post links on textkit, but the files are in my github repository bcrowell/rashomon, in the subdirectory tr. I have:

240 Homeric hapax legomena that I think might have good one-word correlates, e.g., ἀμφαραβέω → rattle

379 moderately common words from the Dickinson list, e.g., ἑπτά → seven

62 words culled from my own flashcards that I think might have good one-word correlates, e.g., θώς → jackal

In many cases I’ve put in multiple translations, e.g., ἀρετή → goodness excellence virtue valor bravery. I don’t know yet whether these will be useful statistically or not. In other cases I’ve put in a list of possible correlates that do not individually give a translation of the word, e.g., ἀκοντιστύς → javelin dart contest game.

What I’m hoping to get working next is to get my software to evaluate for me how precise the correlations are. The idea is that if θώς occurs in ch. 1 of the Iliad, 5.3% of the way through the whole text, and jackal occurs in Pope’s English translation at 5.1% of the way, then this is probably a real match, whereas if the Greek word occurs at 5.3% but the English one very near the end at 97.0%, then that’s some sort of random background of false coincidences. I’m going to see if I can get the software to measure a signal-to-noise ratio for each pair of correlates. Hopefully that will give me some insight into which types of words work best. It may be that hapaxes give the most reliable initial mapping from one text to the other, but that an algorithm could then refine this with more common words.
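
One crude way to put a number on this, sketched here under stated assumptions rather than as what the software actually does: count the pairs of occurrences whose relative positions agree to within a small window, and divide by the number of pairs that would be expected to agree by chance if the positions were unrelated and uniformly spread. The function names and the window of 1% are hypothetical choices.

```python
def relative_positions(tokens, word):
    """Fractional positions (0 to 1) at which word occurs in the token list."""
    n = len(tokens)
    return [i / n for i, token in enumerate(tokens) if token == word]


def signal_to_noise(greek_tokens, english_tokens, greek_word, english_word, window=0.01):
    """Ratio of nearby occurrence pairs to the number expected by chance."""
    g = relative_positions(greek_tokens, greek_word)
    e = relative_positions(english_tokens, english_word)
    if not g or not e:
        return None  # one of the words never occurs
    hits = sum(1 for x in g for y in e if abs(x - y) <= window)
    # For unrelated, uniformly spread positions, about 2 * window of all
    # pairs land within the window (edge effects are ignored).
    expected = len(g) * len(e) * 2 * window
    return hits / expected


# Toy texts of 1000 tokens each, with one occurrence of each word.
greek = ["x"] * 1000
english = ["y"] * 1000
greek[53] = "θώς"
english[51] = "jackal"
near = signal_to_noise(greek, english, "θώς", "jackal")

english_far = ["y"] * 1000
english_far[970] = "jackal"
far = signal_to_noise(greek, english_far, "θώς", "jackal")
print(near, far)
```

A ratio well above 1 suggests a real correlate, while a ratio near or below 1 looks like background. For a hapax the count is all or nothing, so the ratio is more informative for words that occur several times.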

I really appreciate your persistence in helping me find sources of data.

will.dawe · May 2, 2021, 3:10pm

You have found a very interesting source. Someone named @thepos has published two transcripts on Archive.org:

They are plain text (without markup), but in the public domain.

It’s very intriguing, but it would be better to contact @thepos to be sure that this LSJ was not produced from Perseus’ files.

The key word in the description of this dataset is “unsupervised”, i.e., the Greek and English meanings were matched automatically and were not checked by humans. Hence the high level of errors.

If you have two texts, Greek and English, then some “word alignment” algorithms can be applied. This topic has been developing actively in recent years for the purposes of machine translation.

bcrowell · May 2, 2021, 6:14pm

Ah, thanks for the pointer to the idea of word alignment. Since I’m just a hobbyist, there is all kinds of stuff like this that I don’t know exists, and then as soon as I know that it has a name, I can google it.

I would describe what I’m trying to do as “sentence alignment,” and I think it’s a much, much easier problem than word alignment.

bcrowell · May 9, 2021, 11:45pm

It turns out that sentence alignment is a topic that has been studied for a long time. There are algorithms by Brown and Moore, which have been implemented in open-source software packages such as mALIGNa and SMT-LowRec. Other open-source software includes hunalign and Gargantua. Some of these algorithms work better on languages that are similar, such as German-English, and poorly on more dissimilar pairs like Chinese-English. Some require a dictionary and some don’t.