I am wondering about the frequency of cases in Latin. I know genitive is least frequent and I would imagine accusative is most frequent, with nominative being a close second. What I am wondering about is ablative and dative. My inclination would be to say ablative is more frequent than dative. Thoughts?
nouns (19630):
- accusative 31.6%
- ablative 25.8%
- nominative 22.6%
- genitive 13.6%
- dative 4.6%
- vocative 1.2%
- locative 0.2%
- unknown 0.4%
adjectives (7497):
- accusative 33.0%
- nominative 31.6%
- ablative 21.8%
- genitive 8.6%
- dative 4.2%
- vocative 0.6%
- locative 0.0%
- unknown 0.0%
pronouns (6289):
- accusative 33.1%
- nominative 32.0%
- ablative 13.1%
- dative 13.1%
- genitive 8.5%
- vocative 0.1%
- unknown 0.1%
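For anyone who wants to reproduce this kind of tally, here is a minimal sketch rather than the exact script behind the numbers above. It assumes each token carries a 9-character positional tag in which the first character is the part of speech and the eighth is the case; that layout, the code tables, and the sample tags are assumptions made for illustration and should be checked against the treebank's own documentation.

```python
from collections import Counter, defaultdict

# Assumed tag layout: index 0 = part of speech, index 7 = case.
POS_NAMES = {"n": "noun", "a": "adjective", "p": "pronoun"}
CASE_NAMES = {
    "n": "nominative", "g": "genitive", "d": "dative", "a": "accusative",
    "b": "ablative", "v": "vocative", "l": "locative", "-": "unknown",
}

def case_shares(tags):
    """Return {part of speech: {case: percentage}} for nouns, adjectives, pronouns."""
    counts = defaultdict(Counter)
    for tag in tags:
        # Skip malformed tags and parts of speech we are not counting.
        if len(tag) != 9 or tag[0] not in POS_NAMES:
            continue
        case = CASE_NAMES.get(tag[7], "unknown")
        counts[POS_NAMES[tag[0]]][case] += 1
    shares = {}
    for pos, counter in counts.items():
        total = sum(counter.values())
        # most_common() orders the cases from most to least frequent.
        shares[pos] = {case: round(100 * n / total, 1)
                       for case, n in counter.most_common()}
    return shares

# Hypothetical tags, invented for the example (not taken from the treebank).
sample = ["n-s---fn-", "n-s---fa-", "n-p---mb-", "n-s---ma-",
          "a-s---fn-", "v3spia---"]
result = case_shares(sample)
```

With real data, `tags` would be read from the treebank files; how the tags are stored there is not shown in this sketch.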
Thank you for your very thorough response. It would appear I was mistaken about genitive being the least frequent. So much for the Cambridge Latin Course’s claim to introduce the cases by frequency.
Multas gratias ago tibi.
Edit: I wonder why it is that ablative and nominative are reversed for frequency with nouns and adjectives. Perhaps it would relate to substantive usage of the adjective.
What I want to know is “unknown.” Is that like the altar to the unknown god in Acts 17?
But, frequency is really irrelevant to learning the language (though I admit it’s kind of interesting). When you see a genitive or a dative, you still have to know what it is and what it’s doing.
I am interested in these results. How does the analysis distinguish between identical endings of different cases? Without giving an exhaustive list: the dative and ablative plurals are identical in the 1st and 2nd declensions, and so on. The Latin word tool often suggests a case on the basis of “statistical methods”, with no claim to accuracy.
My apologies if this has already been explained elsewhere.
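To make the ambiguity concrete, here is a small sketch of why the ending alone cannot settle the case. It uses a simplified table of 1st-declension endings only (vowel length and the locative are ignored), and the function name and table are made up for this illustration; they are not part of any tool mentioned in this thread.

```python
# Simplified 1st-declension endings mapped to every reading they allow.
FIRST_DECLENSION = {
    "a": [("nominative", "singular"), ("vocative", "singular"),
          ("ablative", "singular")],
    "ae": [("genitive", "singular"), ("dative", "singular"),
           ("nominative", "plural"), ("vocative", "plural")],
    "am": [("accusative", "singular")],
    "arum": [("genitive", "plural")],
    "is": [("dative", "plural"), ("ablative", "plural")],
    "as": [("accusative", "plural")],
}

def candidate_readings(form):
    """List the (case, number) readings a 1st-declension form could have."""
    # Try longer endings first so that "arum" is not mistaken for a shorter one.
    for ending in sorted(FIRST_DECLENSION, key=len, reverse=True):
        if form.endswith(ending):
            return FIRST_DECLENSION[ending]
    return []

ambiguous = candidate_readings("puellis")
unambiguous = candidate_readings("puellam")
```

A form like "puellis" comes back with two readings, so a choice between dative and ablative has to come from the sentence context, whether a human annotator or a statistical tagger makes it.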
The question wasn’t related to learning the language, per se. I am teaching a class on the relationship of Latin to English (and other languages) and just got into the idea of cases and inflection and thought that might be an interesting angle.
As far as I can see, most of the tokens with “unknown case” belong to the lemma “nihil”, such as: “Nihil agis, nihil moliris, nihil cogitas, quod non ego non modo audiam, sed etiam videam planeque sentiam.”
Other dubious words: tot, aliquot, fas, nefas, utraque, nequam, aether, Tempe, Celer, etc.
But also “mulier” in the following sentence: “et mulier quam vidisti est civitas magna quae habet regnum super reges terrae”.
The Latin Dependency Treebank is said to be semi-automatically annotated, so I cannot say whether the POS tags were added by humans or produced by computer.
Thanks for your reply @will.dawe.
I followed the link you gave in your first post and found the list of texts which seem to have been included in this analysis. I had thought that all texts in Perseus were included, but only a few texts are, as follows.
This repository contains the data for the Latin Dependency Treebank 2.1. It comprises the data available in the older version plus new texts. Information about the editions of the texts can be found within the files. The currently available texts are:
| Author | Text |
|---|---|
| Augustus | Res Gestae |
| Caesar | Commentarii de Bello Gallico (2.1-2.3; 2.5; 2.7; 2.9; 2.14-2.18; 2.32-2.33) |
| Cicero | In Catilinam (1.1-2.11) |
| Jerome | Vulgata |
| Vergil | Aeneid (6.1-6.336) |
| Ovid | Metamorphoses (1.1-1.778) |
| Petronius | Satyricon (26-78) |
| Phaedrus | Fabulae (1-3) |
| Propertius | Elegiae (1.1-1.22) |
| Sallust | Bellum Catilinae |
| Suetonius | Life of Augustus (1-55) |
| Tacitus | Historiae (1.1.21) |
This selection severely undermines the validity of any conclusions that can be drawn from the results. Mixing poetry and prose is obviously a problem. But it also looks like the longest text would be the Vulgate, which is a much later text. Although changes in the use of cases over time might not be too much of a problem, it's something that should be investigated.
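The worry about one long text dominating can be illustrated with a short sketch that compares a pooled percentage (all tokens counted together) with the plain average of per-text percentages. The counts below are toy numbers invented for the example, not measurements from any of the texts listed, and the function names are hypothetical.

```python
def pooled_share(counts_by_text, case):
    """Percentage of `case` when all texts are counted as one big sample."""
    total = sum(sum(counts.values()) for counts in counts_by_text.values())
    hits = sum(counts.get(case, 0) for counts in counts_by_text.values())
    return 100 * hits / total

def mean_of_text_shares(counts_by_text, case):
    """Average of the per-text percentages, so every text weighs the same."""
    shares = [100 * counts.get(case, 0) / sum(counts.values())
              for counts in counts_by_text.values()]
    return sum(shares) / len(shares)

# Toy counts (invented): one long text and two short ones.
toy = {
    "long_text": {"accusative": 800, "ablative": 200},
    "short_text_a": {"accusative": 20, "ablative": 80},
    "short_text_b": {"accusative": 30, "ablative": 70},
}

pooled = pooled_share(toy, "ablative")        # dominated by long_text
averaged = mean_of_text_shares(toy, "ablative")  # each text counts once
```

In this toy case the pooled figure sits close to the long text's own share, while the per-text average is pulled toward the two short texts. Neither number is wrong; they answer different questions about the sample.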
I need to correct the opinion of @seneca2008.
In corpus linguistics, every calculation is correct only in relation to the chosen corpus. One could collect the legal texts of Diocletian and beat my calculations with new numbers, but those results would likewise be correct only for the chosen texts.
It’s not an easy task to build a balanced corpus representing the Latin language as a whole. One person may choose only prose, but another will say that Latin literature is incomplete without the poems of Horace. One solution could be to choose works of equal size from different genres, such as prose, poetry, legal, religious, graffiti, and so on. But it could be objected that only a small part of the Roman population used “legal” language, and that the corpus should be balanced according to the role of the different genres in real use, with more spoken and less formal material. You could exclude the Vulgate, but the result would then represent frequencies in a certain period: Old Latin, Golden Age, Silver Age, or whatever period you think is “good Latin”.
There are different corpora of the English language, and all of them are valid (in their sphere of application):
- Corpus of Contemporary American English
- Corpus of Historical American English
- Corpus of US Supreme Court Opinions (legal texts)
- Hansard Corpus (political)
- Corpus of American Soap Operas
I think that the problem is that I don't fully understand what you are trying to do. We only have a tiny proportion of the Latin texts which existed in Antiquity. Those that we have are mostly in the highest literary register. If you are trying to draw inferences about the Latin used by the majority of Latin speakers, then I don't think this method of analysis will be very helpful.
It is not about trying to “beat [your] calculations”; it is about trying to understand what your numbers mean. Latin grammatical discussions differentiate between poetic, colloquial and literary prose usage. The proportion of these types of Latin in any sample one takes will influence the meaning of the results. I would stress it's not about the “validity” of a language type but about the inferences which might be drawn from any sample you choose.
I was drawn to this thread in part because Ronolio observed that “It would appear I was mistaken about genitive being the least frequent. So much for the Cambridge Latin Course’s claim to introduce the cases by frequency”.
Without more information about what texts the Cambridge Course based its statement on, it's not possible to compare your results with theirs. Different samples will probably yield different results.
I looked at the bibliography in the 4th Edition Cambridge Latin Course Unit IV Teacher’s Edition, and it would appear that the authors used were (with no indication of prevalence) Catullus, Cicero, Horace, Livy, Martial, Ovid, Petronius, Pliny, and Vergil.
The story line runs from Pompeii just prior to the eruption of Vesuvius (the family father is modeled on Lucius Caecilius Iucundus, the Pompeian banker possibly linked to a bronze bust, who incidentally shares the praenomen and nomen of Pliny the Younger’s father, Lucius Caecilius Cilo), to Britain during the time of Agricola, to Rome during the middle to later parts of Domitian’s reign. Given the story range, the use of Pliny is unsurprising, as is the omission of Tacitus, given the level. I did find the omission of Suetonius odd, though, as he is generally viewed as a moderately easy author.