Electronic vocabulary database to search for patterns

Hi,
I would like to search a database automatically for patterns such as irregularities (perfect passive forms of verbs with stems in -ν, or the length of the υ in verbs in -ύω) and compile a full list of them, because nobody seems able to give me one and I want them ALL xD. I could then post these lists here in the forum.
Of course a list can only be as complete as the database is, so I took a PDF of the largest dictionary I know of, Liddell and Scott’s, and tried to extract the text from it. However, all the tools I tried, such as pypdf and some websites, failed and mixed up letters, so that e.g. “ἀγαπάω” became “ἀγαπάψ” (sic). I could try to find patterns in the errors, but I don’t know whether there are any or whether this is just random. Also, as far as I know about Unicode, the letters that are not needed in Modern Greek, especially the polytonic ones, were added later and separately. Since omega and psi are plain letters that Modern Greek uses too, the tool would probably confuse them in Modern Greek text as well.
I do not know enough Modern Greek to google it in Modern Greek (if the problem also occurs in Modern Greek, the people who encounter it are likely to discuss it in Modern Greek).
I like the idea of extracting the text of a PDF because it would make my lists independent of any single database or dictionary in which a word might be missing.
However, if someone knows another kind of database I could search, that would be fine too.
If someone has run into the same problem of handling Ancient Greek in PDFs, maybe they can help me, if they want.
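
On the question of whether the letter mix-ups are systematic or random, a rough check is to compare a few words typed by hand with what the extraction tool returned and count which letter was replaced by which. The following is only a sketch: the function name `substitution_counts` and the sample pairs are made up for illustration, and it skips any pair whose lengths differ.

```python
from collections import Counter
import unicodedata


def substitution_counts(pairs):
    """Count (expected, extracted) character substitutions over word pairs."""
    counts = Counter()
    for expected, extracted in pairs:
        # Normalise both sides so that equivalent encodings of the same
        # accented letter are not reported as differences.
        a = unicodedata.normalize("NFC", expected)
        b = unicodedata.normalize("NFC", extracted)
        if len(a) != len(b):
            continue  # this sketch only handles one-for-one substitutions
        for x, y in zip(a, b):
            if x != y:
                counts[(x, y)] += 1
    return counts


# Hypothetical sample: hand-typed word vs. what the tool produced.
pairs = [("ἀγαπάω", "ἀγαπάψ"), ("λύω", "λύψ"), ("λόγος", "λόγος")]
counts = substitution_counts(pairs)
for (expected_char, extracted_char), n in counts.most_common():
    print(expected_char, "->", extracted_char, n)
```

If the same few substitutions account for nearly all differences, the errors are probably systematic. If the counts are spread thinly over many different pairs, they are more likely noise.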


Χαίρετε
Ἀπὸ ταὐτομάτου ὑπολογιστρῷ ἐκ συναγωγῆς ήλεκτρικῆς λέξεις κατ’εἶδος ἐξερευνῶν ἐθέλω ἀνομίας (ὡς τὰ παθητικὰ παρακείμενα ῥιζῶν εἰς -ν ῥήματα ἢ το μῆκος τοῦ ὖ τῶν εἰς ύω ῥήματα)καταλέγων, ὡς δὲ εἰς κατάλογον ἐντελῆ συνεργασόμενος, ὅτι οὐδεὶς ἄλλος τοιοῦτον παρέρχειν δύναται καὶ ΠΑΣΑΣ βούλομαι xD κἄτι τὰσδε τῇδε ὑμῖν παρέχοιμι.
Δηλονότι μὲν ὁ καταλογος μόνον τοιοῦτόν γε ἐντελὴς οἷον ἡ συναγωγή. Οὐκοῦν δὲ pdf τῆς μεγίστης ὑπ’εμοῦ γνωρίμου συναγωγῆς βιβλικῆς ἑλών καὶ τὰ στοιχεῖα μὲμ εκδιαττᾶν πειρήσας, πάντα δὲ κεχρημένα (ὡς pypdf ἢ τοποθεσίαι τινές) ἥμαρτεν καὶ τὰ ἄλλα στοιχεῖα συνέμειξε ἄλλοις (οῖα το ἀγαπάψ ἀντὶ τοῦ ἀγαπάω). Δυνατὸν νόμους τῆς συμμίξεως εὑρεῖν, ἀλλ’οὐκ οἶδα εἰ εἴσιν ἢ μειγνύασι τωὐτόματον. Καὶ καθ’ὃ ἐπίσταμαι τὰ ἐπὶ τὴν νέαν ἑλληνικὴν ἄχρηστα στοιχεῖα, μάλιστα τὰ πολυτονικὰ, ὕστερα προστέθειται, ὥστε δυνατὸν τὸ ψεῖ τῷ ὦ καὶ νεοελληνιστὶ μείγνυται. Ἐπεὶ ἱκανῶς ἀγνοῶ τὴν νέαν, μὴ google ζητήτρῳ ἐρωτῴην (,ὅτι εἰ γίγνοιτο λόγοις νεοελληνικοῖς ἀνθρώπους ἐμπεσόντας εἰκὸς διαλέγειν νεοελληνιστί).
Ἐκδιαττῶν δὴ pdf λόγων δοκεῖ μοι καλόν, ὅτι ἀσχτετοῖ τοὺς ἐμοὺς διαλόγους ἐξ ἀτελείας ἢ χήτεος λέξεων ἐν συναγωγαῖς τισιν.
Ἀλλ’εἴ τις γιγνώσκει ἄλλον τι τρόπον συναγωγῆς ἠλεκτρικῆς πλὴν βίβλου pdf, μὲ ὠφελοίη εἰ βούλεται.

Try this instead of pdf extraction: https://archive.org/details/Lsj--LiddellScott

Alternatively, you could use Python to extract this information from the LSJ directly. You can find the LSJ in XML format at various locations; try searching GitHub.
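
As a rough illustration of the Python route, the sketch below collects headwords with a given ending from an XML string. The element names `entryFree` and `orth` are assumptions about TEI-style dictionary markup and may need adjusting to whatever the file you find actually uses. The helper names and the tiny `SAMPLE` document are invented for the example.

```python
import unicodedata
import xml.etree.ElementTree as ET


def strip_diacritics(word):
    """Remove accents, breathings and other combining marks from a word."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def headwords_ending_in(xml_text, ending, entry_tag="entryFree", orth_tag="orth"):
    """Return headwords whose unaccented form ends in `ending`.

    The tag names are assumptions, not a confirmed description of the LSJ files.
    """
    root = ET.fromstring(xml_text)
    found = []
    for entry in root.iter(entry_tag):
        orth = entry.find(f".//{orth_tag}")
        if orth is None or not orth.text:
            continue
        head = orth.text.strip()
        if strip_diacritics(head).lower().endswith(ending):
            found.append(head)
    return found


# Made-up miniature document standing in for the real dictionary file.
SAMPLE = """<text><body>
<entryFree><orth>λύω</orth><sense>loose</sense></entryFree>
<entryFree><orth>ἀγαπάω</orth><sense>love</sense></entryFree>
<entryFree><orth>κωλύω</orth><sense>hinder</sense></entryFree>
</body></text>"""

result = headwords_ending_in(SAMPLE, "υω")
print(result)
```

Comparing the unaccented form means the search does not depend on which Unicode code point the file uses for the accented upsilon.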

Thank you very much.
I could download a tarball and use it as a text file, and I was already able to select the ύω-verbs!!

That's why it wouldn't make a difference to use XML as long as the description in each word record is not ordered better. It would generally be nicer if e.g. the forms of a verb were ordered and not scattered randomly through the text, but that does not seem to be how the LSJ was compiled.
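
Since the forms are not ordered within an entry, one possible approach for the upsilon-length list is to scan the whole entry text for quantity marks. The sketch below assumes that the text file marks vowel length with a macron or breve on the upsilon (for example ῡ or ῠ), which would need checking against the actual file. The function name and the sample string are hypothetical.

```python
import unicodedata

MACRON = "\u0304"  # combining macron, marks a long vowel
BREVE = "\u0306"   # combining breve, marks a short vowel


def upsilon_quantities(text):
    """Return 'long' or 'short' for every upsilon carrying a macron or breve."""
    decomposed = unicodedata.normalize("NFD", text)
    labels = []
    for i, ch in enumerate(decomposed):
        if ch not in "υΥ":
            continue
        # Walk over the combining marks attached to this upsilon.
        j = i + 1
        while j < len(decomposed) and unicodedata.combining(decomposed[j]):
            if decomposed[j] == MACRON:
                labels.append("long")
            elif decomposed[j] == BREVE:
                labels.append("short")
            j += 1
    return labels


# Hypothetical entry text: a long upsilon in the headword, a short one in a note.
sample_entry = "λ\u1fe1ω, fut. λύσω [\u1fe0 in some forms]"
print(upsilon_quantities(sample_entry))
```

An upsilon with only an accent, as in λύσω, yields no label, so entries without explicit marks would still need checking by hand.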