Has anyone here worked with the data files available for download over at Perseus? I downloaded several of them and took a quick look. The notes say they are XML files, but when I open one in Firefox (as an XML viewer) it won’t parse. When I open it in a text editor, the structure looks messed up and a lot of the data is malformed.
Have you looked at Diogenes? It’s an open-source program that uses Perl and does a lot of stuff with the Perseus dictionaries. Might give you some insight into the best way to access the Perseus data.
Before I decided on the wiki format for Scholiastae.org, I spent some quality time with the Perseus XML. I tried running it through a validating XML parser. That was not a happy experience, mostly due to missing DTD files (local and remote both). Missing entity definitions made me give up on using data that way. Now when I pilfer an open text from Perseus, I tweak a crude, hand-rolled XML parsing engine to rip out only the bits I want.
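For anyone wondering what a crude workaround of this kind can look like, here is a minimal Python sketch. It assumes the main obstacle is named entities whose definitions live in the missing DTDs, and it sidesteps them by escaping every unknown entity before parsing. The function names and the sample tag `l` are invented for illustration; this is not the actual parser used for Scholiastae.org, and real Perseus files may need more cleanup than this.

```python
import re
import xml.etree.ElementTree as ET

# The five entities every XML parser knows without a DTD.
PREDEFINED = {"amp", "lt", "gt", "quot", "apos"}

# Matches named entity references such as &mdash; (numeric ones like &#8212; are left alone).
ENTITY_RE = re.compile(r"&([A-Za-z][\w.-]*);")


def neutralize_entities(xml_text):
    """Escape named entities that would be undefined without the DTD.

    &mdash; becomes &amp;mdash;, so the parser sees the literal text "&mdash;"
    instead of failing with an "undefined entity" error.
    """
    def repl(match):
        name = match.group(1)
        if name in PREDEFINED:
            return match.group(0)
        return "&amp;" + name + ";"

    return ENTITY_RE.sub(repl, xml_text)


def extract_text(xml_text, tag):
    """Return the text content of every element named `tag` (hypothetical helper)."""
    root = ET.fromstring(neutralize_entities(xml_text))
    return ["".join(element.itertext()) for element in root.iter(tag)]


if __name__ == "__main__":
    sample = "<text><l>arma &mdash; virum</l><l>a &amp; b</l></text>"
    for line in extract_text(sample, "l"):
        print(line)
```

The unknown entities survive as literal text such as `&mdash;`, so they can be mapped to real characters later with a lookup table once you know which ones a given file uses.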
Diogenes interacts nicely with the dictionaries, but it does not really understand the Perseus XML format — just the scary record-based file system that the TLG continues to use. I have it in my mind that someone is funding Heslin to make Diogenes work with Perseus data, too, but I cannot find proof of that right now.
It sounds like I’m in for some pain. Ideally, I would like to convert the Perseus morphological data to a friendlier JSON record set. My understanding of Diogenes is that it sits on top of the TLG, which is licensed data. I don’t have the license, and I don’t yet really understand how Diogenes pulls in Perseus; I’m assuming it is through an API. If Mr. Heslin has plans to tackle Perseus, that would be wonderful.
You might have better luck interacting with the MySQL dumps. At the very least, you could import those into MySQL, then query the data back out and dump it into a more agreeable format.
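The "query it back out and dump it" step could be sketched roughly like this in Python. Big caveats: the table and column names (`parses`, `form`, `lemma`, `morph_code`) and the sample row are made up, since I have not checked the actual schema of the dumps, and the example uses an in-memory SQLite database as a stand-in so it runs without a MySQL server. The `query_to_json` function relies only on the standard Python DB-API cursor interface, so a connection from a MySQL driver should be usable in its place, though placeholder syntax differs between drivers.

```python
import json
import sqlite3


def query_to_json(conn, sql, params=()):
    """Run a query and return the rows as a JSON array of objects.

    Column names come from cursor.description, which is part of the
    Python DB-API, so this is not specific to SQLite.
    """
    cur = conn.cursor()
    cur.execute(sql, params)
    columns = [col[0] for col in cur.description]
    records = [dict(zip(columns, row)) for row in cur.fetchall()]
    # ensure_ascii=False keeps Greek text readable instead of \u escapes.
    return json.dumps(records, ensure_ascii=False, indent=2)


def make_demo_db():
    """Build a tiny stand-in database with a hypothetical schema."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE parses (id INTEGER PRIMARY KEY, form TEXT, lemma TEXT, morph_code TEXT)"
    )
    conn.execute(
        "INSERT INTO parses (form, lemma, morph_code) VALUES (?, ?, ?)",
        ("λόγου", "λόγος", "noun sg masc gen"),
    )
    conn.commit()
    return conn


if __name__ == "__main__":
    conn = make_demo_db()
    print(query_to_json(conn, "SELECT form, lemma, morph_code FROM parses ORDER BY id"))
```

Going through a database first means the messy parsing is someone else's problem, and you can reshape the records with plain SQL before exporting.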