Perseus Project Data Set - XML?

Jeff_Tirey · January 20, 2011, 6:21pm

Has anyone here worked with the data files available for download over at Perseus? I downloaded several of them and took a quick look. The notes say they are XML files, but when I open one in an XML viewer (Firefox) it won’t parse. When I open it in a text editor, it looks like the structure is messed up and lots of data is malformed.

http://www.perseus.tufts.edu/hopper/opensource/download

I’m trying to discover all the data sets out there that cover Greek and Latin word frequency and lexemes.

Jeff

Lex · January 21, 2011, 1:09am

I Googled this entity (&Perseus.publish;) that Firefox chokes on, and got these:

http://www.stoa.org/markup/

http://www.stoa.org/markup/how_markup.shtml

I suspect the files are more SGML than pure XML.
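If undefined entities like that one are the main obstacle, one possible workaround is to neutralize them before handing the text to a normal XML parser. The following is only a rough sketch under that assumption: the sample markup and the function names are made up for illustration, and it deals only with named entity references in the body of a file, not with missing DTDs or anything else that might be wrong.

```python
import re
import xml.etree.ElementTree as ET

# The five entities XML predefines; these must be left untouched.
PREDEFINED = {"amp", "lt", "gt", "quot", "apos"}

# Named entity references such as &Perseus.publish; (numeric ones like &#x3b1; do not match).
ENTITY_RE = re.compile(r"&([A-Za-z_][\w.\-]*);")


def neutralize_entities(text):
    """Replace every undefined named entity reference with a visible placeholder."""
    def repl(match):
        name = match.group(1)
        if name in PREDEFINED:
            return match.group(0)
        return "[" + name + "]"
    return ENTITY_RE.sub(repl, text)


def parse_lenient(text):
    """Parse XML text after neutralizing entities the parser would reject."""
    return ET.fromstring(neutralize_entities(text))


# Hypothetical sample, not copied from a real Perseus file.
sample = "<text><note>&Perseus.publish;</note><p>a &amp; b</p></text>"
root = parse_lenient(sample)
print(root.find("note").text)
print(root.find("p").text)
```

The placeholder keeps the entity name visible, so the references can still be found later and mapped to real values if the definitions ever turn up.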

Jeff_Tirey · January 21, 2011, 3:12am

You’re awesome, Lex - thanks so much.

Now I’m exploring SGML to JSON conversion.

edonnelly · January 21, 2011, 1:25pm

Have you looked at Diogenes? It’s an open-source program that uses Perl and does a lot of stuff with the Perseus dictionaries. Might give you some insight into the best way to access the Perseus data.

Jeff_Tirey · January 21, 2011, 4:26pm

Hi edonnelly,

No, I haven’t. I’m checking this out now and it looks very promising. Thanks!

Jeff

annis · January 24, 2011, 3:57pm

Oh, man. φεῦ φεῦ!

Before I decided on the wiki format for Scholiastae.org, I spent some quality time with the Perseus XML. I tried running it through a validating XML parser. That was not a happy experience, mostly due to missing DTD files (local and remote both). Missing entity definitions made me give up on using data that way. Now when I pilfer an open text from Perseus, I tweak a crude, hand-rolled XML parsing engine to rip out only the bits I want.
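For anyone curious what a crude extractor of that kind might look like, here is a minimal sketch. It is not the actual code described above: the element names and the sample text are hypothetical, and a regex approach like this assumes the wanted element never nests inside itself.

```python
import re


def extract_elements(text, tag):
    """Return the text of every <tag>...</tag>, with any inner markup stripped."""
    name = re.escape(tag)
    # \b keeps <l ...> from also matching longer names such as <lb/>.
    pattern = re.compile(r"<%s\b[^>]*>(.*?)</%s>" % (name, name), re.DOTALL)
    return [re.sub(r"<[^>]+>", "", body).strip() for body in pattern.findall(text)]


# Hypothetical sample; no XML parser is involved, so the undefined entity is harmless.
sample = (
    '<body><l n="1">arma <name>virumque</name> cano</l><lb/>\n'
    '<l n="2">Italiam &Perseus.publish; fato</l></body>'
)
lines = extract_elements(sample, "l")
print(lines)
```

Because nothing here validates the document, missing DTDs and entity definitions never come into play, at the cost of breaking on any markup the pattern does not anticipate.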

Diogenes interacts nicely with the dictionaries, but it does not really understand the Perseus XML format — just the scary record-based file system that the TLG continues to use. I have it in my mind that someone is funding Heslin to make Diogenes work with Perseus data, too, but I cannot find proof of that right now.

Jeff_Tirey · January 26, 2011, 3:35pm

Hi Will,

It’s really great hearing from you!!

It sounds like I’m in for some pain. Ideally, I would like to convert the Perseus morphological data to a more friendly JSON record set. My understanding of Diogenes is that it sits on top of the TLG, which is licensed data. I don’t have the license, and I don’t yet really understand how Perseus is drawn into Diogenes; I’m assuming through an API. If Mr. Heslin has plans to tackle Perseus, that would be wonderful.

Jeff

annis · January 28, 2011, 2:03am

You might have better luck interacting with the MySQL dumps. At the very least, you could import those into MySQL, then query the data back out and dump it into a more agreeable format.
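The query-and-dump half of that could be sketched roughly as follows. This is an illustration only: the table and column names are invented, not taken from the Perseus dumps, and sqlite3 stands in for MySQL so the example runs without a server. With a real import you would get the cursor from a MySQL driver instead.

```python
import json
import sqlite3


def rows_as_json(cursor, query):
    """Run a query on a DB-API cursor and return the rows as a JSON array of objects."""
    cursor.execute(query)
    columns = [col[0] for col in cursor.description]
    records = [dict(zip(columns, row)) for row in cursor.fetchall()]
    # ensure_ascii=False keeps Greek text readable instead of \u escapes.
    return json.dumps(records, ensure_ascii=False, indent=2)


# sqlite3 is a stand-in for MySQL; the schema below is hypothetical.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE lemma_counts (lemma TEXT, lang TEXT, freq INTEGER)")
cur.executemany(
    "INSERT INTO lemma_counts VALUES (?, ?, ?)",
    [("logos", "greek", 120), ("amo", "latin", 45)],
)
output = rows_as_json(cur, "SELECT lemma, lang, freq FROM lemma_counts ORDER BY freq DESC")
print(output)
conn.close()
```

Reading column names from `cursor.description` means the same function works for any table, so each one can be dumped without writing per-table code.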