Perseus Project Data Set - XML?

Jeff Tirey
Administrator
Posts: 896
Joined: Wed Aug 14, 2002 6:58 pm
Location: Strongsville, Ohio

Perseus Project Data Set - XML?

Post by Jeff Tirey »

Has anyone here worked with the data files available for download over at Perseus? I downloaded several of them and took a quick look. The notes say they are XML files, but when I open one in an XML editor (Firefox) it won't parse. When I open it in a text editor, it looks like the structure is messed up and lots of data is malformed.

http://www.perseus.tufts.edu/hopper/opensource/download

I'm trying to discover all the data sets out there that cover Greek and Latin word frequency and lexemes.


Jeff
Textkit Founder

Lex
Textkit Zealot
Posts: 732
Joined: Thu Apr 24, 2003 6:34 pm
Location: A top-secret underground llama lair.

Re: Perseus Project Data Set - XML?

Post by Lex »

I Googled the entity that Firefox chokes on (&Perseus.publish;) and got these:

http://www.stoa.org/markup/

http://www.stoa.org/markup/how_markup.shtml

I suspect the files are more SGML than pure XML.
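
For anyone hitting the same parse error, here is a minimal Python sketch of one workaround: swap every undeclared named entity for a visible placeholder before handing the text to a parser. Only the entity name &Perseus.publish; comes from this thread. The sample markup, the function name `neutralize_entities`, and the `[[name]]` placeholder style are assumptions made up for illustration, and real Perseus files may have other problems this does not address.

```python
import re
import xml.etree.ElementTree as ET

# The five entities an XML parser understands without any DTD.
PREDEFINED = {"amp", "lt", "gt", "quot", "apos"}

# Named entity references such as &Perseus.publish;
# Numeric references like &#945; do not match and are left alone.
ENTITY_RE = re.compile(r"&([A-Za-z_][\w.\-]*);")


def neutralize_entities(text):
    """Replace undeclared named entities with a [[name]] placeholder."""
    def repl(match):
        name = match.group(1)
        if name in PREDEFINED:
            return match.group(0)  # keep &amp; and friends untouched
        return "[[" + name + "]]"
    return ENTITY_RE.sub(repl, text)


# Made-up sample, not copied from a real Perseus file.
sample = (
    "<TEI.2><teiHeader>&Perseus.publish;</teiHeader>"
    "<text><p>a &amp; b</p></text></TEI.2>"
)

root = ET.fromstring(neutralize_entities(sample))
header_text = root.find("teiHeader").text
paragraph_text = root.find("text/p").text
```

The placeholder keeps a record of where each entity stood, so the references can be resolved later if the matching DTD or entity files turn up.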
I, Lex Llama, super genius, will one day rule this planet! And then you'll rue the day you messed with me, you damned dirty apes!

Jeff Tirey
Administrator
Posts: 896
Joined: Wed Aug 14, 2002 6:58 pm
Location: Strongsville, Ohio

Re: Perseus Project Data Set - XML?

Post by Jeff Tirey »

You're awesome, Lex - thanks so much.

Now I'm exploring SGML to JSON conversion.
Textkit Founder

edonnelly
Administrator
Posts: 989
Joined: Sun Jan 16, 2005 2:47 am
Location: Music City, USA

Re: Perseus Project Data Set - XML?

Post by edonnelly »

Jeff Tirey wrote: Has anyone here worked with the data files available for download over at Perseus? I downloaded several of them and took a quick look. The notes say they are XML files, but when I open one in an XML editor (Firefox) it won't parse. When I open it in a text editor, it looks like the structure is messed up and lots of data is malformed.

http://www.perseus.tufts.edu/hopper/opensource/download

I'm trying to discover all data sets out there that cover Greek and Latin word frequency and lexeme


Jeff
Have you looked at Diogenes? It's an open-source program that uses Perl and does a lot of stuff with the Perseus dictionaries. Might give you some insight into the best way to access the Perseus data.
The lists:
G'Oogle and the Internet Pharrchive - 1100 or so free Latin and Greek books.
DownLOEBables - Free books from the Loeb Classical Library

Jeff Tirey
Administrator
Posts: 896
Joined: Wed Aug 14, 2002 6:58 pm
Location: Strongsville, Ohio

Re: Perseus Project Data Set - XML?

Post by Jeff Tirey »

Hi edonnelly,

No, I haven't. I'm checking it out now and it looks very promising. Thanks!

Jeff
Textkit Founder

annis
Textkit Zealot
Posts: 3399
Joined: Fri Jan 03, 2003 4:55 pm
Location: Madison, WI, USA

Re: Perseus Project Data Set - XML?

Post by annis »

Oh, man. φεῦ φεῦ!

Before I decided on the wiki format for Scholiastae.org, I spent some quality time with the Perseus XML. I tried running it through a validating XML parser. That was not a happy experience, mostly due to missing DTD files (local and remote both). Missing entity definitions made me give up on using data that way. Now when I pilfer an open text from Perseus, I tweak a crude, hand-rolled XML parsing engine to rip out only the bits I want.
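
A rough Python sketch of what a crude, hand-rolled extraction like this could look like: it never builds a document tree, so missing DTDs and undefined entities do not matter. The tag name `l`, the sample markup, and the helper name `extract_elements` are assumptions for illustration only. The regular expression does not cope with an element nested inside another element of the same name.

```python
import re


def extract_elements(markup, tag):
    """Return the text content of every <tag>...</tag> span in markup.

    Works on markup that is not well-formed as a whole, because it only
    looks for matching open/close pairs of one tag name.
    """
    # Opening tag with optional attributes, lazy body, closing tag.
    pattern = re.compile(
        r"<{0}(?:\s[^>]*)?>(.*?)</{0}>".format(re.escape(tag)),
        re.DOTALL,
    )
    any_tag = re.compile(r"<[^>]+>")
    results = []
    for match in pattern.finditer(markup):
        inner = any_tag.sub("", match.group(1))  # drop inline child tags
        results.append(" ".join(inner.split()))  # collapse whitespace
    return results


# Made-up sample: an undefined entity and an unclosed <lb> would both
# stop a strict XML parser, but they are simply skipped here.
sample = (
    "<text>&Perseus.publish;"
    '<l n="1">arma virumque <hi>cano</hi></l>\n'
    '<l n="2">Troiae  qui primus</l><lb>'
    "</text>"
)

lines = extract_elements(sample, "l")
```

Swapping the tag name for whichever element holds the wanted text is the only change needed per file, which is about as far as a throwaway extractor should go.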

Diogenes interacts nicely with the dictionaries, but it does not really understand the Perseus XML format — just the scary record-based file system that the TLG continues to use. I have it in my mind that someone is funding Heslin to make Diogenes work with Perseus data, too, but I cannot find proof of that right now.
William S. Annis - http://www.aoidoi.org/ - http://www.scholiastae.org/
τίς πατέρ' αἰνήσει εἰ μὴ κακοδαίμονες υἱοί;

Jeff Tirey
Administrator
Posts: 896
Joined: Wed Aug 14, 2002 6:58 pm
Location: Strongsville, Ohio

Re: Perseus Project Data Set - XML?

Post by Jeff Tirey »

Hi Will,

It's really great hearing from you!!

It sounds like I'm in for some pain. Ideally, I would like to convert the Perseus morphological data to a friendlier JSON record set. My understanding of Diogenes is that it sits on top of the TLG, which is licensed data. I don't have the license, and I don't yet really understand how Perseus data is pulled into Diogenes; I'm assuming through an API. If Mr. Heslin has plans to tackle Perseus, that would be wonderful.

Jeff
Textkit Founder

annis
Textkit Zealot
Posts: 3399
Joined: Fri Jan 03, 2003 4:55 pm
Location: Madison, WI, USA

Re: Perseus Project Data Set - XML?

Post by annis »

You might have better luck interacting with the MySQL dumps. At the very least, you could import those into MySQL, then query the data back out and dump it into a more agreeable format.
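
The import, query, and re-dump route could look roughly like the Python sketch below. It only covers the last step, turning query results into JSON, and it uses an in-memory SQLite database as a stand-in so that it runs without a MySQL server. The table name `parses`, its columns, and the sample rows are invented for illustration and are not the actual schema of the Perseus dumps. With a real MySQL import, any DB-API cursor should be usable the same way.

```python
import json
import sqlite3


def rows_to_json_lines(cursor):
    """Yield one JSON object per result row, keyed by column name."""
    columns = [description[0] for description in cursor.description]
    for row in cursor:
        # ensure_ascii=False keeps Greek text readable in the output.
        yield json.dumps(dict(zip(columns, row)), ensure_ascii=False)


# Stand-in database with a hypothetical morphology table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE parses (form TEXT, lemma TEXT, morph_code TEXT)")
conn.executemany(
    "INSERT INTO parses VALUES (?, ?, ?)",
    [
        ("rosam", "rosa", "n-s---fa-"),
        ("amat", "amo", "v3spia---"),
    ],
)

cursor = conn.execute(
    "SELECT form, lemma, morph_code FROM parses ORDER BY form"
)
json_lines = list(rows_to_json_lines(cursor))
conn.close()
```

Writing one JSON object per line keeps memory use flat on large tables, and the result can still be loaded record by record afterwards.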
William S. Annis - http://www.aoidoi.org/ - http://www.scholiastae.org/
τίς πατέρ' αἰνήσει εἰ μὴ κακοδαίμονες υἱοί;
