
Perseus Project Data Set - XML?

Post by Jeff Tirey » Thu Jan 20, 2011 6:21 pm

Has anyone here worked with the data files available for download over at Perseus? I downloaded several of them and took a quick look. The notes say they are XML files, but when I open one in an XML viewer (Firefox) it won't parse. When I open it in a text editor, it looks like the structure is messed up and lots of the data is malformed.

http://www.perseus.tufts.edu/hopper/opensource/download

I'm trying to discover all the data sets out there that cover Greek and Latin word frequency and lexeme data.


Jeff
Textkit Founder
Jeff Tirey
Administrator
Posts: 891
Joined: Wed Aug 14, 2002 6:58 pm
Location: Strongsville, Ohio

Re: Perseus Project Data Set - XML?

Post by Lex » Fri Jan 21, 2011 1:09 am

I Googled this entity (&Perseus.publish;) that Firefox chokes on, and got these:

http://www.stoa.org/markup/

http://www.stoa.org/markup/how_markup.shtml

I suspect the files are more SGML than pure XML.
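To see what "chokes on" means in practice, here is a minimal Python sketch. The sample fragment and its element names are invented for illustration; only the entity name &Perseus.publish; comes from this thread. A strict XML parser has no definition for that entity, so it should reject the input instead of guessing.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment: the element names are made up; only the
# entity name &Perseus.publish; is taken from the thread.
SAMPLE = "<header><note>&Perseus.publish;</note></header>"


def try_parse(text):
    """Return (True, root) if text parses as XML, else (False, error message)."""
    try:
        return True, ET.fromstring(text)
    except ET.ParseError as err:
        return False, str(err)


ok, result = try_parse(SAMPLE)
print(ok, result)
```

The same file would parse if the entity were declared in a DTD that the parser could actually read, which fits the SGML-style markup described on the stoa.org pages.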
I, Lex Llama, super genius, will one day rule this planet! And then you'll rue the day you messed with me, you damned dirty apes!
Lex
Textkit Zealot
 
Posts: 732
Joined: Thu Apr 24, 2003 6:34 pm
Location: A top-secret underground llama lair.

Re: Perseus Project Data Set - XML?

Post by Jeff Tirey » Fri Jan 21, 2011 3:12 am

You're awesome, Lex - thanks so much.

Now I'm exploring SGML to JSON conversion.
Textkit Founder
Jeff Tirey
Administrator
Posts: 891
Joined: Wed Aug 14, 2002 6:58 pm
Location: Strongsville, Ohio

Re: Perseus Project Data Set - XML?

Post by edonnelly » Fri Jan 21, 2011 1:25 pm

Jeff Tirey wrote: Has anyone here worked with the data files available for download over at Perseus? I downloaded several of them and took a quick look. The notes say they are XML files, but when I open one in an XML viewer (Firefox) it won't parse. When I open it in a text editor, it looks like the structure is messed up and lots of the data is malformed.

http://www.perseus.tufts.edu/hopper/opensource/download

I'm trying to discover all data sets out there that cover Greek and Latin word frequency and lexeme


Jeff


Have you looked at Diogenes? It's an open-source program that uses Perl and does a lot of stuff with the Perseus dictionaries. Might give you some insight into the best way to access the Perseus data.
The lists:
G'Oogle and the Internet Pharrchive - 1100 or so free Latin and Greek books.
DownLOEBables - Free books from the Loeb Classical Library
edonnelly
Textkit Zealot
 
Posts: 959
Joined: Sun Jan 16, 2005 2:47 am
Location: Music City, USA

Re: Perseus Project Data Set - XML?

Post by Jeff Tirey » Fri Jan 21, 2011 4:26 pm

Hi edonnelly,

No, I haven't. I'm checking it out now, and it looks very promising. Thanks!

Jeff
Textkit Founder
Jeff Tirey
Administrator
Posts: 891
Joined: Wed Aug 14, 2002 6:58 pm
Location: Strongsville, Ohio

Re: Perseus Project Data Set - XML?

Post by annis » Mon Jan 24, 2011 3:57 pm

Oh, man. φεῦ φεῦ!

Before I decided on the wiki format for Scholiastae.org, I spent some quality time with the Perseus XML. I tried running it through a validating XML parser. That was not a happy experience, mostly due to missing DTD files (local and remote both). Missing entity definitions made me give up on using data that way. Now when I pilfer an open text from Perseus, I tweak a crude, hand-rolled XML parsing engine to rip out only the bits I want.
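For anyone who wants to try the same kind of workaround, here is a rough Python sketch. It is not the hand-rolled engine described above, just one possible approach: swap every named entity that has no definition for a visible placeholder, then let a normal parser pull out the elements of interest. The sample markup, the placeholder style, and the function names are all assumptions made for the example.

```python
import re
import xml.etree.ElementTree as ET

# The five entities every XML parser knows without a DTD.
PREDEFINED = {"amp", "lt", "gt", "quot", "apos"}

# Named entity references only; numeric ones like &#945; are left alone.
ENTITY_RE = re.compile(r"&([A-Za-z_][\w.\-]*);")


def neutralize_entities(text):
    """Replace named entities that lack a definition with a [name] placeholder."""
    def repl(match):
        name = match.group(1)
        if name in PREDEFINED:
            return match.group(0)  # keep &amp; and friends as they are
        return "[" + name + "]"
    return ENTITY_RE.sub(repl, text)


def extract_text(text, tag):
    """Parse the cleaned text and return the text content of every <tag> element."""
    root = ET.fromstring(neutralize_entities(text))
    return ["".join(el.itertext()) for el in root.iter(tag)]


# Hypothetical sample; real Perseus files are much larger and use other tags.
SAMPLE = "<text><p>&Perseus.publish; A &amp; B</p><p>second</p></text>"
print(extract_text(SAMPLE, "p"))
```

The placeholder keeps the entity name visible in the output, so nothing is dropped silently; mapping names to real replacement text would need the missing entity definitions.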

Diogenes interacts nicely with the dictionaries, but it does not really understand the Perseus XML format — just the scary record-based file system that the TLG continues to use. I have it in my mind that someone is funding Heslin to make Diogenes work with Perseus data, too, but I cannot find proof of that right now.
William S. Annis - http://www.aoidoi.org/ - http://www.scholiastae.org/
τίς πατέρ' αἰνήσει εἰ μὴ κακοδαίμονες υἱοί;
annis
Textkit Zealot
 
Posts: 3397
Joined: Fri Jan 03, 2003 4:55 pm
Location: Madison, WI, USA

Re: Perseus Project Data Set - XML?

Post by Jeff Tirey » Wed Jan 26, 2011 3:35 pm

Hi Will,

It's really great hearing from you!!

It sounds like I'm in for some pain. Ideally, I would like to convert the Perseus morphological data to a friendlier JSON record set. My understanding of Diogenes is that it sits on top of the TLG, which is licensed data. I don't have the license, and I don't yet really understand how Perseus data is pulled into Diogenes; I'm assuming through an API. If Mr. Heslin has plans to tackle Perseus, that would be wonderful.

Jeff
Textkit Founder
Jeff Tirey
Administrator
Posts: 891
Joined: Wed Aug 14, 2002 6:58 pm
Location: Strongsville, Ohio

Re: Perseus Project Data Set - XML?

Post by annis » Fri Jan 28, 2011 2:03 am

You might have better luck interacting with the MySQL dumps. At the very least, you could import those into MySQL, then query the data back out and dump it into a more agreeable format.
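The query-then-dump step could look roughly like the Python sketch below. To keep it runnable without a database server, it uses an in-memory SQLite database as a stand-in for MySQL. The table name lemma_counts, its columns, and the sample rows (including the counts) are all invented for illustration and are not the real Perseus schema or data.

```python
import json
import sqlite3


def rows_to_json(conn, query):
    """Run a query and return the result rows as a JSON array of objects."""
    conn.row_factory = sqlite3.Row  # lets each row be turned into a dict
    rows = conn.execute(query).fetchall()
    return json.dumps([dict(row) for row in rows], ensure_ascii=False, indent=2)


# Stand-in database: hypothetical table and made-up rows, not Perseus data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE lemma_counts (lemma TEXT, lang TEXT, freq INTEGER)")
conn.executemany(
    "INSERT INTO lemma_counts VALUES (?, ?, ?)",
    [("λόγος", "greek", 3), ("amo", "latin", 5)],
)

out = rows_to_json(conn, "SELECT lemma, lang, freq FROM lemma_counts ORDER BY freq DESC")
print(out)
```

With a real MySQL import, the connection would come from a MySQL driver instead of sqlite3, and the query would use whatever tables the dump actually contains; the JSON step stays the same.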
William S. Annis - http://www.aoidoi.org/ - http://www.scholiastae.org/
τίς πατέρ' αἰνήσει εἰ μὴ κακοδαίμονες υἱοί;
annis
Textkit Zealot
 
Posts: 3397
Joined: Fri Jan 03, 2003 4:55 pm
Location: Madison, WI, USA

