Has anyone here worked with the data files available for download over at Perseus? I downloaded several of them and took a quick look. The notes say they are XML files, but when I open one in an XML editor (Firefox) it won't parse. When I open it in a text editor, it looks like the structure is messed up and lots of data is malformed.
http://www.perseus.tufts.edu/hopper/opensource/download
I'm trying to discover all the data sets out there that cover Greek and Latin word frequency and lexemes.
Jeff
Perseus Project Data Set - XML?
- Jeff Tirey
- Administrator
- Posts: 896
- Joined: Wed Aug 14, 2002 6:58 pm
- Location: Strongsville, Ohio
Textkit Founder
- Lex
- Textkit Zealot
- Posts: 732
- Joined: Thu Apr 24, 2003 6:34 pm
- Location: A top-secret underground llama lair.
Re: Perseus Project Data Set - XML?
I Googled this entity (&Perseus.publish;) that Firefox chokes on, and got these:
http://www.stoa.org/markup/
http://www.stoa.org/markup/how_markup.shtml
I suspect the files are more SGML than pure XML.
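For anyone hitting the same wall, here is a minimal Python sketch of why a strict XML parser rejects a reference like &Perseus.publish; when no DTD defines it, plus one possible workaround. The sample string, the function name neutralize_entities, and the bracket placeholder are all made up for illustration; this is not how Perseus or Stoa handle the entity, just a way to get a file to parse.

```python
import re
import xml.etree.ElementTree as ET

# The five entities every XML parser knows without a DTD.
PREDEFINED = {"amp", "lt", "gt", "quot", "apos"}

# Named entity references only; numeric ones like &#955; are left alone.
ENTITY_RE = re.compile(r"&([A-Za-z_][\w.\-]*);")

def neutralize_entities(xml_text):
    """Replace entity references the parser cannot resolve with [name]."""
    def repl(match):
        name = match.group(1)
        if name in PREDEFINED:
            return match.group(0)
        return "[" + name + "]"
    return ENTITY_RE.sub(repl, xml_text)

# Hypothetical fragment, not copied from a real Perseus file.
sample = "<TEI.2><p>Status: &Perseus.publish; &amp; more</p></TEI.2>"

try:
    ET.fromstring(sample)
    strict_ok = True
except ET.ParseError:
    # The undefined entity makes the strict parse fail.
    strict_ok = False

root = ET.fromstring(neutralize_entities(sample))
text = root.find("p").text
```

The placeholder throws away whatever the entity was supposed to expand to, so this only helps if you care about the surrounding text and structure rather than the entity value itself.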
I, Lex Llama, super genius, will one day rule this planet! And then you'll rue the day you messed with me, you damned dirty apes!
- Jeff Tirey
- Administrator
- Posts: 896
- Joined: Wed Aug 14, 2002 6:58 pm
- Location: Strongsville, Ohio
Re: Perseus Project Data Set - XML?
You're awesome, Lex - thanks so much.
Now I'm exploring SGML to JSON conversion.
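As a rough idea of what that conversion could look like once a file parses as XML, here is a small Python sketch that turns an element tree into nested dictionaries and then into JSON. The element names in the sample (entry, orth, sense) and the output keys (tag, attrs, text, children) are assumptions chosen for the example, not the actual Perseus schema.

```python
import json
import xml.etree.ElementTree as ET

def element_to_dict(elem):
    """Recursively convert an element into a plain dict.

    Text that follows a child element (its "tail") is dropped,
    which is a simplification for mixed-content markup.
    """
    node = {"tag": elem.tag}
    if elem.attrib:
        node["attrs"] = dict(elem.attrib)
    text = (elem.text or "").strip()
    if text:
        node["text"] = text
    children = [element_to_dict(child) for child in elem]
    if children:
        node["children"] = children
    return node

# Hypothetical dictionary-style entry for illustration only.
sample = '<entry key="logos"><orth>λόγος</orth><sense n="1">word</sense></entry>'

record = element_to_dict(ET.fromstring(sample))
as_json = json.dumps(record, ensure_ascii=False, indent=2)
```

Passing ensure_ascii=False keeps the Greek readable in the output instead of turning it into \u escapes.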
Textkit Founder
- edonnelly
- Administrator
- Posts: 989
- Joined: Sun Jan 16, 2005 2:47 am
- Location: Music City, USA
Re: Perseus Project Data Set - XML?
Jeff Tirey wrote:
Has anyone here worked with the data files available for download over at Perseus? I downloaded several of them and took a quick look. The notes say they are XML files, but when I open one in an XML editor (Firefox) it won't parse. When I open it in a text editor, it looks like the structure is messed up and lots of data is malformed.
http://www.perseus.tufts.edu/hopper/opensource/download
I'm trying to discover all the data sets out there that cover Greek and Latin word frequency and lexemes.
Jeff

Have you looked at Diogenes? It's an open-source program that uses Perl and does a lot of stuff with the Perseus dictionaries. Might give you some insight into the best way to access the Perseus data.
The lists:
G'Oogle and the Internet Pharrchive - 1100 or so free Latin and Greek books.
DownLOEBables - Free books from the Loeb Classical Library
- Jeff Tirey
- Administrator
- Posts: 896
- Joined: Wed Aug 14, 2002 6:58 pm
- Location: Strongsville, Ohio
Re: Perseus Project Data Set - XML?
Hi edonnelly,
No, I haven't. I'm checking it out now and it looks very promising. Thanks!
Jeff
Textkit Founder
-
- Textkit Zealot
- Posts: 3399
- Joined: Fri Jan 03, 2003 4:55 pm
- Location: Madison, WI, USA
Re: Perseus Project Data Set - XML?
Oh, man. φεῦ φεῦ!
Before I decided on the wiki format for Scholiastae.org, I spent some quality time with the Perseus XML. I tried running it through a validating XML parser. That was not a happy experience, mostly due to missing DTD files (local and remote both). Missing entity definitions made me give up on using data that way. Now when I pilfer an open text from Perseus, I tweak a crude, hand-rolled XML parsing engine to rip out only the bits I want.
Diogenes interacts nicely with the dictionaries, but it does not really understand the Perseus XML format — just the scary record-based file system that the TLG continues to use. I have it in my mind that someone is funding Heslin to make Diogenes work with Perseus data, too, but I cannot find proof of that right now.
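To give a feel for the "rip out only the bits I want" approach, here is a hedged Python sketch that leans on the standard library's forgiving HTML tokenizer instead of a validating XML parser, so missing DTDs and undefined entities do not stop it. This is not the hand-rolled engine described above, only one possible stand-in; the class name TagTextGrabber and the sample markup are invented for the example.

```python
from html.parser import HTMLParser

class TagTextGrabber(HTMLParser):
    """Collect the text inside every occurrence of one tag name."""

    def __init__(self, wanted):
        super().__init__(convert_charrefs=True)
        # HTMLParser lowercases tag names, so compare in lowercase.
        self.wanted = wanted.lower()
        self.depth = 0
        self.buffer = []
        self.found = []

    def handle_starttag(self, tag, attrs):
        if tag == self.wanted:
            self.depth += 1
            if self.depth == 1:
                self.buffer = []

    def handle_endtag(self, tag):
        if tag == self.wanted and self.depth > 0:
            self.depth -= 1
            if self.depth == 0:
                self.found.append("".join(self.buffer).strip())

    def handle_data(self, data):
        if self.depth > 0:
            self.buffer.append(data)

# Hypothetical fragment with an undefined entity in a part we skip.
sample = (
    "<text><head>&Perseus.publish;</head>"
    "<l>μῆνιν ἄειδε θεὰ</l><l>οὐλομένην</l></text>"
)

grabber = TagTextGrabber("l")
grabber.feed(sample)
grabber.close()
lines = grabber.found
```

The trade-off is that this tokenizer knows nothing about XML rules: it will not report malformed markup, and it treats a few HTML tag names (script and style, for instance) specially, so it suits quick extraction better than faithful conversion.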
William S. Annis — http://www.aoidoi.org/ — http://www.scholiastae.org/
τίς πατέρ' αἰνήσει εἰ μὴ κακοδαίμονες υἱοί;
- Jeff Tirey
- Administrator
- Posts: 896
- Joined: Wed Aug 14, 2002 6:58 pm
- Location: Strongsville, Ohio
Re: Perseus Project Data Set - XML?
Hi Will,
It's really great hearing from you!!
It sounds like I'm in for some pain. Ideally, I would like to convert the Perseus morphological data to a friendlier JSON record set. My understanding of Diogenes is that it sits on top of the TLG, which is licensed data. I don't have the license, and I don't yet really understand how Perseus data is pulled into Diogenes; I'm assuming through an API. If Mr. Heslin has plans to tackle Perseus, that would be wonderful.
Jeff
Textkit Founder
-
- Textkit Zealot
- Posts: 3399
- Joined: Fri Jan 03, 2003 4:55 pm
- Location: Madison, WI, USA
Re: Perseus Project Data Set - XML?
You might have better luck interacting with the MySQL dumps. At the very least, you could import those into MySQL, then query the data back out and dump it into a more agreeable format.
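A minimal sketch of the "query it back out and dump it" step, in Python. To keep it runnable anywhere, sqlite3 from the standard library stands in for MySQL here; with a real import of the dumps you would swap in a MySQL client library and the actual table names. The table lemma_counts, its columns, and the numbers in it are invented for illustration and are not taken from the Perseus dumps.

```python
import json
import sqlite3

def rows_to_json(conn, query):
    """Run a query and return the result rows as a JSON array of objects."""
    conn.row_factory = sqlite3.Row  # rows become addressable by column name
    rows = conn.execute(query).fetchall()
    return json.dumps([dict(row) for row in rows], ensure_ascii=False)

# In-memory stand-in database with a hypothetical table and made-up data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE lemma_counts (lemma TEXT, lang TEXT, freq INTEGER)")
conn.executemany(
    "INSERT INTO lemma_counts VALUES (?, ?, ?)",
    [("λόγος", "greek", 120), ("verbum", "latin", 95)],
)

dumped = rows_to_json(
    conn, "SELECT lemma, lang, freq FROM lemma_counts ORDER BY freq DESC"
)
```

Going through SQL like this sidesteps the entity and DTD problems entirely, since the database import never has to parse the markup.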
William S. Annis — http://www.aoidoi.org/ — http://www.scholiastae.org/
τίς πατέρ' αἰνήσει εἰ μὴ κακοδαίμονες υἱοί;