PDF

tyke
Textkit Neophyte
Posts: 6
Joined: Fri Jan 03, 2003 4:06 pm

PDF

Post by tyke »

Hi!

First posting by me! I'd like to say that I am really glad to see so many Ancient Greek and Latin texts being made available. I have been using the Ancient Greek texts for some time and I find them very useful. However, could we just have a little think about the format in which these texts are being presented? Is PDF the best format for these texts? I would like to suggest a comparative study of PDF with another format called djvu. I recently scanned a couple of pages of a work called the History of the Greek Nation, which is written in katharevousa and is accompanied by colour plates, maps and drawings. The pages I selected were representative of the text. One example is the following: a map (in colour) with two columns of text. I scanned this at 200 dpi and saved it in TIF format (11.1MB). When I converted this into djvu format, the size was 39.3KB! Besides that, the colour map was perfectly clear and so was the text! The file could be viewed at 300% without any noticeable distortion! When I saved the TIF file as PDF, the size was 357KB and at 122% the text was slightly blurred and only at 200% was the text fairly acceptable, but in no way was it as clear and crisp as the djvu format! If this comparison is to be believed, then the djvu format takes up roughly a tenth of the space a PDF file takes up and the quality is superior! What would this mean to files of 5MB and over?

I have also tried this out on the Greek-English Lexicon by Liddell and Scott. I am not sure whether I could put this on the Net due to copyright concerns, but the text was crisp and easy to read. I would hope that if a project was carefully planned, a dictionary of a manageable size could easily be produced. By the way, a 1.3MB TIF file of the dictionary came out as a 31KB djvu file.

Finally, I would like to recommend a site I was just browsing. It's http://www.mikrosapoplous.gr/en/texts1en.htm

You will find all kinds of goodies at that site.
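
The "roughly a tenth" estimate can be checked with a few lines of arithmetic. This is only a sketch: the file sizes are the ones quoted in the post, while the helper name `size_ratio` and the choice of 1 MB = 1024 KB are assumptions made for illustration.

```python
KB_PER_MB = 1024  # assumption: the post does not say whether 1 MB means 1000 or 1024 KB


def size_ratio(larger_kb: float, smaller_kb: float) -> float:
    """Return how many times larger the first file is than the second."""
    return larger_kb / smaller_kb


# Sizes quoted for the colour map page
tif_kb = 11.1 * KB_PER_MB
djvu_kb = 39.3
pdf_kb = 357.0

pdf_vs_djvu = size_ratio(pdf_kb, djvu_kb)
tif_vs_djvu = size_ratio(tif_kb, djvu_kb)

# Sizes quoted for the dictionary page
lexicon_ratio = size_ratio(1.3 * KB_PER_MB, 31.0)

print(f"PDF is about {pdf_vs_djvu:.1f}x the size of the djvu file")
print(f"TIF is about {tif_vs_djvu:.0f}x the size of the djvu file")
print(f"Dictionary TIF is about {lexicon_ratio:.0f}x the size of its djvu file")
```

On these numbers the PDF is about nine times the size of the djvu file, which matches the "roughly a tenth" claim. The comparison still says nothing about the settings used to produce either file.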

Jeff Tirey
Administrator
Posts: 896
Joined: Wed Aug 14, 2002 6:58 pm
Location: Strongsville, Ohio

Re:PDF

Post by Jeff Tirey »

Hi Tyke,

Nice link suggestion; there's some really cool text there. Some of it looks to still be copyright protected, however. :'( I like the use of Unicode with some of the Greek text I see. Very nice, because it gets around font licensing. I'll have to take a closer look at these files later on.

This djvu sounds interesting. Where can it be downloaded? It's a format I'm not aware of. One issue with PDF is your conversion settings. There are endless adjustments you can make with Distiller, so without you stating your settings we can't make a fair comparison. If you use that default "PDF Writer" it won't give you an optimized file unless you go in with an optimized .tiff.

I know what you mean about enlarging a scanned image without distortion - it's a real headache and it's even more of a problem for those with high resolution monitors. That can be improved by scanning at a higher dpi, but as you know with that comes a larger file size. Ultimately it's an unwinnable battle because that's the nature of images that are not vector.

While our files view pretty well on a computer, we produce them with printing in mind. We scan all of our files at 300 dpi, bitonal, with .tiff type 4 compression. It produces a page that's on average about 50K. The dpi really isn't all that great, but any more and the files are just too large. When converting from .tiff to PDF our file sizes just barely increase, and this is really just PDF overhead like document security, descriptions, etc., and not due to any conversion factors.

The real reason behind us going with PDF is that it is a widely used format and almost everyone has Acrobat Reader installed.

jeff
Textkit Founder
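
To put the 50K-per-page figure in context, here is a rough sketch of what a 300 dpi bitonal page weighs before compression. The 8.5 x 11 inch page size is an assumption (the post does not give page dimensions), and `raw_bitonal_kb` is a name made up for this example. TIFF compression type 4 is CCITT Group 4 fax encoding.

```python
def raw_bitonal_kb(width_in: float, height_in: float, dpi: int) -> float:
    """Uncompressed size in KB of a 1-bit-per-pixel scan of the given page."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    bytes_per_row = (width_px + 7) // 8  # each row is padded to a whole byte
    return bytes_per_row * height_px / 1024


# Assumption: a US Letter page; the actual book pages are probably smaller
raw_kb = raw_bitonal_kb(8.5, 11.0, 300)

# 50K is the average compressed page size quoted in the post
compression_ratio = raw_kb / 50

print(f"Uncompressed: about {raw_kb:.0f} KB per page")
print(f"Implied compression: about {compression_ratio:.0f}:1")
```

Under that page-size assumption the raw bitmap is about 1 MB, so the quoted 50K average implies roughly 20:1 compression. Smaller pages would give a lower ratio.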

tyke
Textkit Neophyte
Posts: 6
Joined: Fri Jan 03, 2003 4:06 pm

Re:PDF

Post by tyke »

Jeff,

http://ww2.worldgeorge.com:7015/~mike/D ... com.exe

Try this link. You get a plug-in for your Internet browser and you get DjVuSolo, which allows you to create your own files. It's free for non-commercial use, as far as I know, which would mean that if a few of us got together, maybe we could make more material available, since I understand that many people don't have the money to fork out for Adobe Distiller etc.

If you get a chance, check it out.

By the way, I didn't use Adobe to make the PDF file. I used FineReader 6.0.

While I know that the site is dedicated to Ancient Greek, couldn't we have some Hellenistic and Byzantine Greek as well? How about Procopius' Secret History to get the eyes popping?

Robert

annis
Textkit Zealot
Posts: 3399
Joined: Fri Jan 03, 2003 4:55 pm
Location: Madison, WI, USA

Re:PDF

Post by annis »

[quote author=tyke link=board=6;threadid=105;start=0#441 date=1052548300]
While I know that the site is dedicated to Ancient Greek, couldn't we have some Hellenistic and Byzantine Greek as well? How about Procopius' Secret History to get the eyes popping?
[/quote]

Well, I think the NT and Sept have opened the door at least to Hellenistic Greek. I mean, I hope Jeff wouldn't turn away Lucian if it appeared. :)

As for eye-popping, a few days ago I sent Jeff two more books. The first is a Syriac Grammar (to go with the NT stuff). The second is a nice old German edition of Petronius, plus the Carmina Priapaea.
William S. Annis - http://www.aoidoi.org/ http://www.scholiastae.org/
τίς πατέρ' αἰνήσει εἰ μὴ κακοδαίμονες υἱοί;

Jeff Tirey
Administrator
Posts: 896
Joined: Wed Aug 14, 2002 6:58 pm
Location: Strongsville, Ohio

Re:PDF

Post by Jeff Tirey »

We'll be posting more than Attic Greek as we find and develop the content. Next up on our posting schedule is Koine.
Textkit Founder

Lex
Textkit Zealot
Posts: 732
Joined: Thu Apr 24, 2003 6:34 pm
Location: A top-secret underground llama lair.

Re:PDF

Post by Lex »

[quote author=jeff link=board=6;threadid=105;start=0#433 date=1052514249]
The real reason behind us going with PDF is that it is a widely used format and almost everyone has Acrobat Reader installed.
[/quote]

Which is why I, for one, would prefer that the files here stay in PDF format. I really don't care to install YAODFR (yet another obscure document format reader) on my machine. We've already been through the standard format argument wrt Greek text in the forum. The same arguments apply to the document format.

If people have a problem with the size of the documents, maybe they can contribute to an effort to get the scanned PDF image documents converted to the other PDF format.
I, Lex Llama, super genius, will one day rule this planet! And then you'll rue the day you messed with me, you damned dirty apes!

Jeff Tirey
Administrator
Posts: 896
Joined: Wed Aug 14, 2002 6:58 pm
Location: Strongsville, Ohio

Re:PDF

Post by Jeff Tirey »

yeh - they're staying in PDF.

and slow connection speeds are slowly becoming less and less a problem.
Textkit Founder

Carola
Textkit Enthusiast
Posts: 609
Joined: Tue Oct 15, 2002 12:34 am
Location: Adelaide, Australia

Re:PDF

Post by Carola »

I would also prefer to keep the documents in PDF as I have put some of the text books on a CD so I can use them when I am doing assignments in lunch hours etc - it would be difficult to read them on another computer - especially for university students etc. or people using a computer at work who can't load a different "reader" program.
The CD might be a good idea, however. Perhaps you could load a selection of Latin or Greek textbooks and a few of the "standards" everyone studies and sell them for a moderate cost + postage. There is a site on the internet specialising in desert plants (another hobby of mine) which put out a CD of collected articles & photos, which I purchased and have found very useful.
While high speed connections are common in USA & Australia, this is not the case in many other parts of the world. And of course students everywhere are usually at the bottom of the financial ant heap!

tyke
Textkit Neophyte
Posts: 6
Joined: Fri Jan 03, 2003 4:06 pm

Re:PDF

Post by tyke »

Hi!

Can't really agree with the idea that it is YAODFR - it's being used all over Europe and America by some of the biggest libraries and universities. As for being a problem to read on other computers, if you are going to use a CD surely you could burn the programme on the CD - it's only 3MB. There is also a plug-in for various Internet browsers. While I can appreciate that PDF is widely used, it doesn't mean to say it is the best. It's very expensive to buy Distiller and the quality is all too frequently mediocre. What I am asking is for someone to give DJVU a whirl.

The ideal solution of course would be to OCR the texts and have them as HTML files or .DOC files. There is a programme to do this called ANAGNOSTIS 4.0 (a Greek product) which has had favourable reviews and which certainly reads Polytonic texts. Only problem is it costs 500 Euros. Can anybody persuade their departments to invest in this?

Another solution is to bombard OCR software companies with requests to support Polytonic Greek. I use Finereader 6.0 and I have written to them on several occasions to get them to include Ancient Greek in the next version. So far I am still waiting.

Robert

Lex
Textkit Zealot
Posts: 732
Joined: Thu Apr 24, 2003 6:34 pm
Location: A top-secret underground llama lair.

Re:PDF

Post by Lex »

[quote author=tyke link=board=6;threadid=105;start=0#502 date=1052856051]
Can't really agree with the idea that it is YAODFR - it's being used all over Europe and America by some of the biggest libraries and universities. [/quote]

I'd never heard of it before, so it's obscure enough. Besides, many people visit TextKit from computers that they can't install new programs onto (at work, school, the library, etc.). Acrobat Reader is going to be more common on those machines than DjVu.

[quote author=tyke link=board=6;threadid=105;start=0#502 date=1052856051]
While I can appreciate that PDF is widely used, it doesn't mean to say it is the best.
[/quote]

This is true. DjVu may be technically better. But there are other issues than speed or size. Like it or not, the PDF format is a de facto standard for many places. Does DjVu have a reader for Windows 95/98/NT/2000/XP, Linux, Macintosh, HP-UX, Solaris, IBM-AIX, DEC OSF-1, and SGI Irix? Acrobat does.

We just had this very same argument regarding how Greek text should be represented on this site. The way I would have preferred would have allowed us to see picture perfect Greek if we have the proper font installed. But if the reader couldn't have that font installed, and was trained in the standard ASCII convention for representing ancient Greek, the text would have been very difficult to read. A compromise was reached which would give tolerable legibility to both those with and without the font. So something that I thought was technically better, wasn't, because I was only taking into account my desires.

[quote author=tyke link=board=6;threadid=105;start=0#502 date=1052856051]
The ideal solution of course would be to OCR the texts and have them as HTML files or .DOC files.
[/quote]

If you mean MS .doc files, they are only readable by people with MS Office.
HTML would have the same font problem, unless the Greek portions are included as graphics, which would make the books huge and ungainly.
I, Lex Llama, super genius, will one day rule this planet! And then you'll rue the day you messed with me, you damned dirty apes!

Jeff Tirey
Administrator
Posts: 896
Joined: Wed Aug 14, 2002 6:58 pm
Location: Strongsville, Ohio

Re:PDF

Post by Jeff Tirey »

Hi Tyke:

I'm still in the PDF corner in terms of accessibility. I don't know how many downloads I would have if everyone had to first install a client. Visitors don't like installing things to view content -- think how many visitors freak out when they get the prompt to install Macromedia's Flash plug-in, and that takes about 6 seconds. Heck, I don't really like to install things either.

Adobe has spent the better part of a decade and millions of dollars trying to get their Acrobat Reader installed onto every computer - not an easy task. So why reinvent the wheel? Also, you don't need Distiller to create PDF files. True, it's a stable and robust application, but there are other freeware and very cheap products that can create PDF files. With PDF quality - there are tons of talented people out there who create PDFs correctly. I doubt if I'm always one of them but I try.

If Distiller creates poor results I would think that it's more likely to be conversion error on the user's part than any flaw with the software. Never forget that PDF is a printing format above all else. The format has morphed over the years and is asked to do other things, but that's its trump card and I promise you that our grammars will print with very good results. Sometimes the printed PDF looks better than the original page because I brighten the image and digitally remove notes and other markings.

As for OCR - digital text would be great - but I’m taking the "George Lucas" approach.

Remember how long Lucas waited before producing episode 1? Well, maybe you don't remember. But for me, the time difference was something like watching episode 6 in 4th grade and episode 1 in graduate school.

Why did he wait so long? He wanted technology to catch up with his ideas. I bet all that modeling and stop action animation gave him an ulcer.

Same thing with Textkit.
Why beat myself over the head now trying to convert thousands of pages of text when, if I wait it out, there just might be a very accurate and robust application down the road? I'm willing to wait and see. Besides, I still need to build the .tiff library before OCR anyways.

So think of Textkit as in phase 1 - building our library. I'm hoping that 5 years down the road we can start converting to digital text - but it's way too much of a headache right now. Trust me - I have tried. I used Adobe Acrobat Capture to OCR a few grammars. $600 and $.03 per page on the meter later (yes, the software is metered) I realized it was a serious waste of time and money - and that's just the English text! Funny thing is that a product called Textbridge Pro for like 9.99 did just as good a job as Acrobat Capture - but it too wasn't good enough. I'll admit it - when it comes to OCR I'm a little bitter.

Foreign language books have to be 100% accurate and not 99%. Can you imagine a table of paradigms or declensions with incorrect text?!? There would be some poor kid in Latin class going "regam, regas, reget" :o I would get hate mail for sure ;D !
Textkit Founder
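
A quick sketch of why 99% is not good enough for a grammar. The page and table sizes below are hypothetical round numbers, not measurements from any Textkit book, and the model assumes each character is misread independently, which real OCR errors are not.

```python
def expected_errors(chars: int, accuracy: float) -> float:
    """Expected number of wrong characters at a given per-character accuracy."""
    return chars * (1.0 - accuracy)


def prob_error_free(chars: int, accuracy: float) -> float:
    """Chance that every character is right, assuming independent errors."""
    return accuracy ** chars


CHARS_PER_PAGE = 2000  # hypothetical: a dense page of grammar text
TABLE_CHARS = 100      # hypothetical: one small paradigm table

errors_per_page = expected_errors(CHARS_PER_PAGE, 0.99)
clean_table_chance = prob_error_free(TABLE_CHARS, 0.99)

print(f"Expected errors per page at 99%: {errors_per_page:.0f}")
print(f"Chance a {TABLE_CHARS}-character table is error-free: {clean_table_chance:.0%}")
```

With these assumed sizes, 99% accuracy leaves about 20 wrong characters on every page, and a 100-character paradigm table comes out clean only a little over a third of the time.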

tyke
Textkit Neophyte
Posts: 6
Joined: Fri Jan 03, 2003 4:06 pm

Re:PDF

Post by tyke »

Jeff,

I don't really mind if we use PDF or any other format, although I still would prefer DJVU to PDF as far as cost and quality are concerned. It supports various OSes such as Windows, Linux and Solaris and was developed by AT&T Labs, and as far as I know it is being used in universities and libraries for storing things from manuscripts to technical drawings and maps. I have been using it for Sanskrit and have found it much better than .PDF. For any other information try http://www.lizardtech.com/solutions/document/

By the way, I am in no way associated with the company or the product.

As far as OCR is concerned, Finereader 6.0 is light years ahead of Omnipage, although there is still no support for Ancient Greek. Modern Greek is supported with spell checking capabilities. You can also turn .TIF into .PDF files. If you want a demo version of an OCR engine that can deal with Ancient Greek, try this:
http://www.ideatech-online.com/

I have seen the software perform and it is very accurate, but the big drawback is the price -- 585 dollars, which I think is totally unacceptable.

Something else that perhaps needs mentioning is PHP CDROM (I think that's Packard Hewlett - won't swear to that) which has a full Ancient Greek corpus with numerous Byzantine authors. Again the drawback is the price - 300 dollars I think and you have to get a search engine like Musaios which allows you to view only 25 lines at a time and the font sucks!

Same thing has happened with the Old English Corpus. You have to pay through the nose to view your linguistic and literary heritage.

Finally, I have written a Greek-English / English-Greek Dictionary with over 21000 entries [the words only, not the dictionary programme]. Unfortunately the Greek is modern Greek only. There is a way to produce a dictionary for Ancient Greek using this freeware software.
If anyone with more knowledge of computers could look at the dictionary programme, I am sure we could all get together and enter our own words. Anyway, if anybody is interested the site is:
http://www.freelang.net/

Robert

Lisa
Textkit Neophyte
Posts: 27
Joined: Thu May 22, 2003 8:38 pm
Location: Somerville, MA

Re:PDF

Post by Lisa »

Hi,

> Something else that perhaps needs mentioning is PHP CDROM (I think that's Packard Hewlett - won't swear to that) which has a full Ancient Greek corpus with numerous Byzantine authors. Again the drawback is the price - 300 dollars I think and you have to get a search engine like Musaios which allows you to view only 25 lines at a time and the font sucks!

You are confusing the PHI (Packard Humanities Institute) with the TLG (Thesaurus Linguae Graecae). PHI is mainly Latin (with a disc of Greek papyri and inscriptions, but not texts), while TLG is exclusively Greek. The texts on the latter are beta code in the back end, and displayed by whatever program you use to read the disk/site. (The old SMK fonts are not pretty, but the project began before most users had color displays, let alone font smoothing, etc.)

Both projects started with the notion of collecting the data: both required separate software to read and search the data. (Pandora was one of the early such programs.) These programs were created by others, as the mission of both organizations was not software development. The scope of both projects required huge investments for materials which have a very small audience, so the model was to provide the CDs (and now site, in the case of the TLG) to institutional subscribers. (Both CDs were/are the staple of most classics departments, so this was a success.)

There are many such examples of this model: CDs of Shakespeare collections cost thousands of dollars per license. This is because digitization and editing (and when applicable, rights) costs are high and the audience is small. This month on the Classics-L, for instance, came this announcement:
> "Thesaurus Linguae Latinae on CD-ROM" (TLL 1), covering (only) words
> starting with O and P, for Euro 840, ISBN 3-598-40707-6. Future
> releases will cover more letters....
http://omega.cohums.ohio-state.edu:8080 ... /0089.html (some of the replies are of interest, too)

This is not to excuse the publishers, but to point out this is the way that things have operated and continue to in many quarters. Someone has to make the investment for these types of endeavors, and until there is an academic publishing revolution, change will be slow in coming.

I may have glossed over some of the nuances of the TLG & PHI history, but I think any good search engine will cover the literature on the development and history of these projects.
TLG is here: http://www.tlg.uci.edu/~tlg/
PHI site is not yet up but will live here: http://www.packhum.org/

Best,
Lisa
Managing Editor
Perseus Project

tyke
Textkit Neophyte
Posts: 6
Joined: Fri Jan 03, 2003 4:06 pm

Re:PDF

Post by tyke »

Lisa,

yes, I should not have mixed up TLG with PHI - it was a rather pathetic mistake.

I did read some of the comments about the TLL CDROM and I find myself wholeheartedly in sympathy with the more critical opinions of the project. Of course the main subscribers are going to be University Classics departments, but what about the 'amateur' classicist, who is passionate in his admiration for and love of the old and venerable languages? I certainly can't afford to fork out such sums as are being bandied about by certain publishers.

With the TLG a more enlightened policy does appear to be taking place, but it still riles me that I have to pay (and pay through the nose) to see the literary jewels of my country in digital form. It is happening with Ancient Greek and Latin. You made the point about Shakespeare and I would like to add to that the Old English Corpus, which used to be available to anyone, but has now become available by subscription only - and a hefty one to boot.

My rant - I wouldn't dignify it with any expression as lofty as argument - is that the literary and linguistic heritage left by my / our forefathers should be freely available to anyone and everyone. How does copyright come into play here? (I am not accusing, simply wondering.) Surely these authors have been dead for more than 90 years or so. If, in fact, copyright does come into play here, how on Earth do people get copyright for works so old? It would be ludicrous to think that they asked permission of the authors' descendants! If I can go to a site like PG and download works by Charles Dickens and John Milton, why do I have to pay for works by Procopius, Apollodorus and Hesiod - just to name an eclectic few?

You are a Perseid! Well, congratulations on that. I have followed the Perseus project ever since I got on the Net and it certainly is one of the best projects I have seen.
It almost makes me want to go out and buy it - at least the cut down version at $100.

Robert

annis
Textkit Zealot
Posts: 3399
Joined: Fri Jan 03, 2003 4:55 pm
Location: Madison, WI, USA

Re:PDF

Post by annis »

Robert,

I think a lot of the cost in these projects has little to do with copyright of the author per se, but rather the effort that it takes to get accurate digital texts. Optical Character Recognition (OCR) isn't very good with even English, and with polytonic Greek it is a good deal worse, so bad, if I recall correctly, that Perseus sends their texts out to be hand keyed-in from the start, and doesn't even bother with any OCR attempts.

So, someone has to be paid to type in the texts. This is a slow and error-ridden process, as I know all too well from typing in even the very small poems at Aoidoi.org. Especially when familiar with a text it is too easy to proof a line and see it as correct when in fact it is not. So then you have to hire proofreaders. Probably several passes of that are necessary. They also have to be paid.

Finally, though I've not seen TLG, I would assume that to be truly useful to scholars it would have to have some sort of critical apparatus in place, and this will represent a huge time investment. The book edition of M.L. West's most recent editing of the Iliad (in two volumes) for Teubner is more than $40 apiece in paperback. Critical editions cost a lot, and we must accept the simple fact that for 99.9% of people reading the Iliad in Greek, the app.crit. is required. Those of us reading for pleasure are probably happy most of the time with just the text, but I cannot honestly say I've seen any modern editions like that. Only in used bookstores did I find a Herodotus just for reading... page and line numbers, no app.crit.

The critical apparatus is copyrightable, as is a particular editor's rendition of a text. Which is why the Perseus texts are copyrighted... the texts are mostly Oxford's. For Aoidoi.org I use two texts that are very recent to produce some of the poems.
Since I'm comparing these with older ones, and basically creating my own version of the text, I avoid copyright problems. Certainly Anacreon's descendants don't have any say in this process. M.L. West and Teubner would if I just started copying verbatim.

So, while I would certainly love to get my hands on TLG for free, or at least at a more reasonable cost, the reality is that the cost of producing it well will keep it out of non-professional hands. I agree this is very annoying. As a Unix guy, I want to give useful information away as much as possible! But I can see how TLG policy ended up the way it has.

So for now, supporting Textkit's projects seems the best alternative. Even if PDFs of book page images aren't flexible, they at least get the texts out there. And that ain't free neither.
William S. Annis - http://www.aoidoi.org/ http://www.scholiastae.org/
τίς πατέρ' αἰνήσει εἰ μὴ κακοδαίμονες υἱοί;

Lisa
Textkit Neophyte
Posts: 27
Joined: Thu May 22, 2003 8:38 pm
Location: Somerville, MA

Re:(was PDF) copyright & overhead

Post by Lisa »

Robert,

I'm not familiar enough with the history of the TLG to speak to their expenses, but all of the Greek in Perseus was double-keyed professional data entry to an accuracy rate of 99.95%. This is quite expensive: a few thousand dollars for an average size Loeb text. (Say, something like $1200/Mb.) Then, there is the programming, editing, and processing required to take raw SGML and make it into something viewable. And the TLG has more editors working with more textual oddities and a far larger corpus than we have. It is the work we have done tagging and adding structure to the works which we assert copyright over (the Perseus environment), most of which is invisible to the user (one hopes).

Copyright comes into play because every edition, Teubner, Oxford, Loeb, etc., includes editorial content. Unless you are transcribing a manuscript (which would require the permission of the manuscript owner), you are culling a work from a variety of sources, and making decisions on which accents go where, which words are spurious, and what reading makes the most sense. All that app. crit. is considered intellectual property (as well it should be).

Perseus did offer full text downloading when we first went on-line, and we are hoping to implement PDFs of all public domain texts in the near future. We had to stop the full downloads due to other agreements with publishers of works in Perseus. Remember, we had contracts which allowed us to produce CDs, but the legal eagles never envisioned the WWW, so we were tied by the terms of the agreements and professional courtesy to those who had provided us support.

I agree with William wholeheartedly that this stuff should be as widely-available as possible, but publishers will not hesitate to shut down a site which is in violation of copyright or other agreements. A free site is a problematic model over the long term, though. Even a bus-station grey WWW page with .txt files costs money.
(And something as well-designed as Textkit costs more money.) Fortunately, there are many who are willing to use their WWW access for the purpose of spreading a bit of ancient culture, but remember that none of this is free.

For the record, we did not set the price point of the CDs and we don't get any revenue from them. (Can you tell I've answered this before?) It is interesting to contrast university press pricing with that of the fine folks of Centaur Systems, say. Different operations with different visions of their audiences.

Hope that your weather is not as wet and grey as ours up here,
Best,
Lisa
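
The keying figures above can be turned into a back-of-envelope estimate. This is a sketch under stated assumptions: "Mb" is read as megabytes, one byte is counted as one character, 1 MB is taken as 1,000,000 bytes, and the 2.5 MB text size is a hypothetical example rather than a Perseus figure.

```python
RATE_PER_MB = 1200.0     # dollars per MB, as quoted in the post
KEYING_ACCURACY = 0.9995  # double-keyed accuracy quoted in the post
CHARS_PER_MB = 1_000_000  # assumption: 1 byte per character, 1 MB = 10^6 bytes


def keying_cost(size_mb: float, rate_per_mb: float = RATE_PER_MB) -> float:
    """Data-entry cost in dollars for a text of the given size."""
    return size_mb * rate_per_mb


def residual_errors(size_mb: float, accuracy: float = KEYING_ACCURACY) -> int:
    """Approximate number of wrong characters left after keying."""
    return round(size_mb * CHARS_PER_MB * (1.0 - accuracy))


text_mb = 2.5  # hypothetical size for one text
cost = keying_cost(text_mb)
errors = residual_errors(text_mb)

print(f"Keying a {text_mb} MB text: about ${cost:,.0f}")
print(f"Characters still wrong at 99.95%: about {errors:,}")
```

A hypothetical 2.5 MB text comes to $3,000 under these assumptions, which is in line with "a few thousand dollars". Even at 99.95% it would still contain on the order of a thousand wrong characters for editors to catch.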
