Ars transcribendi

Carolus Raeticus
Textkit Enthusiast
Posts: 584
Joined: Mon Jun 07, 2010 11:46 am


Post by Carolus Raeticus »

Salvete,

I was approached about the subject of transcriptions: how do I transcribe texts? I started writing an answer, and after a while it got rather long, so long in fact that it merits a place where others can read it as well.

How did I create transcriptions for Project Gutenberg, for example the Project Gutenberg version of Meissner's "Latin Phrase Book"? First of all, I did not create the Meissner transcription explicitly for Project Gutenberg (PG). Creating a version for PG only came as an afterthought. My first version used Visual Basic for Microsoft Office to create a Word document (horribly slow), which I then converted to PDF. Later on I used JavaScript to create an HTML version. And for producing the PG version I used a Python script (Visual Basic, JavaScript, and Python are programming languages) to create both an HTML version and a text version. Both needed some tweaking by hand, but not very much.

However, these conversion scripts are of less importance than the INPUT FILE which they use(d) as the basis to create the desired version. The input files for my transcription projects use a special format (varying depending on the specific text) to describe such things as Latin text, section headings, footnotes, etc. It is this input file which is important; anything else is merely a derivation. This is why I provide not only the final version (PDF, HTML), but also the input file for some of my projects, so that it can be used for whatever purpose is wanted (e.g. to create an app for smartphones).

In my opinion the worst approach to a transcription project is using a word processor like Microsoft Word or LibreOffice for transcribing the text and immediately applying formatting. Using an input file in which things like headings, italics, bold, etc. are indicated using a simplified format (or tags) makes it very easy to change the appearance of the final version later. I exclusively create transcriptions using input files.

Before starting to transcribe the text one needs to have a close look at the text itself and think about which information needs to be preserved: headings of various levels/importance, bold/italic/underlined, etc. A few examples of the simplified formatting tags I am currently using:
  • /xxx/ or _xxx_ = Italics ("/" or "_" depending on the number of slashes in the text)
  • #xxx# = Bold
  • +xxx+ = Level 1 headings
  • ++xxx++ = Level 2 headings (+++, ++++, etc.)
  • {xxx} = Footnote
I deliberately do NOT use official HTML-tags like <i>xxx</i> when transcribing because that slows me down (well, actually I do use HTML-tags, but only for the more unusual formatting like nested lists). Try typing the following:
  • HTML-formatting: <i>Aurea</i> <b>prima</b> sata est <i>aetas</i>.
  • Simple-formatting: /Aurea/ #prima# sata est /aetas/.
The difference may not feel like much, but when you have a lot of text, the difference DOES matter! Using my own simple tags I can transcribe rapidly and easily and ideally enter some state of "flow", whereas using HTML-tags (or any other set of "complicated" formatting tags) has a "disruptive" influence on the act/process of transcribing.
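To make the idea concrete, here is a minimal sketch of how the simple tags above could be turned into HTML by a script. It is not the script used for the Meissner transcription; the function name simple_to_html and the tag table are assumptions made for illustration, and nested markers (italics inside bold) are not handled.

```python
import re

# Hypothetical mapping from the simple one-character markers to HTML tag names.
TAGS = {"/": "i", "#": "b"}

# A marker, the shortest possible run of text, then the same marker again.
PATTERN = re.compile(r"([/#])(.+?)\1")


def simple_to_html(line: str) -> str:
    """Replace /xxx/ with <i>xxx</i> and #xxx# with <b>xxx</b> in one pass."""
    def repl(match: re.Match) -> str:
        tag = TAGS[match.group(1)]
        return f"<{tag}>{match.group(2)}</{tag}>"

    # A single pass means the slashes in the generated closing tags
    # are never rescanned as italic markers.
    return PATTERN.sub(repl, line)


print(simple_to_html("/Aurea/ #prima# sata est /aetas/."))
```

Run on the simple-formatting line from the example, this should reproduce the HTML-formatting line shown above it.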

I prefer single-character formatting tags whenever possible. These should be chosen carefully, however, so that they do not occur (too often) as ordinary text. I use /xxx/ for italics and "\/" to escape the slash for "real" slashes (e.g. "Allan\/Greenough"). If I had a text with many real slashes I would choose a different formatting marker from "/". Also, a formatting marker (#, /, _, +) needs to be easy to type. I avoid the "@" for this reason because I tend to mistype when using it. I generally avoid characters which require finger gymnastics, except perhaps for formatting tags which are rare. But in that case I tend to use HTML-code (or fake HTML-code).
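The escaping convention can be sketched in a few lines of Python. This is only one possible way to honour "\/" (a negative lookbehind in the regular expression); the function name convert_italics is an assumption, not something taken from an existing script.

```python
import re

# An italic marker is a slash that is NOT preceded by a backslash.
ITALIC = re.compile(r"(?<!\\)/(.+?)(?<!\\)/")


def convert_italics(line: str) -> str:
    """Turn /xxx/ into <i>xxx</i>, leaving escaped slashes (\\/) as real slashes."""
    html = ITALIC.sub(r"<i>\1</i>", line)
    # Only after the markers are converted are the escapes replaced by plain slashes.
    return html.replace("\\/", "/")


print(convert_italics(r"See /Allan\/Greenough/ for details."))
```

The order matters: if the escapes were replaced first, the real slash would be indistinguishable from an italic marker.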

So, how do I go about a transcription project?
  • Choose the text to be transcribed
  • Get a copyright clearance from Project Gutenberg (no sense in transcribing a text you cannot upload anywhere legally).
  • Have a close look at the text and decide what aspects of it (especially formatting) should be preserved
  • Decide on the set of simplified formatting tags
  • Test the formatting tags on a page or two of representative text
  • "Transcribe" (= type) the text for the input file.
  • Write/adapt a script for the conversion of the input file into an HTML or text version
  • Use the script on the input file
  • Manually tweak the result
  • Upload the final versions to Project Gutenberg.
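The conversion step in this workflow might look roughly like the sketch below, which handles the heading and footnote markers listed earlier. Everything in it is an assumption for illustration: the function name convert, the rule that each non-empty input line is one paragraph, the way footnotes are collected at the end, and the invented sample text. A real script would be adapted to the specific book.

```python
import re

# One or more "+" at the start, the same run of "+" at the end: a heading.
HEADING = re.compile(r"^(\++)(.+?)\1$")
# {xxx} marks a footnote embedded in the running text.
FOOTNOTE = re.compile(r"\{(.+?)\}")


def convert(lines):
    """Convert input-file lines to HTML, collecting footnotes at the end."""
    body, notes = [], []

    def note_ref(match):
        # Store the footnote text and leave a numbered reference in its place.
        notes.append(match.group(1))
        return f"<sup>{len(notes)}</sup>"

    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # blank lines only separate paragraphs
        heading = HEADING.match(line)
        if heading:
            level = min(len(heading.group(1)), 6)  # HTML stops at <h6>
            body.append(f"<h{level}>{heading.group(2)}</h{level}>")
        else:
            body.append("<p>" + FOOTNOTE.sub(note_ref, line) + "</p>")

    for number, text in enumerate(notes, start=1):
        body.append(f'<p class="footnote">{number}. {text}</p>')
    return "\n".join(body)


sample = ["+Mundus+", "", "Terra{The earth.} rotunda est."]
print(convert(sample))
```

Because the input file stays untouched, the same lines can later be fed to a different function that produces a text version or some other format.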
Project Gutenberg has some guidelines for the text and HTML versions (length of lines, use of "_" to indicate italics, etc.). If you provide only a text version, the HTML one will be automatically generated based on the text version. However, I want a reasonably good-looking HTML version and therefore create it myself.
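For the text version, a sketch along the following lines could apply the "_" convention for italics and wrap the lines. The width of 72 characters and the decision to simply drop the bold markers are assumptions; the current Project Gutenberg guidelines should be checked for the actual requirements. The function name to_plain_text is likewise hypothetical.

```python
import re
import textwrap


def to_plain_text(paragraph: str, width: int = 72) -> str:
    """Render one input paragraph as wrapped plain text with _italics_."""
    text = re.sub(r"/(.+?)/", r"_\1_", paragraph)  # /xxx/ becomes _xxx_
    text = re.sub(r"#(.+?)#", r"\1", text)         # assumption: bold is dropped
    return textwrap.fill(text, width=width)


print(to_plain_text("/Aurea/ #prima# sata est /aetas/."))
```

textwrap.fill is part of the Python standard library and breaks the text at whitespace so that no line exceeds the given width.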

The tools of the trade: a text editor (e.g. Notepad++ on Windows). To make myself absolutely clear: a text editor, not a word processor. I put the scanned book on the left side of my computer desktop, place the window of the text editor to the right of it, and type away. I tried typing from a printed version, but that did not work out for me. I feel that it is easier on the eyes (and less distracting) to have both the input (scanned book) and output (text editor) side by side.

Okay, this should give you a general idea of how I do transcribing. The reality is a tad more complicated, but this is more or less the workflow.

Valete,

Carolus Raeticus
Sperate miseri, cavete felices.

bedwere
Global Moderator
Posts: 5103
Joined: Fri Mar 07, 2008 10:23 pm
Location: Didacopoli in California

Re: Ars transcribendi

Post by bedwere »

Impressive. Thanks!
