
Transcription of the text

Introduction

This page takes a closer look at the transcription of the Voynich MS text. It addresses both the history of transcription and the present status. This page is very relevant for those who are interested in the interpretation of the Voynich MS text, but it can be skipped by those who are not.

The locations of the main transcription files are summarised in a section at the bottom of this page.

Purpose

The term transcription refers to the conversion of the handwritten text of the Voynich MS into a computer-readable file. The purpose of this is to allow computer software to analyse the text, for example to compute statistics or to aid the interpretation and ideally the translation of the text. This type of transcription was first done for processing by so-called 'tabulating machines', in the days of William Friedman. Many years before that, in the 1930's, Fr. Th. Petersen of the Catholic University of America already made a complete hand transcription (i.e. a hand-written copy) of the MS using the original alphabet. This document, which has many interesting annotations made by Petersen, is still preserved in the William F. Friedman collection of the Marshall Library in Lexington (Va).

A transcription should be clearly distinguished from a proposed translation of the text. The only purpose of a transcription is to represent each handwritten character in the MS consistently by a symbol in a computer-readable file. It doesn't really matter which symbol is used for which character. Ever since the 1940's, different people have used different conventions, or transcription alphabets. In most cases they made different assumptions on what constitutes a single character in the Voynich MS.

History

FSG

The earliest well-known transcription of large parts of the Voynich MS was made in the 1940's by the so-called First Study Group (FSG) of William Friedman, which was briefly described here. The group had defined a transcription alphabet agreed by all members of the team. Because of the desire for secrecy by Friedman, for a very long time nobody outside his team was aware of this transcription exercise or the alphabet they used.

The FSG alphabet uses capital letters and numbers. It has a rather unusual method for transcribing the 'intruding gallows', namely by using a special symbol (Z) for the intruded pedestal (ch).
The FSG transcription is described in detail in Reeds (1995) (1). A printout of this transcription was discovered by Jim Reeds in the above-mentioned Marshall library, and together with Jacques Guy he entered it in computer readable form. The resulting file is >> still available for downloading.

Bennett

In 1976 Prof. William R. Bennett of Yale University published a book on the use of the computer to analyse certain properties of language and text (2). In this book, he dedicated a chapter to an analysis of the Voynich MS text, and for this purpose he needed a transcription alphabet. Bennett's alphabet has not been used outside of this publication.

Currier

Roughly at the same time, Prescott Currier, working on his own, also devised a transcription alphabet, and he transcribed a significant part of the MS using this alphabet. His alphabet uses the capital letters A-Z and the numbers 0-9 (i.e. all 36). It does not represent some characters which the FSG does represent, and it uses single characters for what appear to be composite characters in the MS. When Currier presented his findings at the 1976 Voynich MS symposium, Mary D'Imperio suggested that it would be important that all researchers use a unified alphabet. She had also started transcribing parts of the MS using her own alphabet, which she abandoned in favour of Currier's, and the two files were merged. The combined transcription of Currier and D'Imperio, and the transcription alphabet designed by Currier, have been used well into the 90's, also in the earliest years of the internet.

The following table shows the three above-mentioned 'historic' alphabets together.

Bennett  FSG    Currier    Bennett  FSG    Currier
D        4      4          I        I      I
O        O      O          IL       IE     G
S        8      8          IIL      IIE    H
G        G      9          IIIL     IIIE   1
Z        2      2          IQ       IR     T
L        E      E          IIQ      IIR    U
Q        R      R          IIIQ     IIIR   0
CT       T      S          U        L      D
ET       S      Z          N        N(*)   N
H        H      P          M        M(*)   M
P        P      B          IM       IIIL   3
K        D      F          -        K      J
F        F      V          -        IK     K
CHT      HZ     Q          -        IIK    L
CPT      PZ     W          -        IIIK   5
CKT      DZ     X          -        (6)    6
CFT      FZ     Y          -        (7)    7
A        A      A          Y        Y      (n)
C        C      C          V        V      (v)

(*) Note: Tiltman used the FSG alphabet, but instead of N and M wrote IL and IIL.

One thing that immediately appears from this table is that the various transcribers did not agree on what constitutes a single character in the Voynich MS text, and this turns out to be one of the most difficult points for transcription. Later, Jim Reeds showed that many characters in the Voynich MS cannot be represented exactly by any of the existing alphabets. There are some 'rare' characters, and in addition there are what appear to be ligatures of several characters. These characters have come to be called 'weirdoes', though here I will use the term 'rare characters'. Jim Reeds produced a document with an overview of these rare characters, calling them 'X1' to 'X128'. An excerpt of this barely known document is shown below.

Transcription in the age of the internet

With the emergence of the internet and the start of the Voynich mailing list, transcription of the parts that were still missing from Currier's and D'Imperio's efforts was continued using Currier's alphabet. A number of characters using lower case were added to this alphabet. These are already included (in parentheses) in the above table. As part of this new transcription, a more or less standard format for transcription files was introduced, which will be described further below.

In parallel, a completely new type of transcription alphabet was designed:

Frogguy

The 'Frogguy' transcription alphabet was designed by Jacques Guy in 1991. It uses lower-case characters, numbers and diacritical marks. Rather than representing complete characters, it represents the 'strokes' of the script of the Voynich MS, a property we may call 'analytical'. As a result, it produces the closest similarity with this script. At the same time, some characters in the MS which may be assumed to be a single character are represented by several characters in this alphabet. By its nature, it is the first alphabet that allows the representation of many of the ligatured characters mentioned above. Jacques Guy also introduced the so-called 'capitalisation rule', which means that a character that is connected to the following one should be represented by a capital letter. The frogguy transcription alphabet and its use are explained in more detail at >> one of Jim Reeds' web pages.

An additional important contribution from Jacques Guy was a tool that allowed translation of transcription files from one transcription alphabet to another. This tool, called BITRANS, was written in Pascal and ran in DOS (it still would). It could handle complex translation rules, such as those needed for the Currier alphabet, where, for example, the translation of the character i depends on context.
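The context dependence mentioned above can be illustrated with a longest-match substitution. This is only a minimal sketch and not BITRANS itself, whose rule syntax is not described on this page. The function name `transliterate` and the small rule set are hypothetical; the rule set contains a few FSG-to-Currier pairs as read from the table of historic alphabets above.

```python
# Hypothetical rule set: a few FSG -> Currier pairs, for illustration only.
FSG_TO_CURRIER = {
    "IIIE": "1",
    "IIE": "H",
    "IE": "G",
    "I": "I",
    "E": "E",
    "O": "O",
}


def transliterate(text, rules):
    """Convert text by always applying the longest rule that matches."""
    keys = sorted(rules, key=len, reverse=True)  # try the longest keys first
    out = []
    pos = 0
    while pos < len(text):
        for key in keys:
            if text.startswith(key, pos):
                out.append(rules[key])
                pos += len(key)
                break
        else:
            # No rule applies here: copy the character unchanged.
            out.append(text[pos])
            pos += 1
    return "".join(out)


print(transliterate("OIIE.OIE.OI", FSG_TO_CURRIER))
```

Trying the longest keys first is what makes the result depend on context: an I followed by E is absorbed into a composite character, while an I on its own is copied through.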

The frogguy alphabet is the closest in appearance to the Voynich MS text, and it solves the main inconsistency of the Currier alphabet, namely that strings of consecutive i characters are all synthesised into single characters, whereas strings of consecutive e characters are not (3). However, it has a distinct disadvantage in that it is not very well suited for the type of numerical analysis mentioned at the top of this page, primarily because of its frequent use of the apostrophe, and to a lesser extent the mixture of letters and numbers. When Gabriel Landini and myself embarked on a new transcription of the MS, we decided that a new alphabet, but one very similar in principle to Frogguy, would be needed.

Eva

EVA originally stood for European Voynich Alphabet, but later became 'Extensible Voynich Alphabet'. It was designed to be similar to Frogguy, yet use only alphabetical characters. It was designed by Gabriel Landini and myself with important contributions and suggestions from Jacques Guy. Its design was part of a larger scheme which included:

'Basic Eva' is the set of (max.) 26 lower case characters and 'Extended Eva' includes the representation of all types of rare characters. The basic Eva alphabet was chosen such that the transcribed text is almost pronounceable. The aim of this excellent idea from Gabriel was not to be able to 'speak Voynichese'; rather, it makes it very easy to recognise and remember transcribed words.

The rare characters can be classified into four categories:

  1. Unusual or rare single characters
  2. Ligatures of only basic characters
  3. Ligatures including rare single characters
  4. Anything else

These were all identified in the course of the transcription exercise. As it turned out, there were three single characters that should be called unusual because they did not appear in the Currier or Frogguy alphabets, but which did appear more than 10 times in the MS. Each of these three characters was assigned its own 'basic Eva' letter: b, u and z respectively. The table of basic Eva is included here:

Individual rare characters, or rare parts of ligatures, were assigned a 'high ASCII' code, and the way to include them into transcription files is as follows: &185; for ascii code 185. The following image shows all such rare characters (6).
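The &185; notation can be converted to and from single characters with a few lines of code. The following is a minimal sketch, not the conversion used by any existing tool. The function names are hypothetical, and it assumes that a code maps directly to the character with the same code point; the actual character set of the font is not specified here.

```python
import re


def decode_rare_chars(text):
    """Replace each &NNN; code with the single character of that code."""
    return re.sub(r"&(\d+);", lambda m: chr(int(m.group(1))), text)


def encode_rare_chars(text):
    """Replace every character with a code above 127 by its &NNN; form."""
    return "".join(f"&{ord(c)};" if ord(c) > 127 else c for c in text)


# Made-up example word containing one rare character.
word = "qo&185;dy"
decoded = decode_rare_chars(word)
print(len(word), len(decoded))
```

Keeping the files in the &NNN; form means they stay pure 7-bit ASCII, at the cost of a rare character occupying several positions until it is decoded.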

Ligatures may be represented in two ways. The first is the previously mentioned capitalisation rule. As the table of basic Eva above shows, this method allows an accurate representation of the text using the Eva True Type font. The capitalised character always connects to the next one. In addition, the following special cases should be mentioned:

Often, the characters Sh, cTh, cKh, cPh and cFh are simply written as sh, cth, ckh, cph and cfh respectively, with the result that their rendition in the True Type font is not optimal.

The second way to indicate ligatures in transcription files, which is more intuitive for the human reader, is to enclose the connected characters in parentheses. The following table shows examples of this. The script representation in this table is using the capitalisation rule.

EVA        Capitalised EVA   Using TTF
(ao)       Ao                Ao
(ee)       Ee                Ee
(cthh)ey   cTHhey            cTHhey
(ith)      ITh               ITh
(oy)       Oy                Oy
sh (1)     Sh                Sh
(yk)       Yk                Yk

(1) Example of standard though not optimal use.
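The conversion from the parenthesis notation to the capitalisation rule can be sketched as follows. This is an illustrative assumption and not the algorithm of any existing tool: the hypothetical function `parens_to_capitals` simply capitalises every character in a parenthesised group except the last one. It does not know about special cases such as the gallows convention (cTh etc.), so for (cthh)ey it yields CTHhey rather than the cTHhey shown in the table.

```python
import re


def parens_to_capitals(text):
    """Rewrite (abc) as ABc: each connected character except the last is capitalised."""
    def convert(match):
        group = match.group(1)
        return group[:-1].upper() + group[-1]

    return re.sub(r"\(([a-z]+)\)", convert, text)


for sample in ["(ao)", "(ee)", "(ith)", "(oy)", "(yk)", "(cthh)ey"]:
    print(sample, parens_to_capitals(sample))
```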

File format - Stolfi Interlinear

To represent the transcription and to run analysis software using this data, it is highly desirable to have a standardised file format. Such a file should be 'annotated', meaning that there is so-called metadata which is not the text itself, but information about the text. Most importantly, it should indicate for each particular piece of text on which folio or page it is located, and where it is on the page. Additional relevant information may be added. The analysis software should be able to either interpret this information or to ignore it.

In the earliest days of the internet, when the Currier transcription was being extended, a format was agreed upon, and used in >>this historic transcription file. This format has been slightly extended in the course of the transcription by Gabriel Landini and myself. It is now used (with minor modifications) in the main resource: the interlinear transcription file of Jorge Stolfi (7), about which more will be said below. The following table explains the conventions used in the transcription file format:

# (hash) at start of line   The entire line is a comment.
{ ... }                     All text in curly brackets is equally a comment. This may appear anywhere in a line.
< ... >                     This is a locus indicator, and appears at the start of each text segment. It tells where this text is to be found. The format of locus indicators is explained further below.
[ ... ]                     Alternative or uncertain reading. [ao] means it could be an a or an o, but the transcriber is not certain. The options may be separated by a |, which is optional in case there are only two, but obligatory when there are more, e.g. [r|s|d]. If possible, the most likely option should be the first one.
.                           A word space (word separator).
,                           An uncertain word space.
-                           A drawing intrusion in the text.
= at end of line            End of a paragraph.
\ at end of line            The following line is a continuation of this locus.
*                           A single illegible character.
?                           An uncertain number of illegible characters.
!                           A null character (only used in interlinear files, see below).
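To show how analysis software might apply these conventions, here is a minimal sketch that reduces one line of a transcription file to a list of words. It is not the VTT tool or any other existing program. The function name `clean_line` and the sample line are made up for illustration, and treating both the uncertain space and the drawing intrusion as word breaks is an assumption that a real analysis may wish to change.

```python
import re


def clean_line(line):
    """Return (locus, words) for one line, or None for a comment or blank line."""
    if not line.strip() or line.startswith("#"):
        return None
    # Remove inline comments in curly brackets.
    line = re.sub(r"\{[^}]*\}", "", line)
    # Split off the locus indicator, if present.
    m = re.match(r"\s*<([^>]*)>", line)
    locus = m.group(1) if m else None
    text = line[m.end():] if m else line

    # Alternative readings: keep only the first (most likely) option.
    def first_option(match):
        body = match.group(1)
        return body.split("|")[0] if "|" in body else body[:1]

    text = re.sub(r"\[([^\]]*)\]", first_option, text)
    text = text.replace("!", "")           # drop null characters
    text = text.strip().rstrip("=\\")      # drop end-of-line markers
    text = re.sub(r"[,\-]", ".", text)     # assumption: treat both as word breaks
    words = [w for w in text.split(".") if w]
    return locus, words


# Made-up sample line, not taken from any real transcription file.
sample = "<f17r.3;H> daiin.ch[o|a]l,sh[ao]r-{plant}qok!eedy="
print(clean_line(sample))
```

Illegible characters (* and ?) are left in the words, so that a later step can decide whether to discard such words or to count them.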

The locus indicator is defined as follows:

<f17r> without a dot means the start of a new page, in this case f17r. There will be no Voynich text following this, but there may be a comment, with metadata referring to the entire page.

<f17r.n> is the start of line n (normal text written in paragraphs)

<f17r.Cn> is the start of text of a 'special' locus, defined by the character 'C'. The following characters are defined:

L     A label. This is a word (or a few words) near a drawing element, see e.g. f68r1.
C     "Circular" text, i.e. text written around the circumference of a circular drawing element.
R     "Radial" text, i.e. text written along the radius of a circular drawing element.
B     "Blocked" text, i.e. like a normal paragraph, but not part of the main flow of text (e.g. displaced or rotated), see for example f86v3.
S     "Single", like a label, but not specifically near a drawing element.
(P)   Unofficial: used in the Stolfi interlinear file to indicate normal paragraph text.

Stolfi also introduces a second dot between the locus type and the locus number. For interlinear files, where a single locus may be represented several times, as transcribed by different people, the locus indicator is post-fixed by an "author ID", which is a single upper case character. It is inside the < > brackets and preceded by a semicolon, as follows:
<f17r.1;H> is line 1 of f17r transcribed by Takeshi Takahashi.
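The locus indicator lends itself to parsing with a regular expression. The sketch below is an assumption-laden illustration and not code from any existing tool. The names `Locus`, `LOCUS_RE` and `parse_locus` are hypothetical, page names are assumed to follow the pattern seen in the examples on this page (f17r, f68r1, f86v3), and the form <f17r.P.1;H> is inferred from the description of Stolfi's second dot.

```python
import re
from typing import NamedTuple, Optional


class Locus(NamedTuple):
    page: str
    kind: Optional[str]      # locus type letter (L, C, R, B, S, P) or None
    number: Optional[int]    # line or locus number, None for a page header
    author: Optional[str]    # single upper-case author ID or None


LOCUS_RE = re.compile(
    r"^<(?P<page>f\d+[rv]\d*)"        # page name, e.g. f17r or f68r1
    r"(?:\.(?:(?P<kind>[A-Z])\.?)?"   # optional locus type, optional second dot
    r"(?P<number>\d+)"                # line or locus number
    r"(?:;(?P<author>[A-Z]))?)?>$"    # optional author ID
)


def parse_locus(indicator):
    """Split a locus indicator such as <f17r.1;H> into its parts."""
    m = LOCUS_RE.match(indicator.strip())
    if m is None:
        raise ValueError(f"not a recognised locus indicator: {indicator!r}")
    number = m.group("number")
    return Locus(
        page=m.group("page"),
        kind=m.group("kind"),
        number=int(number) if number is not None else None,
        author=m.group("author"),
    )


for text in ["<f17r>", "<f17r.1;H>", "<f17r.L2>", "<f17r.P.1;H>"]:
    print(parse_locus(text))
```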

To (pre-)process a transcription file for statistical analysis, a tool called VTT (Voynich Transcription Tool) was developed. This allows removal of the various types of annotations, conversion between the two types of ligature representations, conversion between &xyz; and high ascii, etc., etc. It also allows selection of pages depending on 'page variables', which is a dedicated type of comment (metadata). Without going into all detail, the following line in a transcription file:
<f1r> {$L=A $H=1 $I=H }
sets the three page variables L, H and I to the values A, 1 and H respectively, meaning: Currier language = A, Currier hand = 1, illustration type = herbal. The tool VTT can be instructed to include or exclude pages with certain variables set to certain values.
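Reading page variables from such a line and selecting pages on them can be sketched as follows. This is not how VTT is implemented, and it does not reproduce VTT's command-line options. The function names `page_variables` and `page_selected` are hypothetical.

```python
import re


def page_variables(line):
    """Return the page variables set in the comments of a page header line."""
    variables = {}
    for comment in re.findall(r"\{([^}]*)\}", line):
        for name, value in re.findall(r"\$(\w+)=(\S+)", comment):
            variables[name] = value
    return variables


def page_selected(variables, **wanted):
    """True if every wanted variable is set to the wanted value."""
    return all(variables.get(name) == value for name, value in wanted.items())


header = "<f1r> {$L=A $H=1 $I=H }"
variables = page_variables(header)
print(variables)
print(page_selected(variables, L="A", I="H"))
print(page_selected(variables, L="B"))
```

All values are kept as strings, so the Currier hand is compared as "1" rather than as a number.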

v101

A few years later Glen Claston equally embarked on a complete transcription of the Voynich MS and devised a transcription alphabet which he called Voynich 101 (here v101 for short). The v101 alphabet was designed to keep stroke combinations that appear to be single signs as single characters. It furthermore distinguishes between several variants of characters that were considered to be one and the same in all previous alphabets. The entire text of the Voynich MS has indeed been transcribed by him using this alphabet. Following is his definition of the alphabet, and allocation to ASCII values.

Also for the v101 alphabet a True Type font has been designed, and like the Eva font it allows high-quality rendition of the Voynich MS text in electronic documents (Word, Excel). Also this font can be downloaded at the site map.

Summary

To summarise, it is worth listing both the significant achievements and the remaining problems.

Achievements

By now, the entire manuscript has been transcribed and is available for numerical analysis. There are several independent transcriptions, and in particular the ones made completely by one person will have good internal consistency.

Problem areas

Transcription of the Voynich MS is particularly difficult, because one cannot read the text. One has no help from context. The biggest problem is to decide whether two almost-similar-looking characters are two different ones, or two representations of the same one, where the variation is just caused by handwriting variations. While transcribing, one is forced continually to make such decisions, and it is unavoidable that these are subjective. A similar problem exists in deciding about word spaces. Sometimes, characters are offset just a bit and it is hard to decide whether there is a space or not. Even the introduction of the comma to represent an uncertain space doesn't completely resolve this problem. The only way out of both these problems is to have these decisions made more objectively, i.e. by a piece of software (OCR), but we are far from being able to achieve this.

A second general problem is the lack of standards. The transcription files all have more or less significant differences in format. There is no place where they are all collected together. Even of the interlinear file there are several versions.

It is basically impossible to repeat analyses done by others since one doesn't know which data they used, and there isn't really any clear way how they could have specified it. The file format(s) used nowadays are far removed from modern representation standards. Already more than a decade ago, Rafał Prinke suggested the use of TEI (Text Encoding Initiative) but it never took off. I will write a bit more about this general topic on the final page of this site.

Location of main transcription resources

The main resource is the interlinear file which was initially prepared by Gabriel Landini, and further maintained by Jorge Stolfi. There is a >>page at Stolfi's site with links to this file in various compressed formats.

A >>tool by Elias Schwerdtfeger allows extraction of individual transcriptions from this file, with many useful options.

Some individual, older resources are:
>>The FSG transcription, prepared by Friedman's First Study Group.
>>The original transcription of D'Imperio and Currier (not complete).
>>The updated version of the same file, reformatted and with extensions by the original mailing list members.

Notes

(1) Reeds (1995).
(2) See Bennett (1976).
(3) Of course, we have no way of knowing whether either assumption is correct or not.
(4) But clearly, not all features of a transcription made directly in Eva would be possible to convert in a reversible way to older transcription alphabets.
(5) Such a font has been designed by Gabriel Landini, and is used extensively at this web site. See the
(6) Gabriel Landini added all of them to the True Type font.
(7) It may be found >>here. It has also been used in the page by page description of the MS.

 

Copyright René Zandbergen, 2017
Latest update: 19/02/2017