
Text Analysis - Transliteration of the Text

Introduction

The main mystery of the Voynich MS is clearly its unknown writing. This topic is addressed from three different aspects, on three (sets of) pages:

This page addresses the second part, the transliteration of the text. It summarises historical transliteration efforts and describes the present state of things. This page is very relevant for those who are interested in the interpretation of the Voynich MS text, but it can be skipped by those who are not.

A section at the bottom of this page explains where one may find the main transliteration files.

Purpose and terminology

The purpose of transliteration of the text is the conversion of the handwritten text of the Voynich MS into a computer-readable format (file). The aim of this is to allow computer software to analyse the text, for example in order to derive statistics or to aid the interpretation and ideally translation of the text.

Already in the 1976 seminar about the Voynich MS led by Mary D'Imperio (1) this process was called 'transcription', and to the best of my knowledge this term has been used ever since in all discussions and publications about the MS. It was also used throughout this web site until May 2019, when the difference between transcription and transliteration was clarified to me by a professional linguist (2). In his words:

Transcription means a transformation of a text such that it will be substituted by another text made by a well-known alphabet of phonetic symbols. This is only possible in case we already know how the text reads.
[...]
Transliteration is a symbol-by-symbol conversion of one script into another. Transliteration is used for practical purposes, but does not represent any hypothesis on the original pronunciation of the source text. The distinctive feature of transliteration is that it preserves the source text and is reversible. The phonetic data could be totally missing, or not yet available, or irrelevant.

The first time that this type of transliteration was exercised was by William Friedman after WWII, for processing by so-called 'tabulating machines'. Many years before that, in the 1930's, Fr. Th. Petersen of the Catholic University of America already made a complete hand-written copy of the MS using a complete photocopy of the MS and in some cases, in order to transliterate difficult parts, the actual MS. This document, in which he added many interesting annotations, is still preserved in the William F. Friedman collection of the Marshall Library in Lexington (Va) (3).

A transliteration should be clearly distinguished from a proposed translation of the text. The only purpose of a transliteration is to represent each handwritten character in the MS by a symbol in a computer-readable file in a consistent manner. It doesn't really matter which symbol is used for which character. Ever since the 1940's, different people have used different conventions, or transliteration alphabets.

History

FSG

The earliest well-known transliteration of large parts of the Voynich MS was made in the 1940's by the so-called First Study Group (FSG) of William Friedman, which was briefly described here. The group was working from photostatic copies of the pages of the MS, and they had defined a transliteration alphabet agreed by all members of the team. Because of the desire for secrecy by Friedman, for a very long time nobody outside his team was aware of this transliteration exercise or the alphabet they used.

The Friedman transliteration was never meant to be a complete transliteration of all text in the MS. They decided to concentrate on the linear text written in paragraphs, and not bother with the complicated diagrams with labels and text written in circles. This is very clear from handwritten annotations made on the source copies saying "don't punch this" or "punch only this".

The FSG transliteration alphabet uses capital letters and numbers. It has a rather unusual method (by later standards) for transliterating the 'intruding gallows', namely by using a special symbol (Z) for the pedestal (ch).

The FSG transliteration effort is described in detail in Reeds (1995) (4). A printout of this transliteration was discovered by Jim Reeds in the above-mentioned Marshall library, and together with Jacques Guy he entered it in computer-readable form. The resulting file is still available for downloading. This transliteration is far from complete for the reasons mentioned above.

Bennett

In 1976 Prof. William R. Bennett of Yale University published a book on the use of the computer to analyse certain properties of language and text (5). In this book, he dedicated a chapter to an analysis of the Voynich MS text, and for this purpose he needed a transliteration alphabet. Bennett's alphabet has not been used outside of this publication. The transliteration was made from photographs and stored on paper tape. As reported by Brumbaugh (6), Bennett was assisted in his Voynich MS analysis by Jonina Duker, who was then a sophomore student (7).

Currier

Roughly at the same time, the cryptanalyst Prescott Currier started a major transliteration effort, using an alphabet he designed independently. This alphabet uses the capital letters A-Z and the numbers 0-9 (i.e. all 36). It does not represent some characters which the FSG alphabet does, and it uses single characters for what appear to be composite characters in the MS. When Currier presented his findings at the 1976 Voynich MS symposium, Mary D'Imperio suggested that it would be important that all researchers use a unified alphabet. She had also started transliterating parts of the MS using her own alphabet, which she abandoned in favour of Currier's, and the two files were merged. The combined transliteration of Currier and D'Imperio, and the transliteration alphabet designed by Currier, have been used well into the 1990's, also in the earliest years of the internet. In the following, the transliteration may be referred to by the abbreviation "C-D".

The following table shows the three above-mentioned alphabets together.

Bennett  FSG    Currier      Bennett  FSG    Currier
D        4      4            I        I      I
O        O      O            IL       IE     G
S        8      8            IIL      IIE    H
G        G      9            IIIL     IIIE   1
Z        2      2            IQ       IR     T
L        E      E            IIQ      IIR    U
Q        R      R            IIIQ     IIIR   0
CT       T      S            U        L      D
ET       S      Z            N        N(*)   N
H        H      P            M        M(*)   M
P        P      B            IM       IIIL   3
K        D      F            -        K      J
F        F      V            -        IK     K
CHT      HZ     Q            -        IIK    L
CPT      PZ     W            -        IIIK   5
CKT      DZ     X            -        (6)    6
CFT      FZ     Y            -        (7)    7
A        A      A            Y        Y      (n)
C        C      C            V        V      (v)

(*) Note: Tiltman used the FSG alphabet, but instead of N and M wrote IL and IIL.

One thing that immediately appears from this table is that the various researchers did not agree on what constitutes a single character in the Voynich MS text.
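The effect of these differing character definitions can be illustrated with a small conversion. The following is a minimal sketch, assuming the Currier and FSG equivalents as read from the table above; the dictionary covers only part of the table, and the function name and the sample string are made up for this illustration.

```python
# Currier -> FSG equivalents, copied from the comparison table above.
# Only part of the table is included; this is an illustration, not a full converter.
CURRIER_TO_FSG = {
    "4": "4", "O": "O", "8": "8", "9": "G", "2": "2", "E": "E", "R": "R",
    "S": "T", "Z": "S", "P": "H", "B": "P", "F": "D", "V": "F",
    "Q": "HZ", "W": "PZ", "X": "DZ", "Y": "FZ",
    "A": "A", "C": "C", "I": "I",
    "G": "IE", "H": "IIE", "T": "IR", "U": "IIR",
    "D": "L", "N": "N", "M": "M",
}


def currier_to_fsg(text):
    """Convert a Currier string to FSG, one Currier symbol at a time."""
    return "".join(CURRIER_TO_FSG[symbol] for symbol in text)


sample = "QOU"  # arbitrary symbols chosen for illustration, not a quotation from the MS
converted = currier_to_fsg(sample)
print(sample, len(sample))        # QOU 3
print(converted, len(converted))  # HZOIIR 6
```

The same stretch of writing counts as three characters in one alphabet and six in the other, so any character statistics depend directly on the alphabet that was used.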

Later, Jim Reeds showed that many characters in the Voynich MS cannot be represented exactly by any of the existing alphabets. There are some 'rare' characters, and in addition there are what appear to be ligatures of several characters. These characters have traditionally been called 'weirdoes', though here I will use the term 'rare characters'. Jim Reeds produced a document with an overview of these rare characters, calling them 'X1' to 'X128'. An excerpt of this little known document is shown below.
 

Transliteration in the age of the WWW

With the emergence of the "World Wide Web" and the start of the Voynich mailing list, transliteration of the parts that were still missing from Currier's and D'Imperio's efforts was continued using Currier's alphabet. A number of characters using lower case were added to this alphabet. These are already included (in parentheses) in the above table. As part of this new transliteration, which will be referred to as "Vnow", a file format for this transliteration file was adopted, about which more will be said further below.

In parallel to this, a completely new type of transliteration alphabet was designed:

Frogguy

The 'Frogguy' transliteration alphabet was designed by Jacques Guy in 1991. It uses lower-case characters, numbers and diacritical marks. Rather than representing complete characters, it represents the 'strokes' of the script of the Voynich MS. As a result of this approach it manages to produce the closest similarity of the transliterated text with the script. At the same time, some characters in the MS which may be assumed to be a single character are represented by several using this alphabet. By its nature, it is the first alphabet that allows a proper representation of many of the ligatured characters that were mentioned above. Jacques Guy also introduced the so-called 'capitalisation rule', which means that a character that is connected to the following one should be represented by a capital letter. The Frogguy transliteration alphabet and its use are explained in more detail at a preserved copy of one of Jim Reeds' web pages. It may be of interest to show an example of this alphabet:

daiin qotedy dar ochol osaror
8aiiv     40qpc89   8a2   octox     osa2o2

An additional important contribution from Jacques Guy was a tool that allowed translation of transliteration files from one alphabet to another. This tool was called BITRANS, was written in Pascal and ran in DOS (it still would). It could handle complex translation rules, such as needed for the Currier alphabet, where, for example, the translation of the character i depends on context.
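The kind of context dependence mentioned here can be sketched with a greedy longest-match conversion. This is not BITRANS and does not use its rule syntax; it is a simplified stand-in in Python, assuming the FSG and Currier equivalents as read from the alphabet table earlier on this page. The function name and the sample strings are made up for this illustration.

```python
# FSG -> Currier rules for a subset of the alphabet table.
# Multi-character FSG sequences map to single Currier symbols.
FSG_TO_CURRIER = {
    "IIIR": "0", "IIR": "U", "IR": "T",
    "IIIE": "1", "IIE": "H", "IE": "G",
    "HZ": "Q", "PZ": "W", "DZ": "X", "FZ": "Y",
    "I": "I", "R": "R", "E": "E", "O": "O", "A": "A", "C": "C",
    "H": "P", "P": "B", "D": "F", "F": "V",
}
MAX_RULE_LENGTH = max(len(source) for source in FSG_TO_CURRIER)


def fsg_to_currier(text):
    """Convert FSG to Currier, always trying the longest matching rule first."""
    result = []
    position = 0
    while position < len(text):
        longest = min(MAX_RULE_LENGTH, len(text) - position)
        for size in range(longest, 0, -1):
            chunk = text[position:position + size]
            if chunk in FSG_TO_CURRIER:
                result.append(FSG_TO_CURRIER[chunk])
                position += size
                break
        else:
            raise ValueError(f"no rule for {text[position]!r} at position {position}")
    return "".join(result)


print(fsg_to_currier("OIR"))  # OT
print(fsg_to_currier("OIA"))  # OIA
```

The character I is converted differently in the two examples: followed by R it becomes part of the single Currier symbol T, while followed by A it remains I.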

The Frogguy alphabet is the closest in appearance to the Voynich MS text, and it solves the main inconsistency of the Currier alphabet, namely that Currier synthesised strings of consecutive i characters into single characters, while this was not done for strings of consecutive e characters (8). However, it has a distinct disadvantage in that it is not very well suited for the type of numerical analysis mentioned at the top of this page, primarily because of its frequent use of the apostrophe, and to a lesser extent the mixture of letters and numbers. No significant transliterations have been made using this alphabet.

When Gabriel Landini and I embarked on a new transliteration of the MS, a project that we named "EVMT" at the time, we decided that a new alphabet, but one based on the principles of Frogguy, would be needed.

Eva

EVA originally stood for European Voynich Alphabet, but later became 'Extensible Voynich Alphabet'. It was designed to be similar to Frogguy, but to use only alphabetical characters. It was designed by Gabriel Landini and myself with important contributions and suggestions from Jacques Guy. Its design was part of a larger scheme which included:

'Basic Eva' is the set of lower case characters that was identified, and 'Extended Eva' includes the representation of all types of rare characters. The basic Eva alphabet was chosen such that the transliterated text is almost pronounceable. The aim of this excellent idea from Gabriel was not to be able to 'speak Voynichese', but to make it very easy for the human brain to recognise and remember transliterated words.

The rare characters can be classified into four categories:

  1. Unusual or rare single characters
  2. Ligatures of only 'Basic Eva' characters
  3. Ligatures including rare single characters
  4. Anything else

These were all identified in the course of the transliteration exercise, which was based on two documents: the Petersen hand copy of the MS already mentioned above (see note 3), and the Yale "copyflo" (11). As it turned out, there were three single characters that might be called unusual because they did not appear in any of the previously defined transliteration alphabets, but which occurred several times in the MS. These three characters: b, u and z were assigned their own 'Basic Eva' letter (b, u and z respectively). The table of Basic Eva is included here:

Individual rare characters, or rare parts of ligatures, were assigned a 'high ASCII' code, and the way to include them into transliteration files is now defined as follows: @185; for ascii code 185 (12). The following image shows all such rare characters (13).

Ligatures of characters may be represented in two ways. The first is the previously mentioned capitalisation rule introduced already by Jacques Guy. As the table of basic Eva above shows, this method allows an accurate representation of the text using the Eva True Type font. A capitalised character always connects to the next one. In addition, there are the following special cases:

Often, the characters Sh, cTh, cKh, cPh and cFh are simply written as sh, cth, ckh, cph and cfh respectively, with the result that their rendition in the True Type font is not optimal.

The second way to indicate ligatures in transliteration files, which is more intuitive for the human reader, is to enclose the connected characters in curly brackets (14). The following table shows examples of this. The script representation in this table uses the capitalisation rule.

EVA Capitalised EVA Using TTF
{ao} Ao Ao
{ee} Ee Ee
{cthh}ey cTHhey cTHhey
{ith} ITh ITh
{oy} Oy Oy
sh 1 Sh  Sh
{yk} Yk Yk

1 Example of standard though not optimal use.
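For the simple rows of this table, the relation between the two notations can be sketched in a few lines. This is a simplified illustration, assuming that every character of a bracketed group except the last one is capitalised. It reproduces the rows {ao}, {ee}, {ith}, {oy} and {yk}, but not {cthh}, which the table renders as cTHh and which depends on the special treatment of the pedestalled characters. The function name is made up for this illustration.

```python
import re


def braces_to_capitals(text):
    """Rewrite each {group} so that all characters but the last are capitalised."""
    def capitalise(match):
        group = match.group(1)
        return group[:-1].upper() + group[-1]

    return re.sub(r"\{([a-z]+)\}", capitalise, text)


for ligature in ["{ao}", "{ee}", "{ith}", "{oy}", "{yk}"]:
    print(ligature, braces_to_capitals(ligature))
```

Text outside the curly brackets is left unchanged, so the function can be applied to complete lines.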

Takeshi Takahashi

The Eva alphabet, primarily in its basic form, found great reception in the community, and it was used by the Japanese Takeshi Takahashi to produce a new transliteration, also based on the Yale "copyflo" (see note 11), which was the first that could be considered essentially complete. He used 'capitalised Eva' to represent all benched or pedestalled characters, e.g. 'Sh' for Sh. He has also updated his transliteration over time, based on reports from users.

The interlinear transliteration file

Based on an initial effort by Jim Reeds, Gabriel Landini collected the important older transliterations mentioned above into a single interlinear file, meaning that they were presented together line by line. They were all converted into Eva, since this conversion is consistent and reversible, as mentioned above.

The Brazilian Jorge Stolfi took this interlinear file and put very significant effort into improving it, adding further transliterations. This included the complete transliteration of Takeshi Takahashi (status of 1999), where he converted the capitalised Eva to lower-case. He also added his own transliterations and partial transliterations from several others, for example John Grove. As a result, this interlinear file has become the de facto source for transliteration data. In the following, it will be referred to as the Landini-Stolfi Interlinear file, with abbreviation "LSI".

v101

A few years later, Glen Claston embarked on a complete transliteration of the Voynich MS and devised a transliteration alphabet which he called Voynich 101 (here v101 for short). The v101 alphabet was designed to keep stroke combinations that appeared to him to be single signs as single characters. Furthermore, it distinguishes between several variants of characters that were considered to be one and the same in all previous alphabets. He transliterated the entire text of the Voynich MS using this alphabet. Following is his definition of the alphabet, and allocation to ASCII values.

A True Type font has also been designed for the v101 alphabet, and like the Eva font it allows high-quality rendition of the Voynich MS text in electronic documents (Word, Excel). This font can also be downloaded at the site map. In the following, this transliteration will be indicated by the abbreviation "GC".

Transliteration file format

For the analysis of the various Voynich MS text transliterations, it is highly desirable that these transliterations are contained in files with a well-defined (standardised) format. Such files should be 'annotated', meaning that they include so-called metadata, which is not the text itself, but information about the text. This metadata should indicate, for example, for each particular piece of text on which folio or page it is located, and where it is on the page. Additional relevant information about the pages or the text may be included. Analysis software should be able to either interpret this information or to ignore it.

In the earliest days of the internet, when the transliteration by Currier was being extended, a format was agreed for this new file. This format was extended in the course of the transliteration by Gabriel Landini and myself. It was further modified by Jorge Stolfi, and used in this final version in the main resource: the Landini-Stolfi interlinear file.

Unfortunately, the differences are quite significant, and the v101 transliteration file uses yet another representation. It is therefore no longer useful to describe the original format definition from the EVMT project.

It is only worth mentioning that all transliterated items were preceded by a locus indicator, which gave some information about where the transliterated item was to be found in the MS. Also the structure of the locus types varied so much between files, and even inside the "LSI" file, that no summary can be provided or would be useful. The "GC" file identifies loci by a line number, or a more descriptive identification in other cases.

Jorge Stolfi also introduced the useful feature in the "LSI" file where a single locus may be represented several times, as transliterated by different people. In this case the locus indicator is post-fixed by an "author ID", which is a single upper case character.

Transliteration tool

To (pre-)process a transliteration file for statistical analysis, I developed a command line tool called VTT (then: Voynich Transcription Tool). This was based on the transliteration file format of the "LZ" file, and it allowed several useful operations on the text, primarily the removal of the various types of annotations, and the selection of pages depending on their properties.

This 'page selection option' relied on the availability of specific comments in the page headers defining 'page variables'. These have been used consistently in the "LSI" and "LZ" files. The tool VTT can be instructed to include or exclude pages with any combination of variables set to certain values. This useful feature has also been included in the "Voynich Information Browser" of Elias Schwerdtfeger, for which see below.

The state of affairs in 2017

To summarise the status of available transliteration resources by the year 2017, there were significant achievements, but also very clear problem areas.

Achievements

Several independent transliterations of the Voynich MS are available. There are three that are almost complete:

A more complete list of transliteration efforts is given below.

Problem areas

Transliteration of the Voynich MS is particularly difficult, because one cannot read the text. One has no help from context. The biggest problem is to decide whether two almost-similar-looking written shapes are two different ones, or two representations of the same one, where the variation is just caused by handwriting variations. While transliterating, one is continually forced to make such decisions, and it is unavoidable that these are subjective. A similar problem exists in deciding about word spaces. Sometimes, characters are offset just a bit and it is hard to decide whether there is a space or not. Even the introduction of the comma to represent an uncertain space doesn't completely resolve this problem. The only way out of both these problems would be to have these decisions made more objectively, i.e. by a piece of software (OCR), but we are far from being able to achieve this.

A second general problem is the lack of standards. As mentioned above, the transliteration files all have rather significant differences in format. There is no place where they are all collected together. Of the interlinear file there are also several versions.

Up until 2017 it was impossible to present the "GC" transliteration in any of the standard formats, because many of the symbols that have a special meaning for annotation in the "LSI" (and the "LZ") file are representations of Voynich MS text characters in the "GC" transliteration. A solution to this problem had to be found.

It is almost always impossible to repeat analyses done by others since one doesn't know which data they used, and there isn't really any clear way in which they could have specified it. The file format(s) used nowadays are far removed from modern representation standards. Already more than a decade ago, Rafał Prinke suggested the use of TEI (Text Encoding Initiative) but it never took off. There are a few more comments about this general topic on the final page of this site, and a more detailed description of the problem and its resolution is provided on a separate page:

Next-generation transliteration of the text

New initiative

New transliteration file format

In order to be able to represent all existing transliterations in their own original transliteration alphabet, using a single file format, a number of issues had to be resolved, primarily related to the character set of the v101 alphabet, which was shown above.

The result of this exercise was the definition of a new format, which I have called IVTFF (Intermediate Voynich Transliteration File Format).

The following table explains the conventions used for this format.

The complete format description (version 1.6 of November 2019) may be found here.

# (hash) at start of line The entire line is a comment.
<f ... > With the < character in the first position of the line. This is a locus indicator, and appears at the start of each text segment. It explains where this text is to be found. The format of locus indicators is explained further below.
<! ... > With the < *NOT* in the first position of a line. All text between the delimiters is a comment. This may appear anywhere in a line.
. A word space (word separator).
, An uncertain word space.
[ ... ] Used for alternative or uncertain reading. [ao] means it could be an a or an o, but it is not certain. The options may be separated by a : which is optional in case there are only two, but obligatory when there are more, e.g. [r:s:d]. If possible, the most likely option should be the first one.
{ ... } A ligature of standard characters. Only used with the Eva and Frogguy alphabets
@nnn; A high-ascii code. nnn must be in the range 128 to 255
<-> A drawing intrusion in the text.
<$> at end of line. End of a paragraph.
/ at end of line This locus is continued on the next line in the file. That line must also have a / in the first position.
? A single illegible character.
??? An uncertain number of illegible characters.
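To illustrate how these conventions might be handled by analysis software, the following is a minimal sketch that reduces the text part of a line (everything after the locus indicator) to a list of words. It is not the IVTT tool and does not implement the complete format description. The function names are made up, the sample line is invented, and two choices are assumptions of this sketch: a drawing intrusion is treated as a word break, and for alternative readings the first option is kept.

```python
import re


def ivtff_words(text, uncertain_space_splits=True):
    """Reduce the text part of one line to a list of words."""
    text = re.sub(r"<![^>]*>", "", text)   # drop inline comments
    text = text.replace("<->", ".")        # assumption: drawing intrusion acts as a word break
    text = text.replace("<$>", "")         # drop the end-of-paragraph marker

    def first_option(match):
        body = match.group(1)
        # [r:s:d] keeps r; [ao] (two options, no separator) keeps a
        return body.split(":")[0] if ":" in body else body[0]

    text = re.sub(r"\[([^\]]+)\]", first_option, text)
    text = re.sub(r"[{}]", "", text)       # keep ligature characters, drop the brackets
    text = text.replace(",", "." if uncertain_space_splits else "")
    return [word for word in text.split(".") if word]


def rare_codes(text):
    """Return all @nnn; codes in the text, checking the range 128 to 255."""
    codes = [int(digits) for digits in re.findall(r"@(\d{3});", text)]
    for code in codes:
        if not 128 <= code <= 255:
            raise ValueError(f"code {code} is outside the range 128 to 255")
    return codes


line = "qokal,dar<!a note>.[ao]kaiin<->ch{ee}dy.s@185;r?<$>"  # invented sample
print(ivtff_words(line))
print(ivtff_words(line, uncertain_space_splits=False))
print(rare_codes(line))
```

The second call shows the effect of the uncertain word space: depending on how the comma is interpreted, the same line yields a different number of words.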

There are three types of locus indicators:

<f17r> A locus indicator without a period means the start of a new page, in this case f17r. There will be no Voynich MS text following this, but there may be a comment, with metadata referring to the entire page.
<f17r.N,@Ab> This is the start of a piece of text of locus type 'Ab'. The value of N must increase monotonically, starting at 1, for each page. The meaning of @ is explained in the detailed format description. The locus type Ab is as in the table shown under the heading 'Inventory of the transliterated text', below.
<f17r.N,@Ab;T> Same as above, but identifying the source of the transliteration (a person or a group) by the character T.
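The three forms of locus indicator can be recognised with a single regular expression. The following is a minimal sketch based only on the three examples above, not on the detailed format description. The names Locus, LOCUS_PATTERN and parse_locus are made up, and the sketch assumes that the position shown as @ holds exactly one character and that the source is identified by a single letter or digit.

```python
import re
from typing import NamedTuple, Optional


class Locus(NamedTuple):
    page: str
    number: Optional[int]
    locus_type: Optional[str]
    source: Optional[str]


# Assumption: page names start with 'f' followed by letters and digits, e.g. f17r.
LOCUS_PATTERN = re.compile(
    r"^<(?P<page>f[0-9A-Za-z]+)"
    r"(?:\.(?P<number>\d+),(?P<flag>.)(?P<type>[A-Z][0-9a-z])"
    r"(?:;(?P<source>[A-Za-z0-9]))?)?>"
)


def parse_locus(line):
    """Return the locus at the start of a line, or None if there is none."""
    match = LOCUS_PATTERN.match(line)
    if match is None:
        return None
    number = match.group("number")
    return Locus(
        page=match.group("page"),
        number=int(number) if number is not None else None,
        locus_type=match.group("type"),
        source=match.group("source"),
    )


print(parse_locus("<f17r>"))
print(parse_locus("<f17r.3,@P0>"))
print(parse_locus("<f17r.3,@P0;T>"))
print(parse_locus("# a comment line"))
```

A page header yields only the page name, and lines that do not start with a locus indicator yield None, so that comment lines and continuation lines can be handled separately.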

Inventory of the transliterated text

As mentioned above, three nearly complete transliterations of the text exist. To establish the coverage of the Voynich MS text by each of them, I have compared all three with each other, and with the digital images of the MS. This showed that there are some minor inconsistencies in counting the different loci, which could be resolved.

The analysis showed that the "LZ" transliteration, which benefited strongly from the work of Theodore Petersen, is most complete. It lacks only:

I have created a new file which includes only my own part of the LZ transliteration effort, and updated it to include all 5389 loci. This will be referred to as the "ZL" file (Zandbergen-Landini) in the following, and I acknowledge the important influence of Gabriel Landini in creating my own transliteration.

All loci have been distributed over different types, which are classified by two characters. Following is the distribution of the loci over these types, for the entire MS (15).

Type Subtype Meaning Count
P - All running text in paragraphs 4130
- P0 Standard left-aligned text 3885
- P1 Lines indented by half a page due to drawings or other text 72
- Pb Dislocated text in free-floating paragraphs 134
- Pc Centred lines in normal paragraphs 17
- Pr Right-justified lines in normal paragraphs 12
- Pt Titles in normal paragraph text 10
L - All labels and dislocated words or characters 1033
- L0 Individual words or characters not near a drawing element 297
- La Labels of astronomical or cosmological elements which are not stars or zodiac labels 7
- Lc Labels of containers in the pharmaceutical section 40
- Lf Labels of herb fragments in the pharmaceutical section 195
- Ln Labels of nymphs in the biological/balneological section 63
- Lp Labels of full plants in the herbal section 3
- Ls Labels of stars 76
- Lt Labels of 'tubes and tubs' in the biological/balneological section 47
- Lx Individual pieces of 'external(?)' writing 6
- Lz Labels of zodiac elements 299
C - All writing along circles 84
- Ca Anti-clockwise writing along circles 1
- Cc Clockwise writing along circles 83
R - All writing along the radii of circles 142
- Ri Inwards writing along the radii of circles 75
- Ro Outwards writing along the radii of circles 67
Total     5389
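The internal consistency of this table can be checked with a few lines of code. The following sketch simply copies the counts from the table and verifies that the subtypes add up to the total of each type, and the types to the overall total; the variable and function names are made up for this illustration.

```python
# Counts copied from the table above: stated total per type, then the subtypes.
LOCUS_COUNTS = {
    "P": (4130, {"P0": 3885, "P1": 72, "Pb": 134, "Pc": 17, "Pr": 12, "Pt": 10}),
    "L": (1033, {"L0": 297, "La": 7, "Lc": 40, "Lf": 195, "Ln": 63,
                 "Lp": 3, "Ls": 76, "Lt": 47, "Lx": 6, "Lz": 299}),
    "C": (84, {"Ca": 1, "Cc": 83}),
    "R": (142, {"Ri": 75, "Ro": 67}),
}
STATED_TOTAL = 5389


def check_counts():
    """Return True if all subtype sums and the overall sum match the table."""
    for stated, subtypes in LOCUS_COUNTS.values():
        if sum(subtypes.values()) != stated:
            return False
    return sum(stated for stated, _ in LOCUS_COUNTS.values()) == STATED_TOTAL


print(check_counts())  # True
```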

To pre-process all files in IVTFF format, the VTT tool has been modified and called IVTT.

Voynich Transliteration Tool

Location of main transliteration resources

Following are links to the original transliteration resources that are known to me, and to the new copies in IVTFF format. They are presented roughly in chronological order.

Code Description Alphabet Original Copy in IVTFF Notes
FSG A copy of the FSG transliteration, created by Friedman's First Study Group, in the format prepared by Jim Reeds and Jacques Guy. FSG >>Link beta version 0d (16)
C-D The original transliteration of D'Imperio and Currier Currier >>Link beta version 0b (17)
Vnow The updated version of the C-D transliteration, made during the earliest years of the Voynich MS mailing list. Currier >>Link Temporarily unavailable ! (18)
TT The original transliteration by Takeshi Takahashi. Eva At a >>set of separate pages at his web site. 1999 version (19)
LSI The Landini-Stolfi Interlinear file Eva At a >>page at Stolfi's site with links to this file in various compressed formats. beta version (23/08/2017)
GC The v101 transliteration file. v101 Local copy. original locus order; sorted (20)
ZL The "Zandbergen" part of the LZ transliteration effort. Eva - version 1d (21)

The Voynich Information Browser by Elias Schwerdtfeger allows extraction of individual transliterations from the "LSI" file, with many useful options.

Notes

1
See D'Imperio (1976).
2
I am grateful to prof. Artemij Keidan of the Sapienza University in Rome. His interesting paper refuting the incorrect proposed solution to the Voynich MS by G. Cheshire included clear statements about this terminology, which he was later able to clarify to me in great detail.
3
It is contained in items 1615.1 and 1615.2 of the William F. Friedman collection. For more about the Marshall Library, see here.
4
See Reeds (1995).
5
See Bennett (1976).
6
See Brumbaugh (1978).
7
I am grateful to Jonina Duker for her recollections of these events of 40 years ago.
8
Of course, we have no way of knowing whether either assumption is correct or not.
9
On the other hand, it should be clear that not all features of a transliteration made directly in Eva can be converted into the older transliteration alphabets in a reversible manner.
10
Such a font has been designed by Gabriel Landini, and is used extensively at this web site. See the site map.
11
A black-and-white printout of a microfilm made in the 1970's, see also here.
12
This used to be &185; but in the 2017 conversion exercise described in the following this has been changed. See also here.
13
Gabriel Landini added all of them to the True Type font.
14
This used to be between parentheses but in the 2017 conversion exercise described in the following this has been changed. See also here.
15
See also the page about the writing system.
16
Extracted from the LSI file and converted back to the FSG alphabet. Not yet fully checked against the original.
17
Converted from the original.
18
Extracted from the LSI file and converted back to the Currier alphabet. It turns out that the LSI file was incomplete, so this has to be re-done. Important, since this file was used by Reddy and Knight, and part of it by Hauer and Kondrak.
19
The 1999 version has been extracted from the LSI file. The current version still has to be processed.
20
The file has been verified. The only change in locus order is related to the page order, i.e. the rosettes page has been moved to lie between f85r2 and f86v4.
21
Version 1c of 20/11/2019 has all known errors fixed and supports file format 1.6. Versions before 0h had most long circular texts corrupted.

 

Copyright René Zandbergen, 2019
Comments, questions, suggestions? Your feedback is welcome.
Latest update: 14/12/2019