
Special Topics: Transcription of the text

Next-generation transcription of the text

Introduction

The main page related to text transcription describes the presently available transcription systems and files, and a bit of their history. In order to fully understand the present discussion, the reader should be familiar with its contents. That page shows that the available transcriptions are based on concepts of several decades ago. They use simple text files which allow very easy parsing by simple tools, which is convenient, but not in line with modern representation standards. It was already suggested on that page that a new generation of transcription is due, and a similar feeling was expressed in the page describing my views. In the following this is discussed in more detail.

I would like to address one potential misunderstanding right from the beginning. The existing transcriptions are not "bad". Under the simple assumption that the text of the Voynich MS should be a plain text encrypted using some standard method of the 15th century, it would have been solved easily using even the earliest transcriptions made by Friedman or by Currier. That this did not happen is not the fault of the transcriptions, but of the assumption.

The purpose (in my opinion) of a new transcription effort is to have the best possible data available in a state-of-the-art manner, and capturing much more necessary and useful information than the present transcriptions do. This is independent of the particular approach that one might choose for attacking the text.

The source of all transcriptions

Our primary source is the physical MS itself. This is, however, not easily accessible. What's more, even if we had some time to study it, we would not be able to do any significant text analysis, since we cannot search the text without leafing through the entire book.

The first derivatives

The first level of abstraction is given by analog pictures of the MS, such as, for example, in the recent photo facsimiles (1). With these, we lose something, namely the interaction with the physical object, and everything that we can only observe from the original. We also gain something, namely ease of access. We can have the book at home and browse it anytime. At the same time, searching the MS is still only possible by leafing through the entire book.

The second level of abstraction is provided by the digital images of the MS. With these, we gain a few further advantages beyond those of the printed facsimiles.

The next level

The next level of abstraction is given by the electronic text transcriptions, but I won't call this the third level. As will become clear in the course of this page, I would consider this the fifth level of abstraction.

At this level of abstraction we both lose and gain enormously. What we gain is the fact that we can search the text, both manually and in automated processes, and we can generate statistics, indexes etc. What we lose may be summarised as follows:

  1. We lose a lot of information that may be collected under the header 'layout'. We no longer see the handwriting characteristics, the spacings of the lines, whether characters or words are far apart, etc., etc.
  2. We lose a lot of distinctions in the written characters that may well be relevant. Every transcriber makes many decisions that are irreversible, typically when deciding that two similar-looking character forms were intended to be one and the same, and are therefore transcribed in the same way.

A third problem inherent to transcriptions of the Voynich MS text is a combination of these two points: the uncertain definition of word boundaries or word spaces. These word boundaries are based on subjective decisions by the transcribers, derived from the layout of the text.

Some initial considerations

The main transcriptions that are available do not all capture the same information. Those that have been collected in the interlinear file include some useful meta-data, i.e. descriptive data about the text rather than the transcription itself. These data identify, among other things, the location in the MS of each transcribed piece of text, but also provide additional information. All transcriptions in the interlinear file use the Eva transcription alphabet, but without the special extensions.

The v101 transcription by GC is likely to be the most accurate and consistent one, but it has a less standard location identification and lacks all other meta-data. It uses a different transcription alphabet, and for both historical and practical reasons it is not included in the interlinear file.

This last point is worth a small initial discussion; it will be addressed more fully later. In principle, there is no reason why all transcriptions in the interlinear file should use the same transcription alphabet. It should be possible to identify the alphabet used for each entry in one way or another. That Eva was used for all of them is due to the fact that Eva was defined such that all existing transcriptions could be converted to it without loss of information, and in a reversible manner. For that reason there was no need to maintain the original alphabets, but these alphabets can still be used. The same is not true for v101. A rendition into Eva might be possible using the extended characters, but this is not certain. It should in principle be possible to define a superset that completely includes both v101 and extended Eva, but it is also not certain that this is the best way forward. This will have to be addressed at some point.

The need to generate a better transcription was also discussed by Nick Pelling in this blog post, with comments from several contributors. The name or title "EVA 2.0" used there is not too appropriate in my opinion, but I take it to be more of a catch phrase than a serious proposal.

Making a new transcription is a very significant effort, and it is only worth doing if it really improves upon what is available now. It should preserve the advantages we have now, solve as many of the existing problems and shortcomings as possible, and bring additional advantages.

Exemplifying the problem

To judge the effort in making a new transcription, it would be useful to know how many characters there are in the written text of the MS.

Do we know the answer to this very simple question? No, we don't.

After one hundred years of statistical analyses, the answer to this simple question is unknown. What's worse, if four people were asked today to make a count, they would come up with five different answers.

Partly, that is understandable. Which of the transcription files is complete? (Hint: none of them are.) Furthermore, there is no agreed definition of the character set, and the number depends on this definition. Yet even if this number cannot be determined definitively, it is important that two people making a count based on the same assumptions end up with the same answer. They should be able to state their assumptions in such a way that someone else could repeat the count and arrive at the same result. Let me therefore insert an intermezzo related to standards and conventions.

Intermezzo: about standards and conventions

Why are they important?

Standards and conventions facilitate collaboration. They allow people to exchange data and results, and for everybody to 'talk the same language'. They allow verification.

This should be completely obvious, but let me just give a small example. Imagine that two people have made their own transcriptions of part of the Voynich MS text. Now imagine that someone wants to compute the entropy values of these two transcriptions. He has a tool to do that. He would need to read and process the two transcriptions, which is easy (and reliable) if both use the same file format, following some convention. If they do not follow the same convention, the work is more difficult, because the tool has to be modified to read one or the other. As a consequence it is also less reliable, since there could be a bug that affects one and not the other. Next, suppose a second person also has a tool to compute entropy. This could be used as a cross-check (verification). This only works well if there is a standard format that all developers of tools can rely on.
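As a minimal sketch of such a cross-check, the short Python fragment below computes the first-order character entropy of a transcription. It assumes a hypothetical plain-text input with one transcribed line per row and '.' as word separator; the file name is made up for illustration, and this is not one of the tools referred to above.

import math
from collections import Counter

# First-order character entropy of a transcription file (sketch).
# Assumes one transcribed line per row and '.' as word separator.
def char_entropy(path: str) -> float:
    counts = Counter()
    with open(path, encoding="utf-8") as f:
        for line in f:
            counts.update(ch for ch in line.strip() if ch != ".")
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

# Two people running the same definition on the same file should obtain
# the same number; any difference then points to the input format or tool.
print(round(char_entropy("transcription_A.txt"), 3))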

Do we have any, and if not, why not?

There are numerous examples of conventions related to the Voynich MS. In the days of Friedman, pages of the Voynich MS were identified by the so-called Petersen page numbers. We still find them, e.g., in Currier's paper (Table A). Nowadays, the standard way to refer to the pages of the MS is by its foliation. The latter is used in many web resources, and in the transcription files.

Use of the Eva transcription alphabet in discussions about the Voynich MS is an example of a de facto convention. It was not the result of a committee decision that was enforced through some document, but a personal effort of two people (Gabriel Landini and myself) that was considered useful and was generally adopted.

The naming of the various sections in the Voynich MS ("herbal", "biological", "pharmaceutical") is another example that is useful to illustrate some points. Not all people like all of these terms, and as a result, different people use different names for the same thing. While, fortunately, this is irrelevant for the chances of translating the Voynich MS text, it makes communication cumbersome. The point to be made here is that conventions do not need to represent the most accurate way to name or describe things. What matters is that they are usable and that they are generally accepted.

In 'real-life' international collaboration activities, committees or bodies are set up to discuss and agree on standards and conventions, and a majority decision is made. After that, even those who held a different opinion adopt the standard. This is the process that makes collaboration possible. In the world of the Voynich MS, there are no such committees, and everybody is happy to do their own thing. Progress is made by individuals who are not 'talking' but 'doing'. The Eva alphabet, the interlinear transcription file and GC's v101 transcription (2), the Jason Davies Voyager (3) and the text/page browser at voynichese.com (4) are very good examples of this.

Classification of the problems to be resolved

I believe that all improvements that can be made to the present state of Voynich MS transcription can be classified largely into two groups:

  1. How to properly describe the transcribed text in the MS
  2. How to properly store and annotate this information for easy and consistent retrieval

(1) How to describe the text in the MS

The most frequently discussed point in this area is the question of what the best transcription alphabet should look like. A large part of the main page is devoted to describing the various historical transcription alphabets, and some of their pros and cons. It is not certain (at least not to me) that there should be one 'best' and commonly agreed transcription alphabet to be used by everybody.

The main problem is the subjectivity that should be removed, or at least minimised, from the transcription. This is a very complex issue, and in a way it comes down to the definition of the transcription alphabet. It is not the question whether a particular character shape should be transcribed as "d" or "8"; that is completely irrelevant. The real problem to be solved is how to group similar-looking characters into a single transcribed character.

Looking at this problem in a very abstract manner, one could argue that in principle all characters in the MS are different, since they have been written by hand. This is of course a completely useless way to transcribe the text, but it defines the starting point for the definition of any transcription alphabet:

In principle, every character in the MS could be described by a small graphics file extracted from the digital images. Whether this is a good solution in practice may be doubted, but at least it presents a (hypothetical) additional level of abstraction in between the digital scans and the transcribed text. It may be called the third level of abstraction.

Since processing digital images (e.g. in an OCR process) is not yet practically possible, another intermediate level of abstraction could be introduced, namely that of a transcription 'super alphabet'. This would be an alphabet that captures all subtle differences between characters, resulting in a very large 'alphabet'. Without claiming that this is the only or even the best solution, I imagine such a super alphabet to be organised into 'character families', i.e. (using Eva just for illustration purposes) a family of all y-like characters, a family of all r-like characters, one of all sh-like characters etc. etc. It could consist of two identifiers for each character: the family identifier and the 'family member' identifier.

Such an alphabet would not be suited for practical statistical analyses, but it would allow researchers to define, use and experiment with their own alphabets. How one would arrive at this alphabet from the digital images is of course not yet clear. At least theoretically, this can be the fourth level of abstraction.
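As an illustration only, and without proposing any particular family names or numbering, such a super alphabet could be represented as a pair of identifiers per character, with researcher-defined alphabets obtained by collapsing families onto working symbols. A possible sketch in Python (all names invented for this purpose):

from dataclasses import dataclass

# One 'super alphabet' character: a family plus a member within that family.
# The family names and member numbers here are invented for illustration.
@dataclass(frozen=True)
class SuperChar:
    family: str   # e.g. "y-like", "r-like", "sh-like"
    member: int   # which variant within the family

# A researcher-defined alphabet is then a mapping that collapses families
# (or selected members) onto the symbols of a working alphabet such as Eva.
def to_working_alphabet(ch: SuperChar, mapping: dict[str, str]) -> str:
    return mapping.get(ch.family, "?")

eva_like = {"y-like": "y", "r-like": "r", "sh-like": "sh"}
print(to_working_alphabet(SuperChar("sh-like", 3), eva_like))   # prints: sh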

(2) How to store the transcription data

Issues to be resolved here may be developed from the points raised above:

  1. There is a need for a standardised format for representing all past and future transcriptions
  2. Layout information needs to be added, namely the precise location of each text element in the MS, and the handwriting characteristics: character size, writing direction, slant angle (if possible).

Eventually, and ideally, all information should be contained in an on-line database, with query tools to extract what one would need, in well-defined formats. Thus, when I wrote, near the top of this page, that standards like 'TEI' should be supported, I see this as an output from a query to this database.

The definition of such a database is a task that requires preparation. It is not even clear exactly what it should contain. Generally speaking, all tools and products that are available now should be able to work on the basis of it: for example, any transcription file should be an extract of it, but also tools like the Jason Davies voyager (see note 3) or the tool known as 'voynichese.com' (see note 4). The main thing that is presently not clearly defined is the way to identify the exact location on a page of any text item. The Jason Davies voyager has implemented one method, and 'voynichese.com' must also have implemented a solution for this.
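To make the discussion more concrete, the following is a sketch, under my own assumptions, of what a single record in such a database might hold. All field names, the image identifier and the pixel coordinates are hypothetical; in particular, any bounding box only has meaning with respect to a clearly identified set of digital images.

from dataclasses import dataclass

# Hypothetical record layout for one transcribed locus (sketch only).
@dataclass
class LocusRecord:
    folio: str            # e.g. "f88r"
    locus: int            # locus number within the page
    locus_type: str       # e.g. "Lc" for a container label
    transcriber: str      # e.g. "ZL", "GC"
    alphabet: str         # e.g. "Eva", "v101"
    text: str             # the transcribed text itself
    image_id: str         # identifies the digital image the coordinates refer to
    bbox: tuple[int, int, int, int]   # (x, y, width, height) in pixels, hypothetical

record = LocusRecord("f88r", 1, "Lc", "ZL", "Eva",
                     "otorchety", "some_image_set", (800, 440, 130, 30))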

Presently, the information that should go into this database is collected in several different files of similar, yet not consistent formats. A first step that is needed is to consolidate all this information in a common format based on consistent definitions, and without loss of information.

A new transcription file format

Outline of the issue

As discussed in the main page, the existing transcription files are represented in various different formats, and no single format is capable of representing all of them. Defining a common format for all transcriptions goes a long way towards solving the second of the two main issues identified here. If all existing historical transcription files can be represented in a single format, it means that all can be accessed by the same tools. It will also allow the ingestion of all transcriptions into a common database at a later stage, using a single tool.

To achieve this, the following steps have been taken.

Inventory of existing transcription files

Three nearly complete transcriptions of the text exist, but how complete are they, really? To answer this question, I have compared all three with each other, and with the digital images of the MS. This showed first of all that there are some inconsistencies in counting the different loci. The following table presents the total counts:

Code Description Locus count
LZ The Landini-Zandbergen transcription (formerly: EVMT). 5342
LSI The Landini-Stolfi Interlinear file 5362
GC The v101 transcription file by Glen Claston 4486

It appeared that none of the transcriptions was really complete, and some loci were not included in any of them. Furthermore, there are several inconsistencies between the definitions of individual loci in these files. The much lower number in the GC file is due to the fact that GC grouped many individual labels together under single loci, to which he gave long descriptive names. The main difference between the LZ and LSI files is also related to the different grouping of multi-word labels. These differences are addressed in more detail below, in order to arrive at a common definition, where possible based on a majority, taking into account also the older transcription efforts initiated by Friedman and Currier.

Locus adjustments

Following are the major inconsistencies between the definitions of individual loci in the main transcription files, and the way these have been resolved.

  1. As a first general rule, only text in the Voynichese script will be rendered in transcription files. All text in more or less readable writing using the Latin alphabet shall be contained in comments.
  2. As another general rule, no item spanning more than one line shall be contained in a single locus. Each line must have a separate locus ID. This affected several sets of labels in the biological section, mainly in the LZ transcription.
  3. The so-called 'titles' (for which see here) have been transcribed in some cases (FSG and LSI) as separate loci, and in other cases (LZ, GC) combined with the rest of the line. It was decided to make them separate loci, but to include a code in the locus ID that allows a user to recombine them if preferred.
  4. Some pages have text interrupted by large drawings, and the alignment of the text to the left and right of these is so poor that one can draw different conclusions as to how the lines should be combined. The most obvious case is given by the last 6 or 7 lines of the first paragraph on f34r. In LZ it was assumed that the continuation should be at the same height on the page, while LSI and GC assumed an oblique alignment. The majority vote was adopted, and it was decided to introduce a dedicated comment that flags such poor vertical alignment.
  5. The most difficult page for locus identification turned out to be f75v, which has three separate issues, one of which (labels split over two lines) was already resolved above. Another issue is the set of detached individual characters after the first 5 lines of the text. It is not even clear whether this should be 5 or 6, but it was decided to follow the majority and count these separately. The same code in the locus ID as mentioned above (point 3) can be used to recombine these loci if desired. The third issue is the set of words and line fragments surrounding the pool at the bottom of the page. A majority solution could be adopted here.
  6. For several of the sequences of words or characters in the margins of the MS, different locus orderings would appear to be appropriate. A compromise between consistency and accuracy was required. When the individual items are clearly aligned with the text (as e.g. on f49v), they are counted in horizontal order. Otherwise (e.g. f66r), they are counted vertically.
  7. There were some minor issues in counting labels in the pharmaceutical section.

Locus types

In the first transcription files, loci were simply numbered incrementally from 1, but these did not include transcriptions of special items like labels, circular text etc. In the frame of the EVMT activity, where these were all included, labels were prefixed by L, circular and radial text by C and R, individual words that were not near a drawing element by S and 'blocked' text (consecutive lines that were not part of the standard paragraphs) by B.

This usage was extended by Stolfi in the LSI file, but unfortunately not in a very consistent manner, which was largely due to the many different transcribers involved. Standard paragraph text was prefixed by P, but sometimes also P1, P2 etc. if the different paragraphs could be counted easily. Thus, the first line could be P.1, P1.1 or P.1.1 depending on the page. Other codes were introduced, covering half the alphabet and also some lower-case letters. L sometimes meant 'Label' and sometimes 'left'. This still affects online tools that extract text from the interlinear file.

A new convention, more similar to EVMT, has been introduced, where the locus type is characterised by one of four main types (upper case characters), and a longer list of subtypes (second lower-case character or number), as follows.

Main type Description
P Text in paragraphs. The second character gives more details. The original paragraph text has type P0, whereas the old 'blocked' locus type becomes Pb. Other types identify centred or right-justified text, or titles.
L Labels. The second character specifies whether it is a label of a herb, zodiac element, nymph, etc. etc., and it is L0 if this is not near any drawing element.
C Circular text. Cc for clockwise and Ca for anti-clockwise.
R Radial text. Ri for inward writing and Ro for outward writing.

All details are provided in the transcription file format description.
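As a small illustration (not a replacement for the format description), the locus identifiers of the form <f88r.1,@Lc> that appear in the examples further down this page can be split into folio, locus number and locus type with a few lines of Python. Only that particular pattern is handled here; the character before the type (@ or + in the examples) is captured but not interpreted.

import re

# Matches locus identifiers like <f88r.1,@Lc> or <f89r2.30,+Lc>.
# Page headers such as <f88r> do not match and are simply skipped.
LOCUS_RE = re.compile(r"<(?P<folio>f\d+[rv]\d*)\.(?P<num>\d+),(?P<sep>.)(?P<type>[A-Za-z]\w*)>")

def parse_locus(tag: str):
    m = LOCUS_RE.match(tag)
    if m is None:
        return None
    return m.group("folio"), int(m.group("num")), m.group("type")

print(parse_locus("<f88r.1,@Lc>"))     # ('f88r', 1, 'Lc')
print(parse_locus("<f89r2.30,+Lc>"))   # ('f89r2', 30, 'Lc')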

The inventory

Based on the above rules, the total number of loci in the MS could be established. After adjusting the loci in each of the files according to the rules listed above, the completeness of each transcription file can now be presented in terms of the number of loci covered. In addition, I extracted my own transcription from the LZ combined effort, and will call it ZL in the following. The few loci that were missing in LZ were added to this file.

Code Description Locus count Percentage
FSG The FSG transcription. 4074 75.6
C-D The original transcription of D'Imperio and Currier 2195 40.7
Vnow The updated version of this file, made during the earliest years of the Voynich MS mailing list. 2550 47.3
TT The original transcription by Takeshi Takahashi. 5216 96.8
LSI The Landini-Stolfi Interlinear file 5341 99.1
LZ The unpublished Landini-Zandbergen transcription 5382 99.9
GC The v101 transcription file. 5367 99.6
ZL The "Zandbergen" part of the LZ transcription effort. 5389 100

Format adjustments

The new format is based on minimal adjustments to the conventions used by Gabriel Landini and myself in the frame of our transcription efforts in 1998-1999, which also went into the format definition of the interlinear file. Issues to be resolved are primarily related to the character set of the v101 alphabet, which was shown here. Problems exist with the following characters:

Char Clash Solution
( This is a character in v101, so it cannot be used as a ligature marker. Use { } for ligature markers, and <! > for in-line comments.
& This is a character in v101, so it cannot be used to represent high-ascii codes. Use @ for high-ascii codes.
* This is a character in v101, so it cannot be used to mark an illegible character. Use ? for illegible characters.
\ This is a character in v101, so it cannot be used to indicate line wrapping (continuation of a locus on the next line). Use / for this purpose.
| This is a character in v101, so it cannot be used to separate alternative options between [ ] brackets. Use : for this purpose.
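As a simple illustration of these conventions (and no more than that; the IVTT tool described below handles the full format properly), the following Python fragment strips the new-style markers from a transcribed word: it removes <! > in-line comments, drops the { } ligature markers, and keeps the first option of an [a:b] alternative.

import re

def strip_markers(word: str) -> str:
    word = re.sub(r"<![^>]*>", "", word)                   # in-line comments
    word = re.sub(r"\[([^:\]]*):[^\]]*\]", r"\1", word)    # [a:b] -> a (first option)
    return word.replace("{", "").replace("}", "").strip()  # ligature markers

print(strip_markers("otold[:y]"))              # prints: otold
print(strip_markers("<!container>koeeorain"))  # prints: koeeorain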

Progress

In order to store and present the results of the on-going activity, I have set up a dedicated area at this web site where I am gradually collecting all information related to it. This includes the format definition of the 'Intermediate Voynich Transcription File Format' (IVTFF).

The main directory has a number of files that are described briefly in the ReadMe file.

In addition, there is a subdirectory beta that has files that are only at beta-testing level, and which are also described briefly in a ReadMe file.

With the availability of these files, and a useful tool that is described below, it is now also possible to estimate the total number of characters, words and different words in the MS.

How many characters, how many words?

The result may be summarised as follows:

Characters: ~ 160,000 - 165,000
Words: ~ 38,000
Different words: ~ 9,500
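These figures depend entirely on the counting rules that are adopted, which is exactly the point made earlier about stating one's assumptions. As a minimal sketch of a repeatable count, assuming a hypothetical plain-text file that contains only transcribed text (one locus per line, words separated by spaces or by '.'), such as IVTT output with the 'dressing' removed:

import re

# Count characters, words and different words in a stripped transcription.
# The file name is hypothetical; uncertain characters and spaces are counted
# as they appear, which is one of the assumptions that must be stated.
def count_text(path: str):
    words = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            words.extend(w for w in re.split(r"[ .\t]+", line.strip()) if w)
    chars = sum(len(w) for w in words)
    return chars, len(words), len(set(words))

chars, words, distinct = count_text("voynich_plain.txt")
print(f"characters: {chars}, words: {words}, different words: {distinct}")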

Voynich Transcription Tool

VTT

As mentioned in the main page, in the course of the EVMT activity I developed a tool to pre-process transcription files in the format defined by Gabriel and myself. I have made very extensive use of this tool, among others to do the analyses presented here, here and here, but since the transcription was never published, it was never used by anyone else. Now that all transcription files have been converted to a consistent format (IVTFF), I also changed this tool to be able to process this new format.

IVTT

The new version of this tool has been named (prosaically) IVTT (Intermediate Voynich Transcription Tool). Its user guide is provided here, and the latest source code in C is provided here. Users also need to be familiar with the IVTFF format definition.

Rather than explaining the usefulness of this tool by duplicating information from these documents, I prefer to show this by giving a few simple examples.

Examples

One of many uses of the tool is to select specific parts of a transcription file. For example, the command:

ivtt +QO +@Lc ZL.txt

can be used to extract all container labels in quire 15 from the ZL transcription file. The result is:

<f88r>         <! $I=P $Q=O $P=C $L=A $H=4>                                     
<f88r.1,@Lc>     otorchety
<f88r.12,@Lc>    otaldy
<f88r.23,@Lc>    ofyskydal
<f88v>         <! $I=P $Q=O $P=D $L=A $H=4>                                     
<f88v.1,@Lc>     okalyd
<f88v.11,@Lc>    otoram
<f88v.27,@Lc>    daramdal
<f89r1>        <! $I=P $Q=O $P=F $L=A>                                          
<f89r1.1,@Lc>    okchshy
<f89r1.11,@Lc>   ykyd
<f89r1.24,@Lc>   ykocfhy
<f89r2>        <! $I=P $Q=O $P=G $L=A>                                          
<f89r2.1,@Lc>    odory
<f89r2.9,@Lc>    otold[:y]
<f89r2.16,@Lc>   korainy
<f89r2.29,@Lc>   okain
<f89r2.30,+Lc>   yorainopaloiiry
<f89v1>        <! $I=P $Q=O $P=J $L=A>                                          
<f89v1.1,@Lc>    okoraldy
<f89v1.21,@Lc>   <!container>koeeorain

To remove the 'dressing' and leave only the transcribed text, the command issued above can be changed (for example) as follows:

ivtt -f1 -c5 +QO +@Lc ZL.txt

The same can be done reading the GC file:

ivtt -f1 -c5 +QO +@Lc GC.txt

or for Takeshi's transcription as included in the LSI file:

ivtt -f1 -c5 +QO +@Lc -tH LSI.txt

The results of the three commands are shown below, next to each other:

From ZL file   From GC file   From LSI file
otorchety      okoy1ck9       otorchety
otaldy         okae89         otaldy
ofyskydal      of9sh98ae      ofyskydal
okalyd         ohAe98         okolyd
otoram         okoyap         otoram
daramdal       8ayap7ae       daramdal
okchshy        oh159          okchshy
ykyd           9h98           ykyd
ykocfhy        9hoF9          ykocfhy
odory          o8oy9          odory
otold[:y]      okoe89         otoldy
korainy        hoyan,9        korain.y
okain          ohan           okain
yorain         9oyan          yorain
choeesy        1oCs9          choeesy
otory          okay9          okory
opaloiiry      ogaeoZ9        opaloiiry
okoraldy       ohayae79       okoraldy
koeeorain      hoCoy,an       koeeorain

As another example, one may quickly see that centred lines in normal paragraph text occur equally in pages written in Currier language A and in Currier language B:

ivtt -x7 +@Pc +LA ZL.txt
>>
ytchas oraiin chkor
dainod ychealod
ychekch y kchaiin
saiinchy daldalol
sam chorly
otchodeey
dorain ihar
okar sheey shekealy
teol cheol otchey ???cheor cheol ctheol cholaiin chol qkar

and

ivtt -x7 +@Pc +LB ZL.txt
>>
oteor aiicthy
oteol cholkal qokal dar ykdy
pchedy chetar ofair arody
yteched ar olkey okeoam
s ar chedar olpchdy otol otchedy
dar oleey ol yy
otar chdy dytchdy
otoiir chedaiin otair otaly

Notes

1. See Clemens (2016) and Skinner (2017).
2. See here.
3. See here.
4. See here.

 


Copyright René Zandbergen, 2017
Latest update: 22/09/2017