
The analytical alignment alphabet

Introduction

When the STA alphabet was first designed, it was decided that it should be a synthetic superset of v101 and extended Eva, as this was the less complicated approach. This was first presented during the 2022 Voynich MS conference (1), which referred to intermediate STA version 16. STA was consolidated (as STA version 1) in early 2023, and this version is documented on this page with links to detailed additional documents.

While using STA to align transliterations against each other, there were a few cases where this alignment was complicated due to different combinations of strokes by the various transcribers. It was decided to resolve this by creating another superset of v101 and extended Eva, but to use the opposite approach, namely to make it analytical. This second, analytical alphabet organises strokes or minims of the Voynich characters into families, in a way similar to the STA alphabet.

Why yet another alphabet?

Already four different transliteration alphabets existed when STA was designed, five if one also counts the unused "Frogguy" (for which see here). STA was not designed to replace or supersede any of the original alphabets. These original alphabets were intended for text analysis. STA was designed as a superset of all existing alphabets, in order to be able to represent all transliterations in a common system. With the availability of the tool 'bitrans', creating a new alphabet is easy, and the conversion of existing transliterations from one alphabet to another is fully automatic.

The purpose of this additional new alphabet is primarily to align different transliterations. We will see further below that it is also very well suited for alignment of a transliteration to the digital images of the MS text, as in OCR applications. The alphabet was dubbed 'analytical alignment alphabet' or 'aaa'. As it encodes strokes or minims, this alphabet is highly verbose.

Example

An example may serve to clarify this approach. The character sequence    may be interpreted by one transcriber as two separate parts, and by another as a single shape. The first could write in Eva: 'c kh' and the second 'ckh'. When converted to STA, the first would have 'Ja Sa' and the second 'U1'. These are not obviously similar. In the stroke-based case of Eva, they are more clearly similar, differing only in the additional space. The 'aaa' alphabet will make this comparison very easy again.

Principles behind the aaa alphabet

The definition of 'aaa' follows from three principles:

While STA characters start with a capital, which is followed by a number or a lower-case character, aaa characters will consist of a lower-case character followed by a number (single digit). This removes any ambiguity between STA and aaa. Consequently, we may also speak of aaa families.

In the design of aaa we are able to avoid a few aspects of Eva that some of its users have commented on:

The use of a digit allows all different c-like shapes to be represented by a 'c' character. The four digits '0' to '3' are reserved for four specific cases:

The digits '4' to '9' are available for identifying specific variants of certain characters.
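The naming rule can be illustrated with a short Python sketch. It checks only the form of a code, not whether the code is actually defined in STA1 or aaa. The function names are hypothetical, and the patterns assume that an STA code is exactly two characters long.

```python
import re

# Form-only patterns derived from the naming rule described above.
# Assumption: every STA code and every aaa code is exactly two characters.
STA_CODE = re.compile(r"[A-Z][0-9a-z]")   # capital + digit or lower-case letter
AAA_CODE = re.compile(r"[a-z][0-9]")      # lower-case letter + single digit


def classify_code(code: str) -> str:
    """Return 'STA', 'aaa' or 'unknown' based on the shape of the code."""
    if STA_CODE.fullmatch(code):
        return "STA"
    if AAA_CODE.fullmatch(code):
        return "aaa"
    return "unknown"


def aaa_family(code: str) -> str:
    """Return the family letter of an aaa code, e.g. 'c' for 'c2'."""
    if not AAA_CODE.fullmatch(code):
        raise ValueError(f"not an aaa code: {code!r}")
    return code[0]
```

Because the first character is a capital in STA and a lower-case letter in aaa, the two patterns can never match the same code.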

aaa at the high level

The list of 'aaa' families is given in the following table.

aaa family Description
a Eva-a
b Tail up from the bottom, part of Eva-b, Eva-n, Eva-u
c Eva-e, also when connected to left or right, such as Eva-c, Eva-h
d Eva-d, and similar shapes (Eva-j)
e Eva-l
g Eva-q
i Eva-i, also connected to left or right
j Right-hand part of Eva-g and Eva-m
l Left-hand part of Eva-f and Eva-k
o Eva-o
p Right-hand part of Eva-k and Eva-t
q Left-hand part of Eva-p and Eva-t
t Tail up from the top, part of Eva-r, Eva-s
u Collection of rare unclassifiable characters
v Quote as in Eva-sh
x Right-hand part of Eva-f and Eva-p
y Eva-y
z Use for illegible readings

With this, we may show the definition of the most common characters. The following table has been modeled after Table 1 on this page.

Eva    aaa               Eva    aaa
q      g0                i      i0
o      o0                il     i0 e0
d      d0                iil    i0 i0 e0
y      y0                iiil   i0 i0 i0 e0
s      c0:t0             ir     i0 i0:t0
l      e0                iir    i0 i0 i0:t0
r      i0:t0             iiir   i0 i0 i0 i0:t0
ch     c2:c1             n      i0:b0
sh     c2:v3:c1          in     i0 i0:b0
t      q2:p1             iin    i0 i0 i0:b0
p      q2:x1             iiin   i0 i0 i0 i0:b0
k      l2:p1             m      i2:j1
f      l2:x1             im     i0 i2:j1
cth    c2:q3:p3:c1       iim    i0 i0 i2:j1
cph    c2:q3:x1:c1       iiim   i0 i0 i0 i2:j1
ckh    c2:l3:p3:c1       g      c2:j1
cfh    c2:l3:x1:c1       j      d4
a      a0                x      e4
e      c0                v      a5
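To make the table more concrete, here is a minimal Python sketch that converts an Eva word to aaa by direct lookup. It is only an illustration: the conversion described on this page goes through STA using bitrans, not through a direct table like this. The dictionary holds a subset of the table, and the function name and the greedy longest-match strategy are assumptions of this sketch.

```python
# Subset of the Eva-to-aaa table above (illustrative only).
EVA_TO_AAA = {
    "q": "g0", "o": "o0", "d": "d0", "y": "y0", "s": "c0:t0",
    "l": "e0", "r": "i0:t0", "ch": "c2:c1", "sh": "c2:v3:c1",
    "t": "q2:p1", "k": "l2:p1", "a": "a0", "e": "c0", "i": "i0",
    "n": "i0:b0", "m": "i2:j1", "g": "c2:j1",
}


def eva_to_aaa(word: str) -> str:
    """Convert one Eva word to aaa, trying the longest Eva keys first."""
    keys = sorted(EVA_TO_AAA, key=len, reverse=True)
    parts = []
    pos = 0
    while pos < len(word):
        for key in keys:
            if word.startswith(key, pos):
                parts.append(EVA_TO_AAA[key])
                pos += len(key)
                break
        else:
            # No key matched at this position: outside the subset above.
            raise ValueError(f"no aaa mapping for {word[pos:]!r}")
    return " ".join(parts)


print(eva_to_aaa("daiin"))  # d0 a0 i0 i0 i0:b0
print(eva_to_aaa("chol"))   # c2:c1 o0 e0
```

Trying longer keys first ensures that 'sh' is not read as 's' followed by an unknown 'h'.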

Some explanatory comments:

In general, we shall consider alphabets like Currier, Eva and v101 'native' alphabets for transliterations, and STA and aaa 'generic' alphabets.

Aligning text to images (OCR)

Two transliterations expressed in 'aaa' will look very similar, as long as they are both reasonably accurate. In addition to this, the human eye can easily map any written text in the MS to a transliteration using 'aaa'. An OCR tool should be able to do the same, so 'aaa' is well suited as an input to OCR applications.

In handwritten text, there is great variation in how characters have been written, and a single digit is completely inadequate to capture all these variations. For that reason, 'aaa' will define only the most relevant and distinctive shapes, while any OCR process may identify (and label) many more variations of each 'aaa' character. This means that for every 'aaa' character there will be a list of variants as they are encountered in the handwritten text.

These variants are described in small image files which will be called 'character models'. These character models may be based on existing font definitions or on the Voynich MS digital images. A detailed description of character model files and associated catalogues is outside the scope of this document. Following is just a brief summary.

Character models are 8-bit BMP files, with a colour map that encodes a grey scale from 00 to FF. The background is dark, using lightness value 16 (hexadecimal 10) whereas the writing is light using lightness value 239 (hexadecimal EF). Values in between are allowed. Values 0 and 255 (hexadecimal 00 and FF) represent cells that are not part of the model and should be ignored when using the model. Each model has an anchor point which is encoded in the unused part of the colour map. The anchor point is either the central point where the character should touch the line of writing, or some other relevant point.
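As a rough illustration of how these grey values might be interpreted, the following Python sketch works on pixel values that have already been decoded from the BMP file. Reading the file and decoding the anchor point are not covered. The function names, the linear scaling between background and writing, and the clamping of values outside the range 16 to 239 are assumptions of this sketch, not part of the model definition.

```python
BACKGROUND = 0x10      # lightness of the dark background (16)
WRITING = 0xEF         # lightness of the light writing (239)
IGNORED = (0x00, 0xFF)  # cells that are not part of the model


def ink_fraction(value: int):
    """Map a grey value to 0.0 (background) .. 1.0 (writing).

    Returns None for cells that must be ignored. The linear scaling
    is an assumption made for illustration.
    """
    if value in IGNORED:
        return None
    clamped = min(max(value, BACKGROUND), WRITING)
    return (clamped - BACKGROUND) / (WRITING - BACKGROUND)


def model_cells(pixels):
    """Apply ink_fraction to a 2-D list of grey values."""
    return [[ink_fraction(v) for v in row] for row in pixels]
```

Any comparison between a model and an image would then skip the cells that come back as None.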

Character models shall use extension ".bmp" and the file names shall have 10 to 12 characters, where the first two are the aaa character name and the third is either a plus or a minus sign. This leaves a 3 - 5 character string to identify different models for each aaa character.
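The naming convention can be checked with a small Python sketch. The file names used here are invented, the function name is hypothetical, and the restriction of the model identifier to letters and digits is an assumption. The meaning of the plus or minus sign is not interpreted.

```python
import re

# aaa character (2) + sign (1) + identifier (3 to 5) + ".bmp" (4)
# gives a total file name length of 10 to 12 characters.
# Assumption: the identifier consists of letters and digits only.
MODEL_NAME = re.compile(r"([a-z][0-9])([+-])([A-Za-z0-9]{3,5})\.bmp")


def parse_model_name(file_name: str):
    """Split a character model file name into (aaa char, sign, model id)."""
    match = MODEL_NAME.fullmatch(file_name)
    if match is None:
        raise ValueError(f"not a valid character model name: {file_name!r}")
    aaa_char, sign, model_id = match.groups()
    return aaa_char, sign, model_id


print(parse_model_name("c0+abc.bmp"))    # ('c0', '+', 'abc')
print(parse_model_name("i0-hand1.bmp"))  # ('i0', '-', 'hand1')
```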

Deriving the full aaa definition

Reduced STA

It was not necessary to derive the 'aaa' definition from a new comparison between v101 and extended Eva, but it could be derived from STA1 directly. Taking into account the above brief discussion of character variations, as a first step, a 'reduced' version of STA1 was defined (STA1R), that was used as the model for 'aaa'. Following is the list of considerations for this reduction.

  1. v101 distinguishes two variants of the characters and (also for their pedestalled versions) depending on whether the final stroke to the left has a hook or not. It was decided not to keep this distinction, even though both variants have numerous appearances in the full MS text.
  2. v101 distinguishes many different variants of the character  depending on the size, shape, and positioning of the 'curl' on top. It has been decided to maintain only a subset of these.
  3. Character variants that appear rarely in the MS, and are sufficiently similar to more frequent characters, will not be distinguished from these more frequent characters.

Whereas the full definition of STA1 includes 299 different characters, reduced STA includes only 197. These 197 STA codes are indicated in the full STA1 definition document by an asterisk (*) in the third column.

A dedicated bitrans table has been set up for converting full STA to reduced STA: STA1R.bit. (For more details about bitrans, see here). This table is of course one-directional. The main reason for this reduction of characters is that this shall be the basis for a 'Reference' transliteration, which is discussed further below.

As already mentioned above, different character models may capture the variations that were ignored during this reduction process.

Full aaa definition

A first complete definition of aaa was achieved on 30 July 2023. Another iteration leading to a stable version of reduced STA and a stable aaa version was completed on 18 May 2025. A link to the details of this definition will be added here in the near future.

There are two bitrans tables for converting STA to aaa, both of which are irreversible:

  1. STA-aaa.bit, to convert full STA to aaa
  2. STAR-aaa.bit, to convert reduced STA to aaa

In order to convert a text in a native transliteration alphabet to aaa, it must first be converted to STA.

A Reference Transliteration

Introduction

The alignment of all transliterations to each other would be greatly facilitated if one knew exactly how many characters there are in each locus. The words of each transliteration can then refer to these. As long as we do not know this, we may achieve the same objective by setting up a reference transliteration. The main purpose of this reference is to model as closely as possible the number of characters in each locus.

It is desirable that this reference is also accurate, but that is necessarily limited by the accuracy of the available transliterations. Its accuracy can be improved iteratively once OCR is possible. OCR itself would be greatly facilitated if one had a good a priori transliteration to work from. That, again, is what the reference transliteration can provide. Clearly, a good reference transliteration will be the result of an iterative process.

Having a reference transliteration in the STA alphabet will allow for a convenient representation of all the items composing the Voynich MS text. In the IVTFF format 5385 loci are defined. The largest number of loci on any single page is 160, which is the case for the Rosettes page (the combination of f85v and f86r).

Every page has a name that follows from the traditional foliation, such as f39v or f102r2. However, every page can also be represented by two characters, which are the page variables $Q and $P in the IVTFF format respectively. Here, $Q is the quire letter, which runs from A to T, and $P is the 'page in quire' running from A up to X, depending on each quire. While these two-character page codes are not easy to memorise, their use has some very practical advantages:

  1. All page codes have the same (short) length of two characters
  2. Their alphabetical sorting is also the order in which they appear in the MS
  3. On Unix-like systems, the shell wildcard pattern '??' expands to the list of pages in the correct order

As a result, every locus can be represented as a 5-character sequence of two characters identifying the page followed by three digits identifying the locus on the page. This five-character identification of loci is used in the majority of my text processing applications, including a prototype Voynich transliteration database.
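The five-character locus identification can be sketched in Python as follows. The function names are hypothetical, and the ranges in the pattern (quire A to T, page A to X, three decimal digits) are taken from the description above.

```python
import re

# Quire letter A-T, page-in-quire letter A-X, three-digit locus number.
LOCUS_ID = re.compile(r"([A-T])([A-X])([0-9]{3})")


def make_locus_id(page_code: str, locus: int) -> str:
    """Build a five-character locus ID such as 'KH001'."""
    candidate = f"{page_code}{locus:03d}"
    if not LOCUS_ID.fullmatch(candidate):
        raise ValueError(f"cannot build a locus ID from {page_code!r}, {locus!r}")
    return candidate


def split_locus_id(locus_id: str):
    """Split a locus ID into (quire letter, page letter, locus number)."""
    match = LOCUS_ID.fullmatch(locus_id)
    if match is None:
        raise ValueError(f"not a locus ID: {locus_id!r}")
    quire, page, locus = match.groups()
    return quire, page, int(locus)


# Fixed-width IDs sort as plain strings in MS order.
print(sorted(["KH001", "AA010", "AA002"]))  # ['AA002', 'AA010', 'KH001']
```

Because every field has a fixed width, plain string sorting gives the order of the pages in the MS and of the loci on each page, with no custom sort key.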

Furthermore, the longest locus in the MS in terms of STA codes is KH001. This is on page 'KH', which is f72v3, and locus 1 is the large outer circle of text. It has a length of 209 STA characters. Since both the largest number of loci on a page (160) and the length of the longest locus (209) are below 256, each fits in two hexadecimal digits. This means that every character (in STA) in the MS can in principle be identified by a sequence of 6 characters, in which the first two are the page, the next two (in hexadecimal) are the locus on the page and the last two (in hexadecimal) are the character in the locus. This numbering of characters will be based on the reference transliteration, and can be improved iteratively, as this reference is improved.
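A minimal Python sketch of the six-character identifier described above is given here. The function names are hypothetical. Counting characters from 1 and using upper-case hexadecimal digits are assumptions of this sketch, as neither is specified above.

```python
def make_char_id(page_code: str, locus: int, position: int) -> str:
    """Build a six-character ID: page code + locus (hex) + position (hex)."""
    if len(page_code) != 2:
        raise ValueError("page code must have two characters")
    if not (0 <= locus <= 0xFF and 0 <= position <= 0xFF):
        raise ValueError("locus and position must each fit in two hex digits")
    return f"{page_code}{locus:02X}{position:02X}"


def split_char_id(char_id: str):
    """Split a character ID into (page code, locus number, position)."""
    if len(char_id) != 6:
        raise ValueError("character ID must have six characters")
    return char_id[:2], int(char_id[2:4], 16), int(char_id[4:6], 16)


# Last character of the longest locus (locus 1 on page KH, 209 characters).
print(make_char_id("KH", 1, 209))  # KH01D1
# The largest locus number on any page (160) also fits in two hex digits.
print(f"{160:02X}")                # A0
```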

Approach

The way to achieve a first reference is to align the GC and ZL transliterations against each other and create an optimal combination. When doing this, we immediately encounter some obvious issues in both transliterations. As the maintainer of the ZL transliteration I can correct these as necessary, but I cannot do this with the GC file. What I can do, however, is to correct any positively identified issues in the GC file by creating a working copy and modifying only this working copy. While changes to the ZL file can be made also for subjective changes, only clear mistakes in the GC file have been corrected in this manner.

For the sake of the reference, subtle differences between characters do not play a major role. These can be resolved later. For this reason, it can be expressed in the 'reduced' STA alphabet described above, which is inherent in the definition of the 'aaa' alphabet. Converting both the ZL and GC transliterations to this reduced character set further increases their similarity, and improves the success of the alignment.

The alignment between the two files was based on the aaa alphabet, but the character comparisons were based on the STA codes after this alignment. (Every STA code can be seen as a box with a few aaa characters inside).
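To give an impression of what an alignment in aaa looks like, here is a small Python sketch that uses the standard library's difflib. It is not the alignment method used for the reference transliteration, which is not described here. The function names are hypothetical, and for simplicity the ':' connector is treated like a space, so the connection information is lost.

```python
from difflib import SequenceMatcher


def aaa_tokens(text: str):
    """Split aaa text into single aaa characters (':' treated as a space)."""
    return text.replace(":", " ").split()


def align(reading_a: str, reading_b: str):
    """Return a list of (tag, tokens_a, tokens_b) alignment steps."""
    a = aaa_tokens(reading_a)
    b = aaa_tokens(reading_b)
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    return [
        (tag, a[i1:i2], b[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
    ]


# Two invented readings of one word that differ by a single i-stroke.
for step in align("d0 a0 i0 i0 i0:b0", "d0 a0 i0 i0:b0"):
    print(step)
```

In this invented example, the two readings differ by one 'i0', and the alignment isolates exactly that stroke while matching everything else.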

The result

The result of the combination of the ZL and GC transliterations, that were adapted as described above, has been named reference transliteration version 1a, with the code 'RF'. It consists of 157,267 characters in the STA alphabet, of which 4% was selected from the (modified) GC file, 0.8% from the ZL file and the remainder was common. It included both characters that exist in extended Eva but not in v101, and characters in v101 that do not exist in extended Eva. Therefore, RF1a could only be expressed in STA (and aaa), but in no native alphabet.

In May 2025, it was decided to fix some known issues affecting several transliteration files. While doing this, it also appeared that RF1a still included some STA codes that were not part of the reduced STA alphabet. When this was fixed, it turned out that only three reduced STA codes did not have a representation in Eva, so these were also added to the Eva alphabet. The Eva definition included here already refers to this updated definition. The new characters have codes 221-223.

Furthermore, one label in the pharmaceutical section of the MS was identified to be a shine-through from the other side, so the corresponding locus does not exist. At the same time, the three characters of Johannes Marcus Marci's alphabet list on f1r have been removed from the transliterations (2). This means that a total of four loci have been removed, reducing the complete set from 5389 to 5385 (3).

The resulting, updated reference transliteration is called RF1b, and has 157,254 characters. Its version in (reduced) STA is available in this table. It has also been converted to full Eva and reduced Eva, and these two versions are available here.


File formats

Since 2017 a common file format for transliteration files has been in use: the IVTFF format. It was first presented at the 2022 Voynich MS conference (see note 1), when it was still at format version 1.7. It is presently at version 2.0, and is defined in this PDF document. Files expressed in this format have been the source files for transliterations, but in recent times, I have integrated them into a prototype database (4). The IVTFF format has become one of the export formats of this database.

The database also supports a second export format which is suited for transliterations using the STA or aaa alphabets. This format is called VDBTF. Whereas the STA alphabet can be used both in IVTFF and VDBTF formatted files, aaa data is only supported by the VDBTF format. Given that native files are only available in IVTFF format, it is necessary to also do a format conversion in the process of converting native to aaa.
Note that at present no files using aaa and no files using the VDBTF format are available at this web site.

The difference between VDBTF and IVTFF may be summarised as follows:

The transcriber ID is not used to identify a person, but to identify a transliteration file, and can take values 1-9.

The separator character may be:

Ch        Meaning                               In IVTFF   Comment
:         Connected to previous character       (N/A)      Can only be used with 'aaa'
(space)   Close to the previous character       (none)     Used between adjacent characters
,         Uncertain word space                  ,
.         Certain word space                    .
!         The start of a line                   (none)     But not also the start of a paragraph
%         The start of a paragraph              <%>
-         A drawing intrusion in the text       <->        The writing is aligned vertically
~         A drawing intrusion in the text       <~>        The writing is not aligned vertically

N/A means 'not applicable' (or: does not exist). The code '~' is only used in the ZL transliteration, in fewer than 10 cases.

As an example, the word written 'daiin' in Eva, at the start of a line (but not the start of a paragraph), would appear in a VDBTF file in STA1 and in aaa-1 as follows:

!B1 A3 G1
!d0 a0 i0 i0 i0:b0
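The two example lines can be split into separator and character pairs with a short Python sketch. It covers only the text part shown above. The rest of the VDBTF format is not described on this page and is not modelled. The function name is hypothetical, and the sketch assumes that every character code has two characters and is preceded by exactly one separator.

```python
import re

# Separator characters from the table above, with their meaning.
SEPARATORS = {
    ":": "connected to previous character",
    " ": "close to the previous character",
    ",": "uncertain word space",
    ".": "certain word space",
    "!": "start of a line",
    "%": "start of a paragraph",
    "-": "drawing intrusion, writing aligned vertically",
    "~": "drawing intrusion, writing not aligned vertically",
}

# Assumption: one separator followed by one two-character STA or aaa code.
TOKEN = re.compile(r"([:,.!%~ \-])([A-Za-z][0-9A-Za-z])")


def parse_vdbtf_text(text: str):
    """Return a list of (separator, code) pairs for one text line."""
    pairs = TOKEN.findall(text)
    # Reject input that the simple pattern does not cover completely.
    if "".join(sep + code for sep, code in pairs) != text:
        raise ValueError(f"cannot parse: {text!r}")
    return pairs


for sep, code in parse_vdbtf_text("!d0 a0 i0 i0 i0:b0"):
    print(code, "<-", SEPARATORS[sep])
```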


Notes

1. See Zandbergen (2022).
2. See the blog entry of Lisa Fagin Davis.
3. The four lost loci have been removed from the ZL and RF transliterations. The annotations of Marci did not appear in any other transliterations, but the lost pharmaceutical label is still included in the transliterations of Takeshi Takahashi and Glen Claston. At present I am uncertain how best to deal with this issue.
4. Attempts to make this a publicly available MySQL database have so far been unsuccessful, but this will be explored further.

 

Copyright René Zandbergen, 2025
Latest update: 27/05/2025