
The analytical alignment alphabet

Introduction

When the STA alphabet was first designed, it was decided that it should be a synthetic superset of v101 and extended Eva, as this was the less complicated approach. This was first presented during the 2022 Voynich MS conference (1), which referred to intermediate STA version 16. STA was consolidated (as STA version 1) in early 2023, and this version is documented on this page with links to detailed additional documents.

While using STA to align transliterations against each other, there were a few cases where this alignment was complicated due to different combinations of strokes by the various transcribers. It was decided to resolve this by creating another superset of v101 and extended Eva, but to use the opposite approach, namely to make it analytical. This second, analytical alphabet organises strokes or minims of the Voynich characters into families, in a way similar to the STA alphabet.

Why yet another alphabet?

Four different transliteration alphabets already existed when STA was designed, five if one also counts the unused "Frogguy" (for which see here). STA was not designed to replace or supersede any of the original alphabets. These original alphabets were intended for text analysis. STA was designed as a superset of all existing alphabets, in order to be able to represent all transliterations in a common system. With the availability of the tool 'bitrans', creating a new alphabet is easy, and the conversion of existing transliterations from one alphabet to another is fully automatic. The purpose of this additional alphabet is primarily to align different transliterations. We will see further below that it is also very well suited for alignment of a transliteration to the digital images of the MS text, as in OCR applications. The alphabet was dubbed 'analytical alignment alphabet' or 'aaa'. As it encodes strokes or minims, this alphabet is highly verbose.

Example

An example may serve to clarify this approach [maybe find a good example in a MS image]. The character sequence C   Kh may be interpreted by one transcriber as two separate parts, and by another as a single shape. The first could write in Eva: 'c kh' and the second 'ckh'. When converted to STA, the first would have 'Ja Sa' and the second 'U1'. These are not obviously similar. In the stroke-based case of Eva, they are more clearly similar, differing only in the additional space. The "aaa" alphabet will make this comparison very easy again.

Principles behind the aaa alphabet

The definition of 'aaa' follows from three principles:

While STA characters start with a capital, which is followed by a number or a lower-case character, aaa characters will consist of a lower-case character followed by a number (single digit). This removes any ambiguity between STA and aaa. Consequently, we may also speak of aaa families.

In the design of aaa we may avoid a few aspects of Eva that some of its users have commented on:

The use of a digit allows all different c-like shapes to be represented by a 'c' character. The four digits '0' to '3' are reserved for four specific cases:

The digits '4' to '9' are available for identifying specific variants of certain characters.
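
The naming rule above can be sketched in a few lines of Python. This is only an illustration of the rule as stated here; the function names are my own and are not part of any existing tool, and the labels "reserved" and "variant" are just shorthand for the two digit ranges.

```python
import re

# One aaa character: a lower-case family letter followed by a single digit.
AAA_CHAR = re.compile(r"[a-z][0-9]")

def split_aaa_char(code: str) -> tuple[str, int, str]:
    """Return (family, digit, digit_class) for one aaa character such as 'c2'."""
    if not AAA_CHAR.fullmatch(code):
        raise ValueError(f"not an aaa character: {code!r}")
    family, digit = code[0], int(code[1])
    # Digits 0-3 are the four reserved cases; 4-9 identify specific variants.
    digit_class = "reserved" if digit <= 3 else "variant"
    return family, digit, digit_class

def looks_like_sta(code: str) -> bool:
    """STA characters start with a capital, followed by a digit or a lower-case letter."""
    return re.fullmatch(r"[A-Z][0-9a-z]", code) is not None
```

Because the two patterns cannot match the same string, a tool can tell from a single code whether it is reading STA or aaa.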

aaa definition

The list of 'aaa' families is given in the following table.

aaa family  Description
a           Eva-a
b           Tail up from the bottom, part of Eva-b, Eva-n, Eva-u
c           Eva-e, also when connected to left or right, such as Eva-c, Eva-h
d           Eva-d, and similar shapes (Eva-j)
e           Eva-l
g           Eva-q
i           Eva-i, also connected to left or right
j           Right-hand part of Eva-g and Eva-m
l           Left-hand part of Eva-f and Eva-k
o           Eva-o
p           Right-hand part of Eva-k and Eva-t
q           Left-hand part of Eva-p and Eva-t
t           Tail up from the top, part of Eva-r, Eva-s
u           Collection of rare unclassifiable characters
v           Quote as in Eva-sh
x           Right-hand part of Eva-f and Eva-p
y           Eva-y
z           Use for illegible readings

The following table has been modeled after Table 1 on this page.

Eva   aaa               Eva    aaa
q     g0                i      i0
o     o0                il     i0 e0
d     d0                iil    i0 i0 e0
y     y0                iiil   i0 i0 i0 e0
s     c0:t0             ir     i0 i0:t0
l     e0                iir    i0 i0 i0:t0
r     i0:t0             iiir   i0 i0 i0 i0:t0
ch    c2:c1             n      i0:b0
sh    c2:v3:c1          in     i0 i0:b0
t     q2:p1             iin    i0 i0 i0:b0
p     q2:x1             iiin   i0 i0 i0 i0:b0
k     l2:p1             m      i2:j1
f     l2:x1             im     i0 i2:j1
cth   c2:q3:p3:c1       iim    i0 i0 i2:j1
cph   c2:q3:x1:c1       iiim   i0 i0 i0 i2:j1
ckh   c2:l3:p3:c1       g      c2:j1
cfh   c2:l3:x1:c1       j      d4
a     a0                x      e4
e     c0                v      a5
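
To illustrate how such a table is applied, here is a minimal Python sketch that converts a word in basic Eva to aaa by always taking the longest matching table entry. It is not how 'bitrans' works internally (that is not described on this page); the dictionary holds only a subset of the rows above, and the function name is hypothetical.

```python
# Subset of the Eva-to-aaa table above; the remaining rows can be added the same way.
EVA_TO_AAA = {
    "q": "g0", "o": "o0", "d": "d0", "y": "y0", "a": "a0", "e": "c0",
    "l": "e0", "s": "c0:t0", "r": "i0:t0", "ch": "c2:c1", "sh": "c2:v3:c1",
    "i": "i0", "n": "i0:b0", "in": "i0 i0:b0", "iin": "i0 i0 i0:b0",
    "iiin": "i0 i0 i0 i0:b0",
}

def eva_to_aaa(word: str) -> str:
    """Convert one Eva word to aaa, preferring the longest table entry at each position."""
    longest = max(len(key) for key in EVA_TO_AAA)
    out = []
    pos = 0
    while pos < len(word):
        for size in range(min(longest, len(word) - pos), 0, -1):
            chunk = word[pos:pos + size]
            if chunk in EVA_TO_AAA:
                out.append(EVA_TO_AAA[chunk])
                pos += size
                break
        else:
            raise KeyError(f"no table entry for {word[pos:]!r}")
    # A single space means 'close to the previous character'.
    return " ".join(out)

print(eva_to_aaa("daiin"))
```

The longest-match rule matters for entries such as ch and sh, whose parts have no entry of their own in this subset.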

Some explanatory comments:

The 'aaa' alphabet is only used in the VDBTF file format, for which see below. In general, we shall consider alphabets like Currier, Eva and v101 'native' alphabets for transliterations, and STA and aaa 'general' alphabets.

Aligning text to images (OCR)

Two transliterations expressed in 'aaa' will look very similar, as long as they are both reasonably accurate. In addition to this, the human eye can easily map any written text in the MS to a transliteration using 'aaa'. An OCR tool should be able to do the same, so 'aaa' is well suited as an input to OCR applications.

In handwritten text, there is great variation in how characters have been written, and a single digit is completely inadequate to capture all these variations. For that reason, 'aaa' will define only the most relevant and distinctive shapes, while any OCR process may identify (and label) many more variations of each 'aaa' character. This means that for every aaa character there will be a list of variants as they are encountered in the handwritten text.

These variants are described in small image files which will be called "character models". These character models may be based on existing font definitions or on the Voynich MS digital images. A detailed description of character model files and associated catalogues is outside the scope of this document. Following is just a brief summary.

Character models are 8-bit BMP files, with a colour map that encodes a grey scale from 00 to FF. The background is dark, using lightness value 16 (hexadecimal 10) whereas the writing is light using lightness value 239 (hexadecimal EF). Values in between are allowed. Values 0 and 255 (hexadecimal 00 and FF) represent cells that are not part of the model and should be ignored when using the model. Each model has an anchor point which is encoded in the unused part of the colour map. The anchor point is either the central point where the character should touch the line of writing, or some other relevant point.
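
The grey-value convention can be sketched as follows. This is a minimal illustration only: the linear scaling between background and writing is my assumption for the example, as is the choice to reject the values 1-15 and 240-254, for which no meaning is defined above. Reading the BMP file and the anchor point is not covered.

```python
IGNORE_VALUES = {0x00, 0xFF}   # cells that are not part of the model
BACKGROUND = 0x10              # dark background (16)
INK = 0xEF                     # light writing (239)

def ink_fraction(value: int):
    """Map a grey value to 0.0 (background) .. 1.0 (writing), or None if the cell is ignored."""
    if value in IGNORE_VALUES:
        return None
    if not BACKGROUND <= value <= INK:
        # Assumption: values outside 16..239 (other than 0 and 255) are invalid.
        raise ValueError(f"grey value {value} is outside the model range")
    # Assumption: intermediate values scale linearly between background and writing.
    return (value - BACKGROUND) / (INK - BACKGROUND)
```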

Character models shall use extension ".bmp" and the file names shall have 8 to 12 characters, where the first two are the aaa character name and the third is a minus sign / hyphen. This leaves a 1 - 5 character string to identify different models for each aaa character.
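
The naming rule can be checked with a short regular expression, as in the sketch below. The rule above does not say which characters are allowed in the 1 to 5 character identifier; letters and digits are assumed here. The file names in the usage note are invented examples.

```python
import re

# <aaa character> '-' <1 to 5 character identifier> '.bmp'
# Assumption: the identifier consists of ASCII letters and digits only.
MODEL_NAME = re.compile(r"([a-z][0-9])-([A-Za-z0-9]{1,5})\.bmp")

def parse_model_name(name: str):
    """Return (aaa_character, model_id), or None if the name does not follow the rule."""
    match = MODEL_NAME.fullmatch(name)
    if match is None:
        return None
    return match.group(1), match.group(2)
```

For example, the hypothetical names "c2-a.bmp" (8 characters) and "i0-00017.bmp" (12 characters) are accepted, while "c2-toolong.bmp" is rejected.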

Deriving the full aaa definition

Reduced STA

It was not necessary to derive the 'aaa' definition from a new comparison between v101 and extended Eva, but it could be derived from STA-1 directly. Taking into account the above brief discussion of character variations, as a first step, a 'reduced' version of STA-1 was defined (STA-1R), that was used as the model for 'aaa'. Following is the list of considerations for this reduction.

  1. v101 distinguishes two variants of the characters f and p (also for their pedestalled versions) depending on whether the final stroke to the left has a hook or not. It was decided not to keep this distinction, even though both variants have numerous appearances in the full MS text.
  2. v101 distinguishes many different variants of the character Sh depending on the size, shape, and positioning of the 'curl' on top. It was decided to retain only a subset of these.
  3. Character variants that appear rarely in the MS, and are sufficiently similar to more frequent characters, will not be distinguished from these more frequent characters.

In the third case it would be tempting to remove all characters that have no exact definition in extended Eva, but this was not the desired criterion. The reduced STA table will still include some characters that cannot be represented exactly in extended Eva, and these will require a separate treatment. Effectively, they will receive new extended Eva codes.

In all three cases, different character models may capture the variations that were ignored during this reduction process.

The result of this exercise is a bitrans table which maps the characters that should be removed to other STA characters. This table is of course one-directional. The main reason for this reduction of characters is that this shall be the basis for a 'Reference' transliteration, which is discussed further below.

Full aaa definition

A first complete definition of aaa was achieved on 30 July 2023, leading to the first stable version aaa-1. The details of this definition may be added here.

The initial STA-1 to aaa-1 bitrans table is defined only for the reduced STA character set. This means that converting any native transliteration to aaa involves three steps (and is irreversible):

  1. Convert native to full STA-1 (this is reversible).
  2. Convert full STA-1 to (reduced) STA-1R (this is not reversible).
  3. Convert STA-1R to aaa-1 (this is also not reversible).
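
Why the chain as a whole is irreversible can be shown with a toy example in Python. All codes below are hypothetical placeholders and not real table entries, and the functions are not part of bitrans. The sketch only illustrates the effect of step 2, where several characters are folded into one.

```python
# All codes are hypothetical placeholders, not real native, STA or aaa entries.
NATIVE_TO_STA = {"n1": "X1", "n2": "X2", "n3": "X3"}    # step 1: one-to-one
STA_TO_REDUCED = {"X1": "X1", "X2": "X1", "X3": "X3"}   # step 2: X2 is folded into X1
REDUCED_TO_AAA = {"X1": "Y1", "X3": "Y3"}               # step 3 (toy stand-in)

def invert(table: dict):
    """Return the inverse table, or None if two sources share one target."""
    inverse = {}
    for source, target in table.items():
        if target in inverse:
            return None
        inverse[target] = source
    return inverse

def convert(codes: list, *tables: dict) -> list:
    """Apply the given tables one after the other to a list of codes."""
    for table in tables:
        codes = [table[code] for code in codes]
    return codes
```

Here invert(NATIVE_TO_STA) succeeds, but invert(STA_TO_REDUCED) returns None: once "n1" and "n2" have both become "Y1", the original reading cannot be recovered.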

A practical aspect: since files using aaa are always in the VDBTF format (see below), while native files are only available in IVTFF format, a format conversion is needed between steps 1 and 3 above. For this conversion a dedicated tool ('i2v') has been created.

Based on the above, also a full STA-1 to aaa-1 table has been created. This table can be accessed by applications and tools for ad hoc STA-1 to aaa-1 conversions without invoking bitrans.

File formats

Since 2017 a common file format for transliteration files has been in use: the IVTFF format. It was first presented at the 2022 Voynich MS conference (see note 1), when it was still at format version 1.7. It is presently at version 2.0, and is defined in this PDF document. Files expressed in this format have been the source files for transliterations, but more recently I have integrated them into a prototype database (2). The IVTFF format has become one of the export formats of this database.

The database also supports a second export format which is suited for transliterations using the STA or aaa alphabets. This format is called VDBTF. The difference between VDBTF and IVTFF may be summarised as follows:

The transcriber ID is not used to identify a person, but to identify a transliteration file, and can take values 1-9.

The separator character may be:

Ch       Meaning                           In IVTFF  Comment
:        Connected to previous character   (N/A)     Can only be used with 'aaa'
(space)  Close to the previous character   (none)    Used between adjacent characters
,        Uncertain word space              ,
.        Certain word space                .
!        The start of a line               (none)    But not also the start of a paragraph
%        The start of a paragraph          <%>
-        A drawing intrusion in the text   <->       The writing is aligned vertically
~        A drawing intrusion in the text   <~>       The writing is not aligned vertically

N/A means 'not applicable' (or: does not exist). The code '~' is only used in the ZL transliteration, in fewer than 10 cases.

As an example, the word daiin at the start of a line (but not the start of a paragraph) would appear in a VDBTF file in STA-1 and in aaa-1 as follows:

!B1 A3 G1
!d0 a0 i0 i0 i0:b0
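
The aaa line of this example can be split into separator and character pairs with a few lines of Python. This is only a sketch of the text part: it assumes that every aaa character is preceded by exactly one separator from the table above, and it ignores any other fields a complete VDBTF file may contain. The function names are hypothetical.

```python
import re

# One separator from the table above, followed by one aaa character.
TOKEN = re.compile(r"([: ,.!%~-])([a-z][0-9])")

def tokenize_aaa(text: str) -> list:
    """Split VDBTF text in aaa into (separator, character) pairs."""
    tokens = []
    pos = 0
    while pos < len(text):
        match = TOKEN.match(text, pos)
        if match is None:
            raise ValueError(f"unexpected input at position {pos}: {text[pos:]!r}")
        tokens.append((match.group(1), match.group(2)))
        pos = match.end()
    return tokens

def group_connected(tokens: list) -> list:
    """Group aaa characters that are joined by ':' into one written shape."""
    groups = []
    for separator, char in tokens:
        if separator == ":" and groups:
            groups[-1].append(char)
        else:
            groups.append([char])
    return groups

print(group_connected(tokenize_aaa("!d0 a0 i0 i0 i0:b0")))
```

For the example line this yields five groups, the last of which holds the two connected strokes i0 and b0 that together form Eva-n.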

A Reference Transliteration

Introduction

The alignment of all transliterations would be greatly facilitated if one knew exactly how many characters there are in each locus. The words of each transliteration can then refer to these. As long as we do not know this, we may achieve the same objective by setting up a reference transliteration. The main purpose of this reference is to model as closely as possible the number of characters in each locus.

It is desirable that this reference is also accurate, but that is necessarily limited by the accuracy of the available transliterations. Its accuracy can be improved iteratively once OCR is possible. OCR itself would be greatly facilitated if one had a good a priori transliteration to work from. That, again, is what the reference transliteration could provide. Clearly, any reference transliteration will be the result of an iterative process.

Having a reference transliteration in the STA alphabet will allow for a convenient representation of all the items composing the Voynich MS text. In the IVTFF format 5389 loci are defined. The largest number of loci on any single page is 160, which is the case for the Rosettes page (the combination of f85v and f86r).

Every page has a name that follows from the traditional foliation, such as f39v or f102r2. However, every page can also be represented by two characters, which are the page variables $Q and $P in the IVTFF format respectively. Here, $Q is the quire letter, which runs from A to T, and $P is the 'page in quire' running from A up to X, depending on each quire. While these two-character page codes are not easy to memorise, their use has some very practical advantages:

  1. All page codes have the same (short) length of two characters
  2. Their alphabetical sorting is also the order in which they appear in the MS
  3. On Unix-like systems, the shell glob pattern '??' expands to the list of pages in the correct order

As a result, every locus can be represented as a 5-character sequence of two characters identifying the page followed by three digits identifying the locus on the page. This five-character identification of loci is used in the majority of my text processing applications, including a prototype Voynich transliteration database.
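
A minimal Python sketch of this five-character locus identification is given below. The function names are hypothetical, and "AB" is used only as an invented page code; "KH001" is a locus discussed on this page.

```python
import re

# Quire letter A-T, page-in-quire letter A-X, then a three-digit locus number.
LOCUS_ID = re.compile(r"([A-T])([A-X])([0-9]{3})")

def parse_locus_id(locus_id: str):
    """Split a locus ID such as 'KH001' into (page_code, locus_number)."""
    match = LOCUS_ID.fullmatch(locus_id)
    if match is None:
        raise ValueError(f"not a valid locus ID: {locus_id!r}")
    quire, page, locus = match.groups()
    return quire + page, int(locus)

def format_locus_id(page_code: str, locus: int) -> str:
    """Build the five-character locus ID from a page code and a locus number."""
    return f"{page_code}{locus:03d}"
```

Because all IDs have the same length, a plain alphabetical sort, for instance sorted(["KH001", "AB010", "AB002"]), also puts the loci in MS order.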

Furthermore, the longest locus in the MS in terms of STA codes is KH001. This is on page 'KH' which is f72v3, and locus 1 is the large outer circle of text. It has a length of 209 STA characters. This means that every character (in STA) in the MS can in principle be identified by a sequence of 6 characters, in which the first two are the page, the next two (in hexadecimal) are the locus on the page and the last two (in hexadecimal) are the character in the locus. This numbering of characters will be based on the reference transliteration, and can be improved iteratively, as this reference is improved.
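
The six-character identification of a single character can be sketched as follows. Two details are my assumptions for this illustration and are not fixed above: that the hexadecimal digits are written in upper case, and that characters in a locus are counted from 1.

```python
def char_id(page_code: str, locus: int, position: int) -> str:
    """Build a six-character ID: page code, locus (hex), character position (hex)."""
    if not (0 <= locus <= 0xFF and 0 <= position <= 0xFF):
        raise ValueError("locus and position must each fit in two hexadecimal digits")
    # Assumption: upper-case hexadecimal digits.
    return f"{page_code}{locus:02X}{position:02X}"

def parse_char_id(cid: str):
    """Split a six-character ID into (page_code, locus, position)."""
    if len(cid) != 6:
        raise ValueError(f"expected six characters, got {cid!r}")
    return cid[:2], int(cid[2:4], 16), int(cid[4:6], 16)

# Last character of the longest locus, KH001 (209 STA characters),
# assuming positions are counted from 1.
print(char_id("KH", 1, 209))
```

Both limits fit: the largest locus number on a page (160) is A0 in hexadecimal and the longest locus (209 characters) ends at D1, each below the two-digit maximum of FF (255).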

Approach

The way to achieve a first reference is to align the GC and ZL transliterations against each other and create an optimal combination. Attempts to do this quickly point out a number of specific issues in each of these transliterations. As the maintainer of the ZL transliteration I can correct it as necessary, but I cannot do this with the GC file. What I can do, however, is to correct any positively identified issues in the GC file by creating a working copy and modifying only this working copy. While subjective changes can also be made to the ZL file, only clear mistakes in the GC file have been corrected in this manner.

For the sake of the reference, subtle differences between characters do not play a major role. These can be resolved later. For this reason, it can be expressed in the 'reduced' STA alphabet described above, which is inherent in the definition of the 'aaa' alphabet. Converting both the ZL and GC transliterations to this reduced character set further increases their similarity, and improves the success of the alignment.

The alignment between the two files was based on the aaa alphabet, but the character comparisons were based on the STA codes after this alignment. (Every STA code can be seen as a box with a few aaa characters inside.)
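
The idea of aligning at the stroke level can be illustrated with Python's standard difflib module. This is a stand-in for the purpose of illustration and not the alignment method actually used; the example compares two hypothetical readings of the same word, one ending in Eva-n and the other in Eva-r.

```python
import re
from difflib import SequenceMatcher

def strokes(aaa_text: str) -> list:
    """Extract the aaa characters from a text, ignoring all separators."""
    return re.findall(r"[a-z][0-9]", aaa_text)

def align(a_text: str, b_text: str) -> list:
    """Align two aaa readings and report equal and differing stretches."""
    a, b = strokes(a_text), strokes(b_text)
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    return [(tag, a[i1:i2], b[j1:j2])
            for tag, i1, i2, j1, j2 in matcher.get_opcodes()]

for step in align("d0 a0 i0 i0 i0:b0", "d0 a0 i0 i0 i0:t0"):
    print(step)
```

The two readings share the first five strokes and differ only in the final tail (b0 against t0), which is the kind of localised difference that the subsequent comparison of STA codes then has to resolve.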

The result

The result of the combination of the ZL and GC transliterations, adapted as described above, has been named reference transliteration version 1a, with the code 'RF'. It consists of 157,267 characters in the STA alphabet, of which 4% was selected from the (modified) GC file, 0.8% from the ZL file and the remainder was common. It includes both characters that exist in extended Eva but not in v101, and characters in v101 that do not exist in extended Eva. Therefore, it can only be expressed in STA (and aaa), but in no native alphabet. It has also been converted (reduced) to basic Eva for convenience, and the two versions are available here (in Basic Eva) and here (in STA).

Notes

(1) See Zandbergen (2022).
(2) Attempts to make this a publicly available MySQL database have so far been unsuccessful, but this will be explored further.

 

Copyright René Zandbergen, 2024
Comments, questions, suggestions? Your feedback is welcome.
Latest update: 29/11/2024