
This page is still incomplete and, to use a term from the earlier days of the WWW: 'under construction'.

What we may learn from the MS text entropy

Introduction

General

The general topic of text entropy, its meaning and how it can be estimated for a given text has been introduced here. In general, the entropy is a single value computed from a frequency or probability distribution. Clearly, each frequency distribution has its corresponding entropy value, but any particular entropy value can match an infinite number of different distributions. The entropy value is at its maximum when all frequencies are equal, and becomes smaller as the distribution gets more skewed.

All analyses that are presented here are based on counts of characters. This means that we are tacitly equating probability with frequency, while strictly speaking the frequency is only an estimate of the probability.

The first order, or single character, entropy is computed from the distribution of character frequencies in a text. The second order, or character pair, entropy is computed from the distribution of character pairs.
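
As a minimal sketch (the function names are mine, not from this page), both quantities can be estimated from simple counts:

```python
import math
from collections import Counter

def entropy(counts):
    """Shannon entropy (in bits) of a frequency distribution."""
    total = sum(counts.values())
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

def single_char_entropy(text):
    return entropy(Counter(text))

def char_pair_entropy(text):
    return entropy(Counter(zip(text, text[1:])))
```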

If the text were an arbitrary sequence of characters with a given single-character frequency distribution, then there would be no dependency between the first and second character of any pair: the second character would not depend on the previous one. In that case the character pair entropy is exactly twice the single character entropy. In reality this is not the case, and we will come back to it further below.
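
In formulas, with $p_i$ the single-character probabilities, $p_{ij}$ the pair probabilities, and $h_1$, $h_2$ the single-character and pair entropies (a standard identity, not taken from this page):

$$ h_2 = -\sum_{i,j} p_{ij}\log_2 p_{ij} = -\sum_{i,j} p_i\,p_j\,(\log_2 p_i + \log_2 p_j) = 2h_1 \qquad \text{when } p_{ij} = p_i\,p_j. $$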

One of the choices to be made is how to treat word spaces (effectively space characters). One can either count them among the set of characters, or not. If one does not, one also limits the character pairs to be analysed to those 'within words'.

Entropy and simple substitution ciphers

From the definition (i.e. the formula) of entropy, it is immediately obvious that the entropy values do not change if a piece of text is encoded using a simple substitution code. If characters are consistently replaced by others, the counts, and therefore the list of frequencies, are not changed. This will not be true for other types of substitution codes (one-to-many, many-to-one or many-to-many).
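
This invariance is easy to verify with a toy substitution, using the entropy functions sketched above:

```python
import string

# A toy simple substitution: rotate the lowercase alphabet by 5 places.
table = str.maketrans(string.ascii_lowercase,
                      string.ascii_lowercase[5:] + string.ascii_lowercase[:5])

plain = "thequickbrownfoxjumpsoverthelazydog"
cipher = plain.translate(table)

# The counts are merely relabeled, so both entropies are unchanged.
assert abs(single_char_entropy(plain) - single_char_entropy(cipher)) < 1e-12
assert abs(char_pair_entropy(plain) - char_pair_entropy(cipher)) < 1e-12
```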

In case the Voynich MS were a simple substitution cipher of a meaningful plain text in some known language, the character counts (and therefore the entropy) would be similar to those of other texts in the same language. This is what Bennett looked at (1), and he found that the entropy values of the Voynich MS text were considerably lower than those of most comparison texts in known languages that he used. We will do the same here, looking not only at the entropy values but also at the detailed distributions.

Character frequency distribution

The following character distributions have been computed for the Voynich MS text from existing, large transcription files. The alphabets used, and the sources of the files, have been explained on this page. The files have been pre-processed (2) in order to remove illegible characters and characters that are not standard ASCII. The numbers on the vertical scale are counts. Since we do not know the sorting order of the characters, they are initially listed by decreasing frequency.


Figure 1: The Currier-D'Imperio transcription, using the Currier alphabet.


Figure 2: the FSG transcription, using the FSG alphabet.


Figure 3: the Zandbergen-Landini transcription, using the Eva alphabet.

In the following figure, the same transcription has been converted to the Curva alphabet (which has been defined here):



Figure 4: the same (ZL) transcription using the Curva alphabet.

Differences between these plots are due to the different definitions of the alphabets, and to some extent also to the different scope of the files. That second point mainly affects the C-D transcription file, which is only half the size of the others.

These plots may be compared to one for a known plain text. Many such texts have been analysed, and below I include one example, based on the first 120,000 characters of the Latin edition of Dioscurides by Mattioli (3).


Figure 5: the Latin text of Mattioli

It is hard to draw any firm conclusions from a visual comparison of these plots. The frequency distribution of the Latin text appears to decrease a bit more gradually than those of the Voynich MS transcriptions, but this is a qualitative impression. The following table lists the single character entropy values for the four different Voynich MS transcriptions and for four plain texts. In these figures, spaces have not been counted as characters. Apart from the single character entropy values for all characters, the table also lists the character entropies for the first characters of words and for the last characters of words.

Text         Language    Nr. of chars.   Character   First char.   Last char.
                                         entropy     entropy       entropy
C-D          Voynich      66,587         3.857       3.437         2.689
FSG          Voynich     142,770         3.829       3.382         2.557
ZL           Voynich     188,862         3.862       3.206         2.541
ZL (Curva)   Voynich     156,414         3.872       3.369         2.686

Mattioli     Latin       108,960         4.036       4.122         3.355
Pliny        Latin       101,169         3.998       4.073         3.380
Dante        Italian     130,142         4.014       4.033         2.941
Tristan      German (4)   75,473         4.022       4.126         3.295

The single character entropy of the Voynich text is consistently lower than that of the plain texts, by a relatively small margin. The four Voynich transcriptions have a mean entropy of 3.855 ± 0.018, while the four plain texts have 4.018 ± 0.016. The difference between the two means, 0.163, is considerably larger than the scatter within either group.
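
These summary values follow directly from the table (a quick check using Python's standard library; stdev is the sample standard deviation):

```python
from statistics import mean, stdev

voynich = [3.857, 3.829, 3.862, 3.872]  # C-D, FSG, ZL, ZL (Curva)
plain = [4.036, 3.998, 4.014, 4.022]    # Mattioli, Pliny, Dante, Tristan

print(f"Voynich: {mean(voynich):.3f} +/- {stdev(voynich):.3f}")   # ~3.855 +/- 0.018
print(f"Plain:   {mean(plain):.3f} +/- {stdev(plain):.3f}")       # ~4.018 +/- 0.016
print(f"Difference of means: {mean(plain) - mean(voynich):.3f}")  # ~0.163
```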

A clearer picture emerges from the statistics of the first and last characters of the words. For the four plain texts, the first characters have almost the same entropy as the full set of characters. The word-final characters have a much reduced entropy value; this is most conspicuous for the Italian text. For the Voynich text we see that the word-initial characters already have a significantly reduced entropy, while the reduction for the word-final characters is even more pronounced. This is shown in the following plot, where the entropy of the word-initial characters is on the horizontal scale, and that of the word-final characters on the vertical scale.


Figure 6: character entropies of first characters (horizontal) and last characters (vertical) of words

Character pair frequency distribution

Introduction

As already shown by Bennett and others (5), it is especially for character pairs that the entropy values of the Voynich MS text are anomalous. We shall see below that this is not just a matter of a lower entropy value, but the entire character pair frequency distribution is unusual.

Latin text

Let us start by visualising the character pair frequency distribution of the Latin text of Mattioli, again not counting word spaces. In all of the following plots, the first character is on the horizontal scale (in decreasing frequency from left to right), and the second character on the vertical scale (in decreasing frequency from top to bottom). In the first plot, the left figure shows the actual distribution of character pair frequencies. The order of the characters is the same as in Figure 5 (the plot labeled "Mattioli") above. The figure on the right shows what the pair frequencies would have been if there were no dependency between the two characters (6). The difference clearly shows the preference of certain characters to combine with others.
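
A minimal sketch of this computation (the function name is mine; as in note 6, the expected frequency of a pair under independence is the product of the two single-character frequencies):

```python
from collections import Counter

def pair_statistics(text):
    """Observed and expected counts for character pairs within words."""
    words = text.split()
    singles = Counter(c for w in words for c in w)
    pairs = Counter(p for w in words for p in zip(w, w[1:]))
    n_singles = sum(singles.values())
    n_pairs = sum(pairs.values())
    # Expected count if the two characters of a pair were independent:
    expected = {(a, b): n_pairs * (singles[a] / n_singles) * (singles[b] / n_singles)
                for a in singles for b in singles}
    return pairs, expected

# The ratio observed / expected gives the red/blue plots further below.
```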


Figure 7: character pair frequency distribution for the Mattioli text (Latin)

To visualise the difference, we may also plot the ratio of the left and right figures. This is shown below. Red colours mean a preference or "over-representation" of certain pairs, while blue means an "under-representation".


Figure 8: relative character pair frequency distribution for the Mattioli text (Latin)

Voynich MS text

We may now draw the same plots for other texts. The following is for the FSG transcription of the Voynich MS. It uses the same colour scale, which is not repeated.


Figure 9: character pair frequency distribution for the Voynich MS text (FSG)

We see immediately that there is a very different pattern. The difference may be characterised by two aspects:

  1. For the Latin text the figure is largely, though not exactly, symmetrical. This is much less the case for the Voynich text, which means that the Voynich text has more restrictions, or preferences, regarding the order of the characters in a pair.
  2. For the Latin text there are reasonably large blocks in which the majority of character combinations are allowed to occur. For the Voynich MS text, however, the figure has more gaps. This means that there are more restrictions on which characters may follow each other.

At this point we may come back to an earlier observation about the behaviour of entropy in relation to simple substitution ciphers. In the above plots, the squares have not been labeled with the characters they represent; they have simply been sorted from high to low frequency. These plots, too, would remain completely unchanged if the text were passed through a simple substitution cipher.

Thus, if the text of Mattioli were encrypted by a simple substitution, this text would immediately become illegible, but the character pair plot as shown on this page would look identical to the one for the plain text.

This clearly demonstrates that it is not possible to apply a simple substitution to the FSG transcription of the Voynich MS text and come up with meaningful Latin: too many character combinations would be missing. To visualise this, the following figure presents the two ratio plots, for Latin and for the Voynich MS text (in FSG), side by side.


Figure 10: relative character pair distributions of Latin and Voynich text (FSG) compared

Comparison of all cases

Similar plots have been made for the other plain texts and for the other Voynich MS transcriptions, which are shown below.



Figure 11: relative character pair distributions for all 8 sample texts

A few observations from these plots may be highlighted. The figures for the two Latin texts are very similar, showing that the character pair distribution is a reasonable indicator of the language involved. The German text looks different from the other three plain texts; this is caused by the relative frequency of vowels and consonants, which is further discussed below. The four plots for the Voynich text are quite different from all of these.

Conditional entropy

It was stated above that, if the characters in a text were independent of the preceding character, the character pair entropy would be twice the single character entropy. In reality they are of course dependent, as all the plots have clearly shown, so the character pair entropy is less than twice the single character entropy. We may call the difference between the character pair entropy and the single character entropy the 'conditional character entropy'. It follows that the conditional character entropy is less than the single character entropy. The following table gives the values for the 8 cases.
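
In formulas, using the same notation as above:

$$ h_{\text{cond}} = h_2 - h_1 \le h_1, $$

with equality only in the independent case $h_2 = 2h_1$.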

Text         Language   Character   Conditional
                        entropy     entropy
C-D          Voynich    3.857       2.085
FSG          Voynich    3.829       2.052
ZL           Voynich    3.862       1.836
ZL (Curva)   Voynich    3.872       2.124

Mattioli     Latin      4.036       3.234
Pliny        Latin      3.998       3.266
Dante        Italian    4.014       3.126
Tristan      German     4.022       3.039

In this case, the differences between the Voynich text and the plain texts in Latin, Italian and German are quite spectacular, and reflect the significant differences seen in the plots. These tabular values may also be plotted against each other, with the single character entropy on the horizontal scale and the conditional character entropy on the vertical scale. It is immediately obvious that the differences between the various known languages are much smaller than the difference between the Voynich text and any of these languages. Among the points for the Voynich MS text files, the one that is somewhat separated from the other three belongs to the ZL transcription in the Eva alphabet. Since Eva tends to represent some characters of the script by pairs of transcription characters, this result was to be expected. It also shows that the use of the Curva alphabet effectively resolves this issue.


Figure 12: conditional character entropy (vertical) vs. single character entropy (horizontal)

A very important conclusion from this figure is that the different transcription alphabets show a similar result, which is therefore representative of the writing in the Voynich MS. Even though the Currier alphabet and the Eva alphabet have completely different definitions of what constitutes 'one single symbol' (FSG and Curva are probably closest to the truth), the statistics are similar, and quite far away from those of 'normal languages'.

Vowels and consonants

The plots for the plain languages are dominated by the alternation of vowels and consonants, and by the fact that such combinations tend to have the highest frequencies. The plot for the German text "Tristan" looks somewhat different from the other three, because there are more consonants among the highest-frequency characters. It is of interest to redo the plots with the vowels separated from the consonants. For the known languages this can easily be done, but for the text of the Voynich MS we do not know whether the characters can really be separated in this way.

This can be solved by applying a two-state Hidden Markov Model (HMM) to all texts. For known texts this effectively classifies all characters into vowels and consonants, though sometimes with some minor 'surprises'. The most frequent character in the languages of the four plain texts is a vowel. Therefore, for the Voynich MS text, we will call the 'vowel' state the one that includes the single most frequent character. This is the character o.

I have done this analysis based on my own, independent implementation of a two-state HMM (7). For texts in known languages it converges fairly quickly and produces the same results as more standard implementations. For the text of the Voynich MS there were clear convergence issues. The results reported by Reddy and Knight (2011) (8) already suggest that this method is not successful in identifying vowels and consonants in the Voynich MS text (9). After some experimentation, however, it turned out to help to treat spaces as a separate, third state, and not to try to minimise in any way the transition probabilities between the two character states ('vowel'/'consonant') and the space state.
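
The implementation used here is a purely statistical method based on the character pair frequency distribution (see note 7). As an illustration only, the following is a minimal sketch of the more 'classical' alternative: a two-state HMM trained by Baum-Welch re-estimation. All names and details are assumptions of this sketch, and the third 'space' state mentioned above is omitted for brevity:

```python
import numpy as np

def baum_welch_2state(obs, n_symbols, n_iter=100, seed=0):
    """Minimal Baum-Welch for a 2-state HMM over a sequence of symbol indices.

    obs: sequence of integers in 0 .. n_symbols-1.
    Returns the transition matrix A (2x2) and emission matrix B (2 x n_symbols).
    """
    obs = np.asarray(obs)
    T = len(obs)
    rng = np.random.default_rng(seed)
    A = rng.dirichlet(np.ones(2), size=2)          # state transition probabilities
    B = rng.dirichlet(np.ones(n_symbols), size=2)  # symbol emission probabilities
    pi = np.full(2, 0.5)                           # initial state probabilities
    for _ in range(n_iter):
        # Forward pass, with per-step scaling to avoid numerical underflow.
        alpha = np.zeros((T, 2))
        scale = np.zeros(T)
        alpha[0] = pi * B[:, obs[0]]
        scale[0] = alpha[0].sum()
        alpha[0] /= scale[0]
        for t in range(1, T):
            alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
            scale[t] = alpha[t].sum()
            alpha[t] /= scale[t]
        # Backward pass, re-using the same scaling factors.
        beta = np.zeros((T, 2))
        beta[-1] = 1.0
        for t in range(T - 2, -1, -1):
            beta[t] = (A @ (B[:, obs[t + 1]] * beta[t + 1])) / scale[t + 1]
        # State occupation and transition statistics.
        gamma = alpha * beta
        gamma /= gamma.sum(axis=1, keepdims=True)
        xi = np.zeros((2, 2))
        for t in range(T - 1):
            xi += np.outer(alpha[t], B[:, obs[t + 1]] * beta[t + 1]) * A / scale[t + 1]
        # Re-estimate the model parameters.
        A = xi / xi.sum(axis=1, keepdims=True)
        for k in range(n_symbols):
            B[:, k] = gamma[obs == k].sum(axis=0)
        B /= B.sum(axis=1, keepdims=True)
        pi = gamma[0]
    return A, B
```

Each symbol k is then assigned to the state with the larger B[i, k]; for the plain texts, one of the two states collects the vowels.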

The results are shown in the following figures. The first plot shows the expected effect on the organisation and colour distribution of these plots.


Figure 13: expected patterns of vowel and consonant frequency distributions

The following plots show the effect of the new sorting on the Mattioli text. The HMM-sorted text has all 'vowels' in decreasing frequency first, followed by all 'consonants' in decreasing frequency.



Figure 14: effect of vowel/consonant separation (by HMM) on Mattioli text (Latin)

The following plots show the effect of the application of the same procedure on all texts that have been used before.



Figure 15: effect of HMM algorithm on all 8 sample texts

The following table shows the results of the separation of characters.

Mattioli (Latin)
    "Vowels":      i e a u o y
    "Consonants":  t s r n m c l d p q b v f g h x z k

Pliny (Latin)
    "Vowels":      e i a u o y w
    "Consonants":  t s n r m c l d p q b v g f x h z k

Dante (Italian)
    "Vowels":      e a i o u h x
    "Consonants":  n r l t s c d m p v g f b q z

Tristan (German)
    "Vowels":      e i a h u o k y
    "Consonants":  n s r t d l c g m b w f z v p j q x

Voynich (C-D)
    "Vowels":      O 9 C A 0
                   (in Eva: o y e a iiir)
    "Consonants":  8 S E F R 4 P Z M 2 N Q B X J V T W D 3 U I 6 Y K 7 G H L 5
                   (in Eva: d ch l k r q t Sh iin s in cTh p cKh m f ir cPh n iiin iir i g cFh im j il iil iim iiim)

Voynich (FSG)
    "Vowels":      O C G A T Z 7
                   (in Eva: o e y a ch (c h) j)
    "Consonants":  C 8 D E R H 4 M S 2 P K I N F 0 L Y
                   (in Eva: e d k l r t q iin Sh s p m i in f n x)

Voynich (ZL - Eva)
    "Vowels":      o e y a c s n u
                   (in Eva: o e y a C s n u)
    "Consonants":  h d i k l r t q p m f g x b j v z
                   (in Eva: h d i k l r t q p m f g x b j v z)

Voynich (ZL - Curva)
    "Vowels":      O Y A E U F
                   (in Eva: o y a e ee f)
    "Consonants":  D S K L R T H Z M C N P J I G X B Q V
                   (in Eva: d ch k l r t q Sh iin s in p m/z i g x b j v)

For the plain texts, the most obvious 'unexpected' outcome is the listing of h as a vowel in the German text. In general, and depending on settings, either c or h tends to be classified as a vowel. Looking at the plot in detail, one can see that the h (fourth row and fourth column) has very different combinations to the left and to the right. The single dark red square in the fourth row corresponds to the combination ch.

For the Voynich texts, the first thing to note is that, for the FSG transcription, the character e appears in both the vowel and the consonant state. Indeed, the algorithm converged with a 50/50 probability for this character.

Apart from that, as already indicated, the HMM algorithm had trouble converging for the Voynich MS texts. While this still needs to be investigated further, it can tentatively be attributed to the asymmetry of the plots that was observed earlier: characters tend to make different combinations 'to the left' than 'to the right'.

Despite these problems, the above plots show the difference between the Voynich MS text and a Latin plain text even more clearly.

What did we learn from all this?

How was the Voynich MS text created?

It was already mentioned in the course of this page that the Voynich MS text is definitely not a Latin text that has been transcribed using an invented alphabet (simple substitution). However, we should also keep in mind what has been discussed on the previous page, namely that there is a whole tree structure of possible ways in which the Voynich MS text may have been composed. An encoding of a Latin text by simple substitution is just one very specific example, based on several different assumptions. For the present purpose we may separate all cases into the following two larger classes:

  1. The text was generated by some 'process', without an underlying plain text
  2. The text is the result of converting a meaningful plain text in some language

A small digression follows, just to give an example of a 'process without a plain text'.

Texts and "monkey" processes

An interesting set of experiments in Bennett (see note 1), is what he refers to as "Monkeys" (10). These are Markov processes that generate random texts (as if a monkey was arbitrarily punching on a typewriter), though with pre-defined statistical properties. A "first-order monkey" would generate texts with a predefined character distribution. A "second-order monkey" would generate texts with a pre-defined character pair distribution.

The interesting part of this is that such an automatically generated text would be meaningless, but have exactly the same character distributions as shown in all the above plots. The first-order monkey would generate the plots that appear on the right-hand side of Figures 7 and 9. The second-order monkey would generate the left-hand side of these. As Bennett further illustrates, higher-order monkeys would generate texts that start showing some words in the language that was used to define the statistical properties of the monkey.
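
Such a monkey is straightforward to sketch (an illustrative toy, not Bennett's original program): each next character is drawn according to the pair frequencies observed in a sample text.

```python
import random
from collections import Counter, defaultdict

def second_order_monkey(sample_text, length=200, seed=42):
    """Generate random text that mimics the character pair statistics of sample_text."""
    rng = random.Random(seed)
    successors = defaultdict(list)
    for (a, b), n in Counter(zip(sample_text, sample_text[1:])).items():
        successors[a].append((b, n))
    out = [rng.choice(sample_text)]
    for _ in range(length - 1):
        options = successors.get(out[-1])
        if not options:  # dead end: restart from a random character
            out.append(rng.choice(sample_text))
            continue
        chars, weights = zip(*options)
        out.append(rng.choices(chars, weights=weights)[0])
    return "".join(out)
```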

It is not realistic to assume that a medieval jokester would have done this. However, it shows that there is a way to randomly generate text through some 'process', rather than by converting a real plain text. If this is what was done, then the entropies and character distributions we are observing are those of the process, so we would need to look for a process that has these properties. In other words, when experimenting with processes, we can check how close they come to generating these properties.

A text and a conversion method

Let us now look at the more interesting case: that the Voynich MS text was generated by taking some plain text in some language, and applying a process to it that resulted in the text we have. We do not yet know the language, and the process may have been a cipher or something else. The original text had its own entropy and its own character pair distribution, but the process could have modified them. This basically leaves two options:

  1. The entropy of the original text was already close to that of the Voynich MS text,
  2. or it was not, but it was significantly reduced by the process.

To make matters more complicated, intermediate possibilities also exist, whereby the original entropy was already smaller (i.e. the character combinations already much reduced), and the process did not (need to) reduce it by much. Furthermore, the 'process' may have consisted of several steps, each with its own impact on the entropy. We already know that one of these steps involved rendering the text in an invented alphabet, but if that was a simple substitution step 'at the end', it had no impact on the statistics shown on this page, so these have to be explained by any or all previous steps.

Only three languages have been analysed here, one of which (German) was a modern text, but it should be clear that any similar language, i.e. one with a similar alternation of vowels and consonants, will not make any significant difference. Looking again at Figure 12, we see that such a language will not bring us anywhere nearer to the Voynich MS text properties. We either need a drastically different language, or a 'conversion method' that makes drastic changes to the character distribution.

Time for another short intermezzo or two.

Latin abbreviations and their expansion

Numerous 'solutions' of the Voynich MS text have been proposed that involve the expansion of abbreviations such as are frequently found in medieval MS texts. This means that the text is proposed to be similar to a simple substitution of a Latin text (which we already know cannot work), but with the additional feature that certain characters of the Voynich MS text should be expanded to combinations of plain text characters. Typically, it is proposed that the Voynich MS symbol y should be translated as "us". This is consistent with actual historical usage.

Would this expansion of characters be able to explain the difference we have seen in the plots above? The answer is a very firm: no! There are several reasons for this.

The main point is a general statistical one, from information theory. Compression of a string of characters means an increase in its per-character entropy, while the inverse process (decompression of the compressed file) means a decrease. Thus, a process converting a plain text with higher entropy into the Voynich MS text is equivalent to some kind of expansion. Replacing Voynich MS characters by further expansions will therefore not increase the entropy, but rather reduce it (11). Because of what has just been described, the Voynich MS text has occasionally been called (or compared with) a verbose cipher (12).

The above explanation may be illustrated by looking more closely at the results for the ZL transcription in the Eva and Curva alphabets. Starting from the Curva transcription, the Eva transcription is exactly like a verbose cipher: single Curva characters (for example S, Z and E) are written as the Eva character pairs ch, Sh and ee.
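
As a small sketch of this relationship (the mapping below covers only the pairs visible in the HMM table above, and is illustrative rather than complete):

```python
# Illustrative Curva -> Eva expansions, taken from the HMM table above;
# each single Curva symbol becomes a fixed pair of Eva characters.
CURVA_TO_EVA = {"S": "ch", "Z": "Sh", "E": "ee"}

def curva_to_eva(text):
    """Expand Curva symbols into Eva pairs, in the manner of a verbose cipher."""
    return "".join(CURVA_TO_EVA.get(c, c) for c in text)

# char_pair_entropy(t) - single_char_entropy(t) (see further above) gives the
# conditional entropy; for the expanded text it is lower, as in the table.
```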

The Eva plots in Figures 11 and 15 tend to be more sparsely filled than the Curva plots, even though the difference is not very great. Likewise, expansion of the Voynich MS text under the assumption that it contains abbreviations will have a similarly small effect. It will not be capable of 'filling up all the gaps' in order to arrive at something similar to the plots for plain Latin.

Written or spoken text?

All entropy calculations that have been presented are based on a written text. There have long been suggestions that the Voynich MS could be a (first) rendition in writing of a language that was never written before. In this case, the writing should more closely follow a spoken text. For many languages, the written text may differ considerably from the spoken text. Consider the phoneme which is written in Czech as š, in English as sh and in German as sch. This is 1, 2 or 3 characters for the same phoneme.

This means that not all conclusions drawn from comparisons with written texts may be valid for a close approximation of a spoken text. However, the issue already identified, namely the very restricted combinations of vowels and consonants, remains. Until someone finds a good way of experimenting with this, we need to keep this caveat in mind.

Looking at other types of languages

Based on what has been written above, the task of explaining the Voynich MS text and its properties may be one of the following:

  1. Find some process that generates a meaningless string of characters with the statistical properties described here
  2. Find a language that has similar properties as the Voynich MS text
  3. Find a process that can modify the properties of Latin, Italian, German (etc) into those that we have seen here

In this section we will look at option 2. While this is only one of three options, many people consider it the fundamental question about the Voynich MS: to find the language. Looking back at Figure 12, we see that the Voynich MS text is completely different from Latin, Italian or German. Other closely related languages and dialects will not make any significant difference, so we need to look for a very different language.

Bennett (see note 1) included a number of other languages in his analysis, but found no good match, with the exception of Hawaiian. This language is not a very likely candidate for the source language of the Voynich MS text. Reddy and Knight (see note 8) find that the predictability of characters, derived from entropy, is similar to that of Chinese transliterated in Pinyin. Both points indicate that there are languages that differ very significantly from Latin/English/German in terms of entropy.

A priori, some languages are more likely than others as a potential source for the Voynich MS text. Greek, Hebrew, several languages using the Arabic script, Coptic or Armenian are all reasonable candidates, none of which have been subjected to dedicated entropy analyses, to the best of my knowledge (13). These tend to have in common that they do not use the Roman alphabet, and there could be various different ways that they could be converted to the set of 24-36 symbols that we see in the MS.

This means that for some languages to be tested, some 'process' is automatically implied from the start. In some cases this may be simple, but in others not.

To be continued...

Some initial considerations about the 'process'

To be written.

Summary

This part needs to repeat the main points, and what we may conclude from them at this stage.

There are still further topics that may help us move in the right direction.

Notes

1. See Bennett (1976).
2. Using IVTT. The 0 found in the FSG files still needs to be checked further.
3. A description of the plain text files used in this page will have to be added.
4. Characters with umlauts have been converted with a trailing e, e.g. "ä" becomes "ae".
5. As summarised here.
6. In this case, the expected frequency for each character pair is the product of the two corresponding single-character frequencies.
7. This is a purely statistical method based on the character pair frequency distribution. Experimentation with a more 'classical' HMM implementation should still be done.
8. Reddy and Knight (2011).
9. Also confirmed by earlier discussions with Jim Reeds.
10. The term was also adopted by the Voynich researcher Jacques Guy, who created a tool to produce such automatically generated texts.
11. Since the entropy is already extremely low, there is not much scope for reducing it further, and an opposite effect can be achieved by the introduction of new character combinations. In practice, the entropy may stay at roughly the same low level.
12. Of course, the real answer is not so simple, since verbose ciphers would increase the average word length, and the Voynich MS average word length is not at all longer than that of most common languages.
13. Whereas this page has numerous other statistics.

 


Copyright René Zandbergen, 2017
Latest update: 10/12/2017