
From digraph entropy to word entropy in the Voynich MS

Introduction

Both the first and second order entropies of the text of the Voynich MS are lower than those of Latin or English, as has been demonstrated here. On top of that, the words in the Voynich MS tend to be relatively short (1). Thus, one would expect Voynichese words to be more restricted, or less diverse, than words in Latin. This word diversity can be measured either by counting the number of different word types for texts of various lengths, or by computing the single word entropy from the word frequency distribution. Both statistics have shortcomings: the number of word types is affected by spelling or transcription errors, and the single word entropy can only be estimated reliably from long texts. Both statistics will therefore be computed for texts in Voynichese and in other languages, using samples of the same length (counted in number of words) to minimise these problems.
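As a minimal sketch of the first of these two statistics, the fragment below counts the number of distinct word types in samples of equal token length. The file names and the sample size of 8000 words are placeholders for illustration only, not the actual source texts used in this study.

def word_types(tokens, sample_size):
    """Count the distinct word types among the first sample_size tokens."""
    return len(set(tokens[:sample_size]))

# Placeholder file names; any plain text with whitespace-separated words will do.
for name in ["genesis_vulgate.txt", "voynich_stars_curva.txt"]:
    with open(name, encoding="utf-8") as f:
        tokens = f.read().lower().split()
    print(name, word_types(tokens, 8000))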

As shown here, a long section of Voynich MS text has a single word entropy of roughly 10 bits, just like normal languages. This presents an apparent contradiction: if the text of the Voynich MS uses fewer characters, with restricted variability, yet still produces a normal variety of words, then the other languages must be wasteful. This 'waste of information' is investigated below.

Short description of the numerical analyses

The appearance of a particular item (here: one or more characters from the start of a word token, or the entire word token) at a certain point in a text is an event with a probability p(it) which is between 0 and 1. The amount of information gained if this item occurs (expressed in bits) equals:

b(it) = - log2( p(it) )

using the logarithm of base 2 (2).

Taking the average number of bits of information over all word tokens results in the standard formula for entropy. Using HW as the symbol for the single word entropy:

HW = SUM p(it) * b(it) = - SUM p(it) * log2( p(it) )

Summation is over all distinct word types.
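A direct way to estimate HW from a list of word tokens is sketched below; the tokenisation (lower-case, whitespace-separated words) and the function name are simplifying assumptions for illustration.

from collections import Counter
from math import log2

def word_entropy(tokens):
    """Single word entropy HW = - SUM p(it) * log2(p(it)) over all word types."""
    counts = Counter(tokens)
    total = len(tokens)
    return -sum((c / total) * log2(c / total) for c in counts.values())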

The bits of information can be considered to be distributed over all characters of the word. If we take the appearance of the English word 'the' at a certain position in a text, the probability p('the_') can be (trivially) decomposed as:

p(:the_) = p(:t) * [p(:th)/p(:t)] * [p(:the)/p(:th)] * [p(:the_)/p(:the)]

where the colon indicates a word start (which will be omitted in the future) and the underscore denotes a space (the word end).

Each of the ratios in this long product represents a conditional probability p(A|B) (the probability of 'A' given condition 'B') counting from the start of the word.

p(the_) = p(t) * p(th|t) * p(the|th) * p(the_|the)

Taking the negative base-2 logarithm of the previous expression gives:

b(the_) = b(t) + [b(th) - b(t)] + [b(the) - b(th)] + [b(the_) - b(the)]
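This decomposition can be checked numerically on any corpus by counting word-initial prefixes. Below is a minimal sketch, assuming lower-case, whitespace-separated tokens that contain no underscores and a word that actually occurs in the token list; the function name is illustrative.

from math import log2

def per_character_bits(tokens, word):
    """Split b(word_) into b(first character) plus conditional bits for the rest."""
    padded = [t + "_" for t in tokens]        # mark each word end with an underscore
    target = word + "_"
    # Number of word tokens starting with each prefix of the target word.
    prefix_counts = [sum(1 for t in padded if t.startswith(target[:i + 1]))
                     for i in range(len(target))]
    bits = [-log2(prefix_counts[0] / len(padded))]            # b(first character)
    for prev, cur in zip(prefix_counts, prefix_counts[1:]):
        bits.append(-log2(cur / prev))                        # conditional bits
    return bits                                               # sums to -log2(p(word_))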

It is more interesting to take the average over all words, which is achieved by replacing the number of bits for one token by the corresponding entropy values.

Source texts in various languages will be analysed by computing entropy values for word-initial sequences of various lengths (restricted here to 1-, 2-, 3- and 4-character sequences) and the full word entropy. These values will be denoted by Hw1, Hw2, Hw3, Hw4 and HW respectively. The conditional entropies are denoted hw2, hw3, hw4 and hW, each computed as the difference between two absolute entropies. Thus, the average numbers of bits of entropy per word and per character are related as follows:

HW = Hw1 + hw2 + hw3 + hw4 + hW

where hW represents the combined information contained in all characters after the fourth.

One practical consideration is how to combine the statistics of words of different lengths. To handle this, in particular for very short words (fewer than three characters), all words are considered to end with an infinite number of trailing spaces. The first space at the end of a word (e.g. the underscore used above in 'the_') is a significant character: it tells the reader that the word will not be 'there' or 'their'. All remaining spaces carry no further information, as the probability of a space following a space equals one, and its contribution to the entropy is zero.
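Putting the pieces together, the word-initial entropies Hw1 to Hw4 and the full word entropy HW can be computed from prefix frequency distributions, padding short words with trailing underscores as described above. The sketch below uses the same simplifying tokenisation assumptions as before; the conditional values hw2 to hW then follow as differences, so that HW = Hw1 + hw2 + hw3 + hw4 + hW holds by construction.

from collections import Counter
from math import log2

def entropy(counts):
    """H = - SUM p * log2(p) for a frequency distribution."""
    total = sum(counts.values())
    return -sum((c / total) * log2(c / total) for c in counts.values())

def word_initial_entropies(tokens, max_len=4):
    """Return [Hw1, ..., Hw4] and HW, padding word ends with trailing underscores."""
    padded = [t + "_" for t in tokens]
    Hw = [entropy(Counter(t[:n].ljust(n, "_") for t in padded))
          for n in range(1, max_len + 1)]
    HW = entropy(Counter(padded))
    return Hw, HW

# Conditional entropies as differences of absolute entropies:
# hw2 = Hw2 - Hw1, hw3 = Hw3 - Hw2, hw4 = Hw4 - Hw3, hW = HW - Hw4,
# so that HW = Hw1 + hw2 + hw3 + hw4 + hW by construction.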

Results

Tests were run for both short and long texts. The need for using long texts is clear, but only short texts are available for various sections of the Voynich MS.

The number of item types and the entropy values are plotted below as a function of text length. As for full words, the number of types can also be computed for word-initial n-graphs. The following colour legend applies to the short texts. The Curva transcription alphabet (a mixture of Currier and Eva) was introduced on this page. (Note that the figure headings say 'tokens' where, by the convention used here, they should say 'types'.)

Table 1: legend to the following set of Figures
red 24 pages of herbal-A, in Curva
blue 24 pages of herbal-B, in Curva
pink Same herbal-A sample, in Eva
cyan Same herbal-B sample, in Eva
green Genesis (Vulgate)
grey De Bello Gallico (Latin)

Counts

Entropy

Some initial observations follow, before a more complete discussion in the section 'Conclusions' below. (Admittedly, the figures may be a bit hard to interpret. The most conspicuous feature is that the number of types grows with increasing text length, while the entropy converges to a fixed value. The graphs have been combined in a sub-optimal way; I hope to redo them, which requires redoing all the calculations.)

It is necessary to confirm the above observations by using longer text samples. The longest consistent part in the Voynich MS is formed by the stars (recipes) section in Quire 20. It is compared (entropy only) with the following other texts:

Table 2: legend to the following Figure
green Genesis chapters 1-25 (Vulgate)
grey De Bello Gallico (Latin)
blue Stars or recipes section of the VMs, in Curva

Entropy for longer text samples

Finally, two tables give the numerical values from the above graphs for a text of 8000 words:

Table 3: Cumulative bits of information
Text                First 1   First 2   First 3   First 4   All
De Bello Gallico     4.0891    6.3754    8.0874    8.8913   10.1625
Genesis              3.9840    6.2321    7.8513    8.5048    9.3313
VMs stars            3.2321    5.2267    7.2669    8.8186    9.9265

 

Table 4: Bits of information per character
Text                1st       2nd       3rd       4th       Rest
De Bello Gallico     4.0891    2.2863    1.7120    0.8039    1.2712
Genesis              3.9840    2.2481    1.6192    0.6535    0.8265
VMs stars            3.2321    1.9946    2.0402    1.5517    1.1079
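Table 4 simply contains the differences between successive columns of Table 3. For example, for De Bello Gallico the second character contributes hw2 = Hw2 - Hw1 = 6.3754 - 4.0891 = 2.2863 bits, and all characters beyond the fourth together contribute hW = 10.1625 - 8.8913 = 1.2712 bits.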

Interpretation

The above shows that:

  1. The first characters of Voynichese words are more predictable (lower entropy) than those of Latin words.
  2. From the third character onwards, Voynichese words contain more information per character (in the conditional sense) than Latin words.
  3. Despite the restricted word beginnings, the full word entropy of Voynichese is comparable to that of the Latin sample texts.

The last point above is rather remarkable. William Friedman thought that the Voynich MS could be in some early form of constructed language, and the above observations are quite compatible with this hypothesis.

Also, due to the lack of success in comparing Voynichese with 'normal' European languages, there have been suggestions that the Voynich MS may be some alphabetic rendering of a syllabic language or writing system.

Both suggestions have chronological difficulties, but cannot be rejected off-hand. Sample texts for both are available, both relatively short. These are compared with Voynichese below, together with two further unusual texts which were obtained by modifying the Vulgate Genesis or De Bello Gallico using word games explained at >> Jorge Stolfi's web pages.

red 24 pages of herbal-A, in Curva
blue 24 pages of herbal-B, in Curva
yellow Chinese sample in pinyin
cyan The Lord's prayer in Dalgarno's language
green Genesis modified with Gabriel Landini's dain daiin substitution
grey De Bello Gallico modified with René Zandbergen's word scrambling substitution

The Chinese sample has a value of Hw1 slightly above 4, which is the same as in Latin, and significantly higher than in Voynichese. It has a value of HW which is significantly lower than that of Voynichese and Latin, showing that the language/script uses fewer word types. Thus, the idea that Voynichese is a straightforward written version of a syllabic script does not match these statistics, and a more complicated method of converting the syllabic script to Voynichese would be required in order to maintain this hypothesis. The statistics for Caesar's Latin shown above indicate that it would in fact be easier to convert Latin to Voynichese than to do this with Chinese.

The artificial language is the only text sample that exhibits values of Hw1 and hw2 similar to Voynichese. Beyond that, however, it suffers from a shortage of word types even more severe than that of the Pinyin sample. Admittedly, the contents of the sample are likely to be partly responsible for this feature, but it is questionable whether any invented language would have a sufficiently rich vocabulary, and the same question may be asked about the vocabulary of glossolalia.

The modified Latin texts were not offered as realistic potential explanations for the anomalies in Voynichese. The 'dain daiin' substitution makes the word entropy collapse to below the level of Hw2 of Latin. The word scrambling technique is closest to Voynichese overall, but its hw2 is still significantly higher.

Conclusions

No definite conclusions can be drawn, and if certain hypotheses about the nature of the Voynich MS language seem to be contradicted, it may be possible to find more elaborate ways to match Voynichese with syllabic writing, artificial languages, glossolalia or a word game.

Still, the following has been found:

  1. The apparent words in the Voynich MS statistically behave like real words. They are as varied as the words in Latin texts of a similar length.
  2. The first and second characters of Voynich words (using the Curva alphabet) have lower entropy than those of Latin words. The Voynich words contain more information from the third character onwards (in the conditional sense).
  3. The word-initial statistics of Voynichese are matched by one example of an artificial language (which postdates the Voynich MS by two centuries).
  4. The statistics of Voynichese and a Mandarin text represented in Pinyin (using a trailing numerical character to indicate tone) are very different.
  5. A word game-type algorithm to convert Latin to Voynichese must:
    1. Increase the predictability of word starts
    2. Make words shorter
    3. Maintain the size of the vocabulary

Acknowledgments

Most of the text samples used in this study were prepared by Jorge Stolfi. The sample of Dalgarno's language was prepared by Adam McLean.

Notes

1
As demonstrated e.g. by Gabriel Landini in a paper about Zipf's law. The reference is temporarily missing (see here).
2
See the introduction to the entropy concept.

 


Copyright René Zandbergen, 2015
Latest update: 25/12/2015