Voynich MS - Analysis Section ( 1/5 )


Table of contents

1. Introduction
  1.1 Currier languages
  1.2 Entropy
  1.3 Zipf's laws
  1.4 Cluster analysis
  1.5 Other tools
2. Properties of characters
3. Word structure
4. Properties of words
5. Properties of word combinations
6. Search in mailing list archives (to be written)

1. Introduction

It is difficult to present the multitude of analyses that have been performed on the Voynich MS text in an orderly fashion. Some analyses concentrate on the properties and distribution of individual characters, others on those of words and others again on the combinations of words. There are also studies that take into account more than one of the above.
Some analyses are more of a qualitative nature while others are more quantitative, but there is no clear dividing line between the two.

The approach adopted here is to use three main categories (character analysis, word analysis and syntax analysis), preceded by an introductory section (this page) which explains the various techniques that can be applied to each of them. The reader is reminded that the script of the MS is discussed in a previous section, and that another page describes the manuscript transcription effort.

Character Analysis
This includes:

Word Analysis
This is split into two parts, in separate sections. The first section treats the so-called word paradigms, a unique property of the Voynich MS 'language', whereby words appear to be 'molded' according to some set rules.
The second section includes:

Syntax analysis
This includes:

Following now are some sections describing some key topics in the analysis of the Voynich MS text.

(The whole analysis section is still quite incomplete.)

1.1 Currier languages

The first thing any analyst of the Voynich MS will do is to count and make frequency tables of single characters, pairs, triplets, etc. and to do the same with the (apparent) words. When doing this, it appears immediately that some statistical properties are strongly page-dependent. This was already noticed by Th. Petersen (who used all pages), but first reported in detail by Currier (who did not use several parts of the MS for this study).
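The counting step described above can be sketched in a few lines of Python. This is only an illustration: the sample string is a made-up toy text, not a transcription excerpt, and the function name ngram_counts is an assumption introduced here. Character groups are counted within words, so that pairs do not run across word spaces.

```python
from collections import Counter

def ngram_counts(words, n):
    """Count overlapping character n-grams inside each word."""
    counts = Counter()
    for word in words:
        counts.update(word[i:i + n] for i in range(len(word) - n + 1))
    return counts

# Hypothetical toy text standing in for one transcribed page.
sample = "abab abba baab"
words = sample.split()

singles = ngram_counts(words, 1)   # single characters
pairs = ngram_counts(words, 2)     # character pairs
triplets = ngram_counts(words, 3)  # character triplets
word_freq = Counter(words)         # whole (apparent) words
```

Running the same counts separately for each page, and comparing the resulting tables, is one way to make the page dependence visible.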

Currier indicated that the Voynich MS appears to have been written in two languages, which he called A and B. He was careful to point out that these are not necessarily different languages, but could reflect different dialects, subject matter or encryption methods, if the MS is indeed written in a code or cipher. Currier also detected two handwriting styles (which he called 1 and 2) and found a perfect correlation between language and hand: all pages in language A were in hand 1 and all pages in language B were in hand 2. From this he concluded that the MS had to be the work of at least two people. In fact he suggested further hands which he called 3, 4, X and Y, but while everyone essentially agreed with his identification of languages A and B and hands 1 and 2, the other hands were not as generally accepted.

Currier presented his findings in detail during a symposium about the Voynich MS which was held on 30 November 1976 and led by Mary D'Imperio. Currier's paper has been converted into electronic form, and the complete file, including all tables, is available in PostScript on a page at Jim Reeds' web site (using Stolfi's mirror).

The main properties of Currier's two languages are:

The above shows that the Currier language is evident from criteria based on single characters, character groups and whole words. A particular feature of the Currier languages is that, in general, complete bifolios are written in a single Currier language.

1.2 Entropy

The entropy of the language of the Voynich MS was first studied by Bennett (1976). He found rather anomalous values for the Voynich MS text compared with most European languages (old and new), and this became one of the main topics for subsequent investigations. The meaning of entropy is therefore introduced in some detail here. Note that this is not a very formal mathematical introduction, but one mainly aimed at allowing the reader to understand the various analyses that use it.

Entropy is a quantity that can be interpreted as the amount of 'chaos' or unpredictability, in the sense that lower values of entropy are equivalent to higher amounts of order or predictability. If a string of characters has full predictability, it carries no information. Once one knows the first character, one can predict all subsequent ones: one knows everything. The entropy of a piece of text is therefore also a measure of the amount of information it carries. The entropy values used in the study of the Voynich MS text are usually expressed in bits (of information).

This is best elaborated using a simple example.

Imagine someone were to create a string of numbers using a die. He would roll the die, write down the top face number, and repeat this process as long as he wanted to. The number that appears each time (at each event) is a piece of information. The less probable the event, the more information is gained. If the die is a perfect 6-face die, then the probability (p) of throwing a '1' is 1/6. The number of bits of information (b) gained at this event is the base-2 logarithm (written here as 2log) of 1/p, or:

  b = - 2log(p)

On average, the number of bits of information gained at the appearance of each number (which is the entropy, denoted here by a capital H) is the weighted average over all possible outcomes:

  H = Sum { p . b }  =  - Sum  { p . 2log(p) }

In our example of a string of text (whether digits or characters) generated by throwing a perfect 6-face die, the entropy on a single-character basis is:

  H(char) = - Sum { 1/6 . 2log( 1/6 ) } = 2log ( 6 )

which is about 2.585 bits. Had the die not been perfect, but weighted, the six probabilities would not all have been equal to 1/6, and the resulting entropy value would have been lower than the above value. (A mathematical proof of this is straightforward but outside the scope of this page). The die is a bit more predictable (has a tendency towards the most probable number) and the entropy is lower.

Had the die not had 6 sides but any number N, the maximum entropy (in case all probabilities are equal) would have equalled 2log ( N ).

The main thing to be remembered from the above is that the entropy is a value that can be computed for something which can assume a number of values, and the sum of the probabilities for each of the values is one. One could use the index i to denote each of the values, and p(i) the probability that the 'thing' has value i. The formula for the entropy of this is:

  H =  - Sum  { p(i) . 2log [p(i)] }
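As a minimal sketch, the formula above can be written as a short Python function and applied to the die example. The weights of the loaded die are hypothetical values chosen only for illustration.

```python
import math

def entropy(probs):
    """Entropy in bits of a discrete distribution whose values sum to one."""
    # Outcomes with zero probability contribute nothing to the sum.
    return -sum(p * math.log2(p) for p in probs if p > 0)

fair_die = [1 / 6] * 6
loaded_die = [0.5, 0.1, 0.1, 0.1, 0.1, 0.1]  # hypothetical weights

h_fair = entropy(fair_die)      # equals 2log(6), the maximum for 6 outcomes
h_loaded = entropy(loaded_die)  # lower, because the die is more predictable
```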

For a piece of text, the single-character entropy can be computed using the probabilities of the occurrence of each of the characters of the alphabet. For a text written in an alphabet of 26 characters, this entropy will be less than 2log(26), or 4.700, since not all 26 characters will appear equally frequently.

An important distinction is that the single-character entropy of any language can only be approximated by performing the above calculation. Apart from the fact that the entropy will depend on the subject matter and the writing style of the author, it is clear that texts which are not long enough will tend not to show the correct probabilities especially for those characters which occur relatively infrequently.
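In practice the probabilities are estimated from character counts in the text at hand. The following sketch shows such an estimate; the function name char_entropy and the sample string are assumptions made for illustration, and the result is only as good as the sample is long.

```python
import math
from collections import Counter

def char_entropy(text):
    """Estimate the single-character entropy of `text` in bits."""
    counts = Counter(text)
    total = sum(counts.values())
    return -sum(c / total * math.log2(c / total) for c in counts.values())

# Hypothetical toy sample; spaces are removed so that only letters count.
short_sample = "the cat sat".replace(" ", "")
h_short = char_entropy(short_sample)

# Upper limit for a 26-character alphabet with equal frequencies.
upper_bound = math.log2(26)
```

A very short sample such as this one contains only a few distinct characters, so its estimate falls well below the value a long text in the same language would give.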

Entropy can also be computed for words rather than single characters. A text written in a vocabulary of 10,000 words will have a word entropy less than 2log(10000) or less than 13.288, depending on the distribution of the word frequencies. It can furthermore be computed for character pairs. Going back to the example of throwing the die, there are 36 possibilities for a pair of throws. The entropy for this (assuming a perfect die) is 2log(36) which is 2 times 2log(6). It is evident that if the occurrence of the two consecutive events is independent, the 'pair' entropy equals twice the 'single event' entropy.
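The statement about independent events can be checked numerically. The sketch below builds the distribution of all pairs of throws as products of the single-throw probabilities (which is exactly the independence assumption) and compares the two entropies. The loaded die weights are again hypothetical.

```python
import math
from itertools import product

def entropy(probs):
    """Entropy in bits of a discrete distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def pair_entropy(probs):
    """Entropy of two independent draws from the same distribution."""
    return entropy(p * q for p, q in product(probs, repeat=2))

fair_die = [1 / 6] * 6
loaded_die = [0.5, 0.1, 0.1, 0.1, 0.1, 0.1]  # hypothetical weights

h_pair_fair = pair_entropy(fair_die)      # 2log(36) = 2 * 2log(6)
h_pair_loaded = pair_entropy(loaded_die)  # twice the single-throw entropy
```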

In natural language, however, the occurrence of a character in a text is not independent from what the previous character was. For example, in English the probability of encountering the character 'u' depends highly on whether the previous character was a 'q' (in which case the probability is essentially 1), another 'u' (in which case it would be very close to zero) or anything else. This introduces the concept of conditional entropy. It can be shown mathematically that the conditional single-character entropy (the entropy of the probability distribution of a single character, given that the preceding one is known) equals the difference between the character pair (=digraph) entropy and the single character entropy. This conditional character entropy is less than the 'normal' character entropy.
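The relation between the digraph entropy and the conditional entropy can be illustrated as follows. This is a sketch under one stated assumption: the single-character counts are taken over all characters that have a successor (every character except the last), so that they are exactly the first characters of the counted digraphs.

```python
import math
from collections import Counter

def entropy_from_counts(counts):
    """Entropy in bits of the distribution given by a table of counts."""
    total = sum(counts.values())
    return -sum(c / total * math.log2(c / total) for c in counts.values())

def conditional_entropy(text):
    """Digraph entropy minus single-character entropy, estimated from `text`."""
    singles = Counter(text[:-1])  # characters that are followed by another
    digraphs = Counter(text[i:i + 2] for i in range(len(text) - 1))
    return entropy_from_counts(digraphs) - entropy_from_counts(singles)

# In a strictly alternating string the next character is fully determined
# by the previous one, so the conditional entropy is zero.
h_cond_alternating = conditional_entropy("abababab")

# Hypothetical toy sample with less rigid structure.
h_cond_mixed = conditional_entropy("aabbabbaabab")
```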

A final word should be spent on terminology. Single-character entropy is sometimes called first-order entropy. Character-pair entropy is sometimes called second-order entropy, while the conditional single-character entropy is also sometimes called second-order entropy. The values given for these quantities should remove any doubt about what is meant, since the conditional second-order entropy is always less than the single-character entropy, which is always less than the character pair entropy.

1.3 Zipf's laws

Zipf's law (strictly the first Zipf law) concerns the frequency of words in a piece of text. If one orders the words according to decreasing frequency, i.e. labels the most frequent word as nr.1, the second most frequent word as nr.2, etc., and then plots the frequency of each word against its rank, the result should show a straight line with a gradient of -1, if both scales are logarithmic.
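The ranking procedure can be sketched as below. Instead of drawing the plot, the code fits a least-squares line through the points in double-logarithmic coordinates and returns its gradient. The word list is synthetic: it is constructed so that the counts are exactly proportional to 1/rank, which is an assumption made purely to show the expected gradient. The fit needs at least two distinct words.

```python
import math
from collections import Counter

def rank_frequency(words):
    """Return (rank, count) pairs, with the most frequent word at rank 1."""
    counts = sorted(Counter(words).values(), reverse=True)
    return list(enumerate(counts, start=1))

def loglog_slope(pairs):
    """Least-squares gradient of log(count) against log(rank)."""
    xs = [math.log(rank) for rank, _ in pairs]
    ys = [math.log(count) for _, count in pairs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx

# Synthetic text: word nr. r occurs 60 / r times (60, 30, 20, 15, 12, 10).
words = []
for rank in range(1, 7):
    words.extend([f"w{rank}"] * (60 // rank))

slope = loglog_slope(rank_frequency(words))
```

Real texts only follow the law approximately, so the fitted gradient will deviate from -1 to some degree.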

This general statement requires some elaboration:

The straight line in a double-logarithmic scale means that the probability for the item ranked at nr. i equals:

  p(i) = C / i

where C is a constant depending on the number of items N, and it is defined by the fact that the sum of all probabilities has to equal 1. Thus, if a quantity can assume a well-defined number of values, and it strictly obeys Zipf's law, its entropy can be predicted exactly.

Following is a table which illustrates this. The first column gives the number of possible values. The second gives the maximum entropy, if all probabilities are equal. The third column gives the entropy if the quantity exactly obeys Zipf's law. For example, the value 26 represents the number of letters in the (Latin) alphabet. If they are all equally frequent (which they are not), the character entropy would equal 4.700. If they exactly followed Zipf's law (which is also not true, but certainly closer to the truth), the character entropy would equal 3.929. The table has been set up for reasonable values of alphabet size, number of digraphs and number of words in a text.


  Number     H(max)   H(Zipf)
 
    16       4.000     3.403
    17       4.087     3.470
    18       4.170     3.532
    19       4.248     3.591
    20       4.322     3.647
    21       4.392     3.700
    22       4.459     3.750
    23       4.524     3.798
    24       4.585     3.844
    25       4.644     3.887
    26       4.700     3.929
    27       4.755     3.969
    28       4.807     4.008
    29       4.858     4.045
    30       4.907     4.081
    31       4.954     4.116
    32       5.000     4.149
    33       5.044     4.181
    34       5.087     4.213
    35       5.129     4.243
    36       5.170     4.273
    37       5.209     4.301

   100       6.644     5.310
   200       7.644     5.986
   500       8.966     6.851
  1000       9.966     7.489
  2000      10.966     8.115
  5000      12.288     8.927
10,000      13.288     9.532
20,000      14.288    10.130
50,000      15.610    10.911
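The values in the table can be recomputed with a short script. This is a sketch that follows the definitions given above: C is fixed by requiring the probabilities C / i to sum to one, and the entropy is then computed from those probabilities. The function names are assumptions introduced for this example.

```python
import math

def zipf_probabilities(n):
    """Probabilities p(i) = C / i for i = 1..n, normalised to sum to one."""
    c = 1 / sum(1 / i for i in range(1, n + 1))
    return [c / i for i in range(1, n + 1)]

def entropy(probs):
    """Entropy in bits of a discrete distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def h_max(n):
    """Maximum entropy for n equally probable values."""
    return math.log2(n)

def h_zipf(n):
    """Entropy for n values that exactly obey Zipf's law."""
    return entropy(zipf_probabilities(n))

# Print a few rows in the same layout as the table.
for n in (16, 26, 100, 1000):
    print(f"{n:6d}   {h_max(n):8.3f}  {h_zipf(n):8.3f}")
```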

(To be completed)

1.4 Cluster analysis

Cluster analyses have been applied in order to find out more about the Currier languages. Typically, this method requires that a quantitative attribute is found for each page of the MS, consisting of a number, or rather a set of numbers. Next, it requires the definition of a 'distance' function which takes any pair of attributes and computes a distance value which should be low if the attributes are similar and high if they are dissimilar. Such quantitative values and their distances can be based on single characters, digraphs or words.

The most difficult task is then to decide, on the basis of the square matrix of distance values, which pages are similar (e.g. written in the same language) and which are not.
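The first two steps can be sketched as follows. The choices made here are assumptions for illustration only: each page is described by its relative single-character frequencies, the distance is the ordinary Euclidean distance, and the three 'pages' are made-up toy strings. The sketch stops at the distance matrix and does not attempt the grouping itself.

```python
import math
from collections import Counter

def char_profile(text, alphabet):
    """Relative frequency of each character of `alphabet` in `text`."""
    counts = Counter(c for c in text if c in alphabet)
    total = sum(counts.values())  # assumed non-zero: the page is not empty
    return [counts[c] / total for c in alphabet]

def euclidean(u, v):
    """Distance that is low for similar profiles and high for dissimilar ones."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def distance_matrix(profiles):
    """Square matrix of distances between every pair of profiles."""
    return [[euclidean(u, v) for v in profiles] for u in profiles]

# Hypothetical toy pages, keyed by made-up page names.
pages = {"p1": "aaab aab", "p2": "aab aaab", "p3": "bbba bba"}
alphabet = "ab"
names = sorted(pages)
matrix = distance_matrix([char_profile(pages[n], alphabet) for n in names])
```

In this toy case p1 and p2 have identical profiles and are at distance zero, while p3 lies far from both. Profiles based on digraphs or words would only change char_profile.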

1.5 Other tools

To be included:


Copyright René Zandbergen, 2002
Latest update: 2002/10/03