Currier A and B: two different languages?


Capt. Prescott Currier proposed that the pages of the Voynich Manuscript are written in two different 'languages', which he called 'A' and 'B' (1). Whether these are in fact two different languages is an open question. One alternative is that the MS is in a single language and the differences are due to different dialects, or even just to different subject matter. A third possibility exists if the Voynich text is in code. In that case the differences may still reflect one of the above differences in the plaintext, or else the encryption method leaves some freedom and different personal styles are responsible for the difference.

An important aspect is the fact that the pages written in language A also show a different hand than the ones written in language B. One may assume they were either written by at least two different people, or by one person at different times.

If one accepts the spaces in the Voynich manuscript as separators (for words, syllables or anything else), the text may be broken up into small character groups, which will be called words or word tokens in the following. When two words are the same, we will say that they are of the same word type.

In the following, the word types in the A and B sections will be studied in order to draw some conclusions about the differences between the two 'languages'. Some of the results shown here have also been deduced from other statistics and presented by members of the early Voynich Manuscript mailing list. A similar study was also made by Mary D'Imperio, looking at the single character frequency distribution (2).

Frequent words

The most common word type in the A corpus is daiin or in Currier notation: 8AM (3). In the B corpus, the most frequent word is chedy, or in Currier: SC89. A peculiar fact is that the latter word does not occur at all in the A corpus, whereas daiin is relatively frequent in B as well. In general, words tend to either occur in both languages or they occur in 'B' only. There is a very short list of moderate-frequency words that occur only in language A. Had the split between A and B languages coincided with the apparent split in subject matter (apparent from the illustrations), this would make sense. This is not the case, however, since the herbal illustrations have text in both languages, so we do have a reason to question whether the text is related to the illustrations at all.
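The vocabulary comparison described above (which word types occur in both corpora, and which in only one) can be sketched as follows. This is a minimal illustration, not the program used for the analysis: the function name `split_vocabulary` is an assumption, and the token lists are toy data rather than real transcription counts.

```python
from collections import Counter


def split_vocabulary(tokens_a, tokens_b):
    """Partition word types into A-only, B-only and shared, with token counts."""
    counts_a, counts_b = Counter(tokens_a), Counter(tokens_b)
    only_a = {w: n for w, n in counts_a.items() if w not in counts_b}
    only_b = {w: n for w, n in counts_b.items() if w not in counts_a}
    # For shared types keep both counts: (count in A, count in B).
    shared = {w: (counts_a[w], counts_b[w]) for w in counts_a if w in counts_b}
    return only_a, only_b, shared


# Toy token lists in Currier notation, NOT real transcription data.
corpus_a = ["8AM", "SOE", "8AM", "SOR"]
corpus_b = ["SC89", "8AM", "SC89", "ZC89"]

only_a, only_b, shared = split_vocabulary(corpus_a, corpus_b)
print("A only:", only_a)
print("B only:", only_b)
print("shared:", shared)
```

Sorting the A-only and B-only dictionaries by count would give the kind of short lists of corpus-specific words mentioned in the text.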

Correlation of text on all pages

The data used in the analysis was taken from the raw FSG transcription available at Jim Reeds' web site (4). The text has been converted to the Currier alphabet (which does not alter the statistics) and split into single files per page, using Th. Petersen's page numbering scheme (nos. 1 to 234) (5). For each page, all words were sorted alphabetically. The correlation between two pages was defined as the number of word tokens common to both pages, i.e. if any word types occurred several times on one page, each occurrence was counted. The following example may explain this more clearly:

Page 1:   Ape Ape Bear Cat Cat Cat
Page 2:   Ape Ape Ape  Boar Cat

The number of common words is three: two times Ape and one Cat. Obviously, the number of common words depends heavily on the number of words on each page. Since the number of words per page is highly variable (and correlated with the language used, B pages being much more verbose), a normalisation factor needs to be used. This factor was chosen as a constant divided by the square root of the product of the number of words on the two pages being compared. This may not be a perfect method, and suggestions for finding a better 'rule' would be appreciated.
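The counting rule and the normalisation described above can be sketched in Python. This is a minimal sketch, not the original program: the function names `common_tokens` and `page_correlation` are assumptions, and so is the default value of the constant (`scale=1.0`), which the text does not specify.

```python
from collections import Counter
from math import sqrt


def common_tokens(page_a, page_b):
    """Number of word tokens shared by two pages (multiset intersection)."""
    counts_a = Counter(page_a)
    counts_b = Counter(page_b)
    # Counter & Counter keeps, for each word type, the smaller of the two counts.
    return sum((counts_a & counts_b).values())


def page_correlation(page_a, page_b, scale=1.0):
    """Common-token count multiplied by scale / sqrt(len(a) * len(b)).

    Returns None if either page has no words (e.g. missing from the
    transcription); `scale` stands in for the unspecified constant.
    """
    if not page_a or not page_b:
        return None
    return scale * common_tokens(page_a, page_b) / sqrt(len(page_a) * len(page_b))


page1 = "Ape Ape Bear Cat Cat Cat".split()
page2 = "Ape Ape Ape Boar Cat".split()

print(common_tokens(page1, page2))     # two Ape and one Cat
print(page_correlation(page1, page2))  # 3 / sqrt(6 * 5)
```

With this normalisation, a page compared with itself scores exactly `scale`, whatever its length, which is one reason to prefer the square root of the product over, say, the sum of the two page lengths.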

In the following figure, the correlation between any pair of pages is displayed in colour. Black means no words were available (page missing in FSG transcription). The colour scale is (low to high correlation): blue-green-yellow-orange-red-magenta. Page 1 (f1r) is on the top left of the graph, while page 234 (f116r) is at the bottom right.

Discussion of the most obvious features

The most prominent features are some bright squares of highly-correlated pages, in the lower right corner of the figure. The brightest square (magenta and red) is the biological section, which is known to be the most consistent piece of text (in B language). We may call this variety Bio-B.

The square in the corner represents the recipes (stars) section, also in B language. It shows a little more variety, especially when looking at the correlation between the stars and biological section. Since it has been suggested that the stars section may have one paragraph for every day of the year, one might venture to think that there is a seasonal effect to be observed here. To test this, the same figure has been made, not basing statistics on all words of the page, but the first 200 only, thus eliminating the need for the uncertain normalisation factor. The result is given below.

From this figure it seems more likely that the observed feature is page-related. The section comprises pages 212 through 234 (where 235 is f116v, not included), or ff. 103-116 (with 109 and 110 missing). The pages that also appear to be in Bio-B 'dialect' are those of ff. 103, 107, 108, 111, 112 and 116, which are three bifolia. The other three bifolia with ff. 104, 105, 106, 113, 114 and 115 use a more varied vocabulary, but are not more correlated with A-language pages.

The astronomical and cosmological sections are not well represented in the FSG transcription, and no conclusion can be drawn about them. The herbal section (before the biological section) and the mixed herbal/pharmaceutical section (between the biological and stars section), show the expected checkerboard pattern for A and B pages. There is no clear evidence for any pages in a third language, which would be visible as a set of pages with low correlation to both A and B pages. There is, however, one other interesting observation to be made: pages 183-185 (all in <f89>) show a correlation with both A and B pages.


The next step is to quantify more precisely in what way the A and B languages differ from each other, and what are the differences or commonalities between the various 'types' of B language for the different sections. Herbal-B and Bio-B are quite different and half the recipes section is different again. While for the biological and stars sections there is a clear correspondence between the illustrations and the 'language', in the herbal and pharma sections this is not the case. Based on these findings, the following page families or clusters may be tentatively defined:

  1. Herbal-A
  2. Pharma-A
  3. Herbal-B
  4. Stars-B (low correlation with Bio-B)
  5. Stars-Bio (high correlation with Bio-B)
  6. Biological-B

The subdivision of all pages in these clusters was based on the illustration indications in Jim Reeds' checklist, which is essentially derived from a table in D'Imperio (6). Note that apparently for all pharmaceutical pages the language has been identified as A. Additionally, the split in the recipes section was made on the basis of the above figures.

In order to quantify the differences between all these dialects, as we may call them, for these six clusters the most frequent words are listed below (using the Currier transcription), with absolute counts (and the total number of counted words per cluster given in the heading):

 Herbal-A   Pharma-A     Herbal-B   Stars-B     Stars-Bio    Bio-B
 (7975)     (2234)       (3335)     (5251)      (5483)       (6696)
 423 8AM    116 8AM      88 8AM     107 AM      136 4OFCC9   254 ZC89
 224 SOE     49 SOE      63 SC89    104 8AM     125 4OFAM    214 SC89
 154 SOR     37 SCOE     56 OR       85 4OFAM   116 SC89     194 4OFAM
 101 ZOE     36 8AE      53 S89      81 SC89    102 4OFCC89  190 OE
  98 Q9      32 OE       50 8AR      66 OPAM    101 8AM      164 4OFC89
  95 S9      27 SC9      41 AM       66 OFAM     99 AM       148 4OFCC89
  94 ZO      26 OFCOE    41 4OFC89   66 AR       98 OFAM     130 4OE
  93 89      25 AM       40 AR       60 AE       93 SC9      112 8AM
  88 2       24 4OFCOE   36 ZC89     43 4OPAM    83 ZC89     108 4OFAE
  78 8AN     23 OFCC9    36 SX9      41 SCC9     77 ZC9       96 SC9
  75 8AR     23 2AM      35 OFAM     39 SOE      72 OFCC9     95 ZC9
  63 ZOR     22 SOR      31 OFAR     39 OE       65 AR        78 4OFCC9
  61 Z9      21 OR       31 OE       36 ZC89     62 OE        76 8AE
  56 SC9     21 4OFCC9   30 89       35 4OFCC89  62 AE        73 2AM
  52 QOE     20 SCOR     28 4OFAR    34 SCO      57 OPAM      72 8AR
  51 OR      20 SCC9     26 OFC89    34 SC9      52 2AM       61 4OF9
  50 4OPS9   20 8AR      26 2AM      34 RAM      46 OPCC9     60 OR
  48 8AE     19 ZC9      25 8AE      33 FAM      43 OPC89     57 ESC89
  44 OE      17 ZCOE     24 OFAE     32 OR       42 4OFC89    56 OFAM
  43 8OE     17 SCO89    22 OPC89    31 8AR      38 RAM       53 4OFAN
  42 QOR     16 8OE      20 SCF9     30 OPAR     38 OPAE      53 2OE
  39 ZC9     16 4OFOE    20 4OFAM    27 2AM      38 FAM       51 4OPC89
  39 4OFS9   16 4OFCO89  19 OPAR     26 EFAM     38 ESC89     49 OPC89
  38 OP9     15 XC9      18 SC9      26 4OPAR    38 EFAM      49 89
  35 SAM     14 ZCC9     18 FAR      26 4OFSC89  37 SX9       45 4OFC9
  35 8AJ     14 8AT      17 Z89      25 S89      37 OFCC89    42 SX9
  34 SCOR    14 89       17 OF9      25 4OFAR    37 4OFC9     42 OFC89
  34 QC9     12 OFOE     17 8AJ      24 SCOE     36 OPCC89    42 4OFAR
  34 8OR     10 SO89     16 OFCC89   24 SCO89    36 4OFAE     41 ZCC89
  33 SO      10 SCO      15 OPAE     24 8AT      35 SOE       40 SCC9
  33 OF9     10 OFC9     15 FS89     23 4O8AM    34 SCC9      39 OPAM
  33 2AM     10 9FCC9    15 9FAR     22 OFAR     32 ZCC9      39 AM
  30 SO89    10 8AJ      15 4OFS89   21 SC8AM    32 SCOE      36 4OPCC89
  29 OPS9    10 4OFC9    14 O89      21 OPAE     31 OFC89     34 4OPAM
  29 OPOE    10 2        13 SOE      21 OFAE     30 4OPAM     33 OFCC89
  29 FS9      9 ZOE      13 SO89     21 AT       26 OFAE      32 ZCC9
  28 4OP9     9 QC9      13 ORAM     21 4OFCC9   26 FCC89     31 ZX9
  27 X9       9 OFAE     13 OPAM     20 OPCC89   25 EFCC89    30 SCOE
  27 OPAM     9 OEAM     13 FC89     19 ZCC9     23 OR        30 4OPAE
  27 OFAM     9 FOE      12 PC89     18 ORAM     23 EFCC9     29 8OE
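
Lists of this kind can be produced with a simple frequency count per cluster. The following is a minimal sketch, assuming each cluster is available as a flat list of word tokens; the function name `top_words` and the toy data are illustrative and are not taken from the analysis.

```python
from collections import Counter


def top_words(tokens, n=40):
    """Return (total token count, [(word, count), ...]) for the n most frequent types."""
    counts = Counter(tokens)
    return len(tokens), counts.most_common(n)


# Toy cluster in Currier notation, NOT real transcription data.
cluster = ["8AM"] * 3 + ["SOE"] * 2 + ["SOR"]

total, ranking = top_words(cluster, n=2)
print(f"({total})")
for word, count in ranking:
    print(f"{count:4d} {word}")
```

Dividing each count by the cluster total would give relative frequencies, which are easier to compare across clusters of different size than the absolute counts.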

There is a great deal of information contained in these lists, and the following tentative conclusions may be drawn:

Suggestions for further study

The interpretation of the above data is highly incomplete. To make interpretation of the figure more easily possible, it has been redrawn with the pages sorted 'per cluster' (not included here). It shows that the Herbal-A pages located after the biological section are rather different from the early Herbal-A pages, a feature that should be investigated further.

The experiment should be repeated for texts in known languages. Unfortunately, it is not easy to obtain a page-oriented text comparable to what we have for the Voynich manuscript. Instead, different texts should be cut into sections of reasonable length. This could be done for various books of the Vulgate Bible, where the language is the same throughout but the subject matter can be very different. Based on the result, other experiments could follow, e.g. comparing Latin with French (which share the same most frequent word, et).
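Cutting a known-language text into pseudo-pages could be done along the following lines. This is only a sketch under stated assumptions: the function name `split_into_sections`, the `section_length` parameter and the choice to drop a shorter final section are all illustrative, since the text does not fix any of them.

```python
def split_into_sections(tokens, section_length):
    """Cut a token list into consecutive sections of exactly section_length tokens.

    Assumption: a shorter final section is dropped so that all sections
    have the same length and can be compared without normalisation.
    """
    if section_length <= 0:
        raise ValueError("section_length must be positive")
    full_sections = len(tokens) // section_length
    return [
        tokens[i * section_length:(i + 1) * section_length]
        for i in range(full_sections)
    ]


# Opening words of Genesis in the Vulgate, lower-cased.
text = "in principio creavit deus caelum et terram".split()
sections = split_into_sections(text, 3)
print(sections)
```

Each resulting section can then be treated exactly like a manuscript page in the correlation analysis.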

It should be very interesting to verify the distribution of certain words over the pages of the Manuscript (e.g. the distribution of all -C89 (-edy) words in the B sections or that of 4OFCC9 (qokeey) in Stars-Bio).
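A per-page distribution of this kind could be computed as sketched below. The function name `suffix_share_per_page`, the page identifiers and the token lists are hypothetical; the sketch assumes the transcription is held as a mapping from page to list of word tokens.

```python
def suffix_share_per_page(pages, suffix):
    """Map each non-empty page to the fraction of its tokens ending in `suffix`."""
    return {
        page: sum(word.endswith(suffix) for word in tokens) / len(tokens)
        for page, tokens in pages.items()
        if tokens  # skip pages with no transcribed words
    }


# Toy pages in Currier notation, NOT real transcription data.
pages = {
    "page_x": ["SC89", "ZC89", "8AM", "OE"],
    "page_y": ["8AM", "SOE"],
    "page_z": [],
}

shares = suffix_share_per_page(pages, "C89")
for page, share in shares.items():
    print(page, share)
```

Plotting these fractions in page order would show whether the -C89 words are spread evenly over the B pages or concentrated in particular quires or bifolia.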

Furthermore, the missing sections need to be looked at as well. At first glance, they seem more B-like. Additionally, specific tests for the labels and scattered writing in the Manuscript could be added. This will only be possible when the complete transcription is available.


(1) For his papers, see here (main text only) or here (all).
(2) Mary E. D'Imperio: An Application of Cluster Analysis and Multiple Scaling to the Question of 'Hands' and 'Languages' in the Voynich Manuscript, 28 January 1992.
(3) For the Currier transcription alphabet, see here.
(4) Jim Reeds: Voynich Manuscript. WWW pages (mirror by Jorge Stolfi).
(5) For the Petersen page numbers see here.
(6) D'Imperio (1978).


Copyright René Zandbergen, 2016
