
Special Topics: Analysis of the text

Definition of the "Cuva" analysis alphabet


In general, text analyses are made difficult by our uncertainty about what constitutes a single character in the Voynich script, and in fact such analyses may help to answer that question. The Eva transliteration alphabet, which has been defined elsewhere, was specifically designed to allow a transliteration of the text of the Voynich MS in a simple but consistent form. This design does not make it suitable for text analyses: it represents some Voynich character shapes that are most probably meant to be single characters by two or even more transliteration characters (for example 'ch' and 'iin'). Note that the Currier alphabet quite probably has the opposite problem.

In order to do some types of text analyses (in particular those where it is important to identify single characters), an alphabet derived from Eva will be used, so that it can be applied to transliterations based on Eva. While there are many possible choices, and it is worth experimenting with different options, for consistency I will here use a single example. It is in a way a mixture of Eva and Currier, and I will therefore refer to it as 'Cuva'.

Note that this is not intended to represent a good transliteration of the Voynich MS text. It is made purely for reasons of text analysis.

Cuva alphabet table

Currier  Eva    Cuva    Currier  Eva    Cuva    Currier  Eva    Cuva
4        q      H       Y        cfh    FS      K        im     IJ
O        o      O       A        a      A       L        iim    NJ
8        d      D       C        e      E       5        iiim   MJ
9        y      Y       I        i      I       6        g      G
2        s      C       G        il     IL      7        j      Q
E        l      L       H        iil    NL      -        x      X
S        ch     S       1        iiil   ML      -        v      V
Z        sh     Z       T        ir     IR      R        r      R
P        t      T       U        iir    NR      -        b      B
B        p      P       0        iiir   MR      -        u      A
F        k      K       D        n      I       -        z      J
V        f      F       N        in     N       CC       ee     U
Q        cth    TS      M        iin    M       CCC      eee    UE
W        cph    PS      3        iiin   NN      CCCC     eeee   UU
X        ckh    KS      J        m      J
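
As an illustration of how the table above could be applied, the following is a minimal sketch of an Eva-to-Cuva conversion. The greedy longest-match strategy and the names `EVA_TO_CUVA` and `eva_to_cuva` are assumptions made for this example; the conversion procedure actually used for the counts on this page is not specified.

```python
# Eva -> Cuva mapping, transcribed from the table above.
EVA_TO_CUVA = {
    "q": "H", "o": "O", "d": "D", "y": "Y", "s": "C", "l": "L",
    "ch": "S", "sh": "Z", "t": "T", "p": "P", "k": "K", "f": "F",
    "cth": "TS", "cph": "PS", "ckh": "KS", "cfh": "FS",
    "a": "A", "e": "E", "i": "I",
    "il": "IL", "iil": "NL", "iiil": "ML",
    "ir": "IR", "iir": "NR", "iiir": "MR",
    "n": "I", "in": "N", "iin": "M", "iiin": "NN",
    "m": "J", "im": "IJ", "iim": "NJ", "iiim": "MJ",
    "g": "G", "j": "Q", "x": "X", "v": "V", "r": "R", "b": "B",
    "u": "A", "z": "J",
    "ee": "U", "eee": "UE", "eeee": "UU",
}

# Length of the longest Eva sequence in the table; bounds the search.
MAX_LEN = max(len(key) for key in EVA_TO_CUVA)


def eva_to_cuva(word: str) -> str:
    """Convert one Eva word to Cuva (assumed strategy: longest match first)."""
    out = []
    pos = 0
    while pos < len(word):
        # Try the longest possible Eva sequence first, then shorter ones.
        for size in range(min(MAX_LEN, len(word) - pos), 0, -1):
            chunk = word[pos:pos + size]
            if chunk in EVA_TO_CUVA:
                out.append(EVA_TO_CUVA[chunk])
                pos += size
                break
        else:
            # Symbol not in the table (for example '?'): keep it unchanged.
            out.append(word[pos])
            pos += 1
    return "".join(out)


print(eva_to_cuva("daiin"))  # DAM
print(eva_to_cuva("chedy"))  # SEDY
```

Trying the longest sequence first ensures that, for example, 'iin' is converted as one unit rather than as 'i' followed by 'in'.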

How many characters, how many words?


The most elementary statistics of the Voynich MS text are the number of characters and the number of words in the MS. No authoritative figures for these have been published, but they will be derived here from the available transliterations. Both numbers are uncertain, because of the uncertainty in the definition of the character set in the MS, and the uncertainty about word spaces. Both statistics can be derived from transliteration files, the completeness of which has been reported elsewhere. The GC and ZL files are the most suitable for this purpose.

How many characters?

The ZL transliteration is complete, but the Eva alphabet in which it has been expressed is not suitable for counting characters. We will come back to it below. The GC transliteration is 99.6% complete. It lacks only 22 loci, all of type "L" (label or single word). The v101 transliteration alphabet it uses has been designed to represent what its designer considered single characters by single transliteration characters. Counting the symbol for unknown characters ("?") as a single character, and not counting certain or uncertain spaces, this transliteration includes 158,959 characters. For the 22 missing labels we may add an estimated 130 characters, bringing the estimated total to 159,089, which we may round to 159,100.

To count characters in the ZL transliteration file, all alternate readings are resolved to the first option, and ligature brackets { } are removed. The character count using Eva is 194,570. In this count, all unreadable characters are counted as one, while unknown sequences of unreadable characters have been counted as three. In any case, this number is not particularly useful. To obtain a better count, the text may be converted to the Cuva alphabet as introduced above, while deleting all 'quote' symbols. In this case, the count of characters becomes 166,232.
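
The preprocessing described above can be sketched as follows. This is only an illustration under stated assumptions: alternate readings are assumed to be written as `[a:o]`, certain and uncertain spaces as `.` and `,`, and the input is assumed to be bare locus text without inline comments. The function name `count_characters` and the sample line are hypothetical.

```python
import re


def count_characters(text: str) -> int:
    """Count transliteration characters in one piece of locus text."""
    # Resolve alternate readings to the first option (assumed syntax: [a:o]).
    text = re.sub(r"\[([^:\]]*):[^\]]*\]", r"\1", text)
    # Remove ligature brackets, keeping the characters inside them.
    text = text.replace("{", "").replace("}", "")
    # Drop certain (.) and uncertain (,) spaces; these markers are assumptions.
    text = text.replace(".", "").replace(",", "")
    # Every remaining symbol, including each '?', counts as one character.
    return len(text)


# Hypothetical sample line, not taken from any transliteration file.
sample = "qo{ck}hy.d[a:o]iin,ol"
print(count_characters(sample))  # 13
```

Because each '?' is counted individually, a single unreadable character written as '?' counts as one and a sequence written as '???' counts as three, in line with the rule stated above.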

The difference between the two numbers is rather large: about 7,000, which is just over 4%. This is primarily due to the fact that all pedestalled gallows characters in the GC transliteration are counted as one character, while in the ZL transliteration they are counted as two or more. Since we do not know which of the two is correct, the range of 160,000 to 165,000 may be considered our present best guess.
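
The arithmetic behind these figures can be checked with a few lines. The variable names are chosen for this example only; the numbers are those quoted in the text.

```python
gc_counted = 158_959            # characters counted in the GC file
missing_labels_estimate = 130   # estimate for the 22 missing labels
gc_total = gc_counted + missing_labels_estimate

zl_cuva_total = 166_232         # ZL file converted to Cuva

difference = zl_cuva_total - gc_total
print(gc_total)    # 159089
print(difference)  # 7143

# The percentage depends slightly on which total is used as the reference.
print(f"{100 * difference / zl_cuva_total:.1f}% of the ZL (Cuva) count")
print(f"{100 * difference / gc_total:.1f}% of the GC estimate")
```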

How many words?

This question concerns the number of word tokens. That is, if any word occurs 100 times in the MS, it is also counted as 100. The count is made difficult by the uncertainty of word spaces. Both the GC and ZL files indicate "certain" and "uncertain" spaces, so we can obtain two counts for each file. Since the results show a significant difference, we will also use the TT transliteration file for this purpose. As shown elsewhere, this file includes 'only' 96.8% of all loci, but the missing loci tend to be short (mostly single words), so the statistics will be only marginally affected by the roughly 200 missing words (which may be expected to be rare words). The TT file does not indicate "uncertain" spaces.
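
A minimal sketch of how the two token counts per file can be obtained is given below. It assumes that certain spaces are written as `.` and uncertain spaces as `,`; the function name `split_words` and the sample line are hypothetical.

```python
def split_words(text: str, count_uncertain: bool) -> list[str]:
    """Split locus text into word tokens.

    Assumed markers: '.' is a certain space, ',' is an uncertain space.
    """
    if count_uncertain:
        # Treat uncertain spaces as real word breaks.
        text = text.replace(",", ".")
    else:
        # Ignore uncertain spaces: the neighbouring parts form one word.
        text = text.replace(",", "")
    return [word for word in text.split(".") if word]


# Hypothetical sample line, not taken from any transliteration file.
line = "chol,daiin.chol.daiin.shol,daiin"
print(len(split_words(line, count_uncertain=True)))   # 6
print(len(split_words(line, count_uncertain=False)))  # 4
```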

To count words, we should exclude loci of which we may argue that they do not represent words, but rather single characters, which typically appear in sequences. The exclusion criteria are listed here, for transparency, and so that others may repeat these counts. The following table lists the loci that have been excluded in the counts, and the reasons for this.

Fol.   Loci     Nr.   Reason                                                  In GC / TT file?
f1r    all Lx   3     Individual characters in the margin, in a later hand    No / 0
f11v   5,*L0    1     Two characters in the left margin                       No / 0
f17r   @Lx      1     Marginal writing not counted                            No / 0
f49v   All L0   26    Individual characters                                   Yes / 26
f57v   3,@Cc    1     Individual characters                                   Yes / 1
f57v   5,@Cc    (1)   Only the part that consists of single characters        Yes / (1)
f66r   16-49    34    Single characters                                       Yes / 34
f75v   All L0   6     Single characters                                       Yes / 6
f76r   All L0   9     Single characters                                       Yes / 9

The following table gives the different counts:

Transliteration   Counting uncertain spaces   Ignoring uncertain spaces
GC                40,530                      38,071
ZL                38,805                      36,072
TT                -                           36,940

The differences are rather large. They are not explained by the different transliteration alphabets used, but by different interpretations of what constitutes a word space. GC "sees" more spaces than ZL and TT, and the complete set of "certain" spaces in the GC transliteration is roughly the same as the combination of "certain" and "uncertain" spaces in ZL. Overall, GC has about 5% more words than ZL in both cases. As an indicative number, one may say that the Voynich MS includes roughly 37,000 - 39,000 word tokens (words).

How many different words?

This question concerns the number of word types. That is, if any word occurs 100 times in the MS, it counts as 1. The number is again computed for the same cases as above.

Transliteration   Counting uncertain spaces   Ignoring uncertain spaces
GC                9,814                       10,553
ZL                8,412                       9,467
TT                -                           8,545

If one considers the uncertain spaces as word spaces, then, of course, there are more word tokens in the transliteration than if one counts only the certain ones, as already observed above. Interestingly, however, in that case there are at the same time fewer word types (different words). This rather surprising result is found both for the GC transliteration and the ZL transliteration. Effectively, it means that the indicated "uncertain" spaces in both files tend to split longer words into shorter words that appear with some frequency.
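
The effect can be reproduced on a small constructed example. As before, this sketch assumes that `.` marks a certain space and `,` an uncertain one; the function name `word_counts` and the sample line are hypothetical and serve only to show the mechanism.

```python
from collections import Counter


def word_counts(text: str, count_uncertain: bool) -> Counter:
    """Count occurrences of each word (assumed: '.' certain, ',' uncertain)."""
    # Either promote uncertain spaces to word breaks, or drop them entirely.
    text = text.replace(",", "." if count_uncertain else "")
    return Counter(word for word in text.split(".") if word)


# Hypothetical sample line, not taken from any transliteration file.
line = "chol,daiin.chol.daiin.shol,daiin"

for count_uncertain in (True, False):
    counts = word_counts(line, count_uncertain)
    tokens = sum(counts.values())  # every occurrence counts
    types = len(counts)            # every distinct word counts once
    print(count_uncertain, tokens, types)
# True 6 3
# False 4 4
```

Counting the uncertain spaces splits 'choldaiin' and 'sholdaiin' into parts that already occur elsewhere, so the number of tokens goes up while the number of types goes down.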

Again, the numbers are quite different. In this case, GC has well over 10% more word types than ZL. This larger difference is caused by the transliteration alphabet. As a typical example, GC distinguishes several different forms of the Sh character, giving rise to a larger number of different words. A representative number of word types may be 9,000 - 10,000.
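
For reference, the relative differences between GC and ZL can be computed directly from the two tables. The helper name `excess_percent` is chosen for this example only.

```python
def excess_percent(gc: int, zl: int) -> float:
    """Percentage by which the GC count exceeds the ZL count."""
    return 100 * (gc - zl) / zl


# (GC, ZL) pairs taken from the tables above.
counts = {
    "tokens, counting uncertain spaces": (40_530, 38_805),
    "tokens, ignoring uncertain spaces": (38_071, 36_072),
    "types, counting uncertain spaces": (9_814, 8_412),
    "types, ignoring uncertain spaces": (10_553, 9_467),
}

for label, (gc, zl) in counts.items():
    print(f"{label}: GC exceeds ZL by {excess_percent(gc, zl):.1f}%")
```

The token counts differ by roughly 4.4% and 5.5%, the type counts by roughly 16.7% and 11.5%.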



Copyright René Zandbergen, 2019
Latest update: 25/05/2019