
Special Topics: Analysis of the text

Definition of the "Curva" analysis alphabet


In general, text analyses are made difficult by our uncertainty about what constitutes a single character in the Voynich script, and in fact such analyses may help in answering that question. The Eva transcription alphabet was specifically designed to allow a transcription of the text of the Voynich MS in a simple but consistent form. This design does not make it suitable for text analyses: it represents some Voynich character shapes that are most probably meant to be single characters by two or even more transcription characters (for example 'ch' and 'iin'). Note that the Currier alphabet quite probably has the opposite problem.

In order to do some types of text analysis (in particular those where it is important to identify single characters), an alphabet derived from Eva will be used for transcriptions based on Eva. While there are many possible choices, and it is worth experimenting with different options, for consistency I will use a single example here. It is in a way a mixture of Eva and Currier, and I will therefore refer to it as 'Curva'.

Note that this is not intended to represent a good transcription of the Voynich MS text. It is made purely for reasons of text analysis.

Curva alphabet table

Glyph  Currier  Eva   Curva   Glyph  Currier  Eva   Curva   Glyph  Currier  Eva   Curva
q      4        q     H       cFh    Y        cfh   FS      im     K        im    IJ
o      O        o     O       a      A        a     A       iim    L        iim   NJ
d      8        d     D       e      C        e     E       iiim   5        iiim  MJ
y      9        y     Y       i      I        i     I       g      6        g     G
s      2        s     C       il     G        il    IL      j      7        j     Q
l      E        l     L       iil    H        iil   NL      x      -        x     X
ch     S        ch    S       iiil   1        iiil  ML      v      -        v     V
Sh     Z        sh    Z       ir     T        ir    IR      r      R        r     R
t      P        t     T       iir    U        iir   NR      b      -        b     B
p      B        p     P       iiir   0        iiir  MR      u      -        u     A
k      F        k     K       n      D        n     I       z      -        z     J
f      V        f     F       in     N        in    N       ee     CC       ee    U
cTh    Q        cth   TS      iin    M        iin   M       eee    CCC      eee   UE
cPh    W        cph   PS      iiin   3        iiin  NN      eeee   CCCC     eeee  UU
cKh    X        ckh   KS      m      J        m     J

(A dash marks an empty Currier cell.)
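
The table can be applied mechanically. The following is a minimal sketch in Python, not the actual conversion tool used for the counts on this page. It assumes that the Eva text is in lower case, and it assumes a greedy longest-match rule (try four-character Eva groups first, then three, two and one), which the table itself does not specify. The names `EVA_TO_CURVA` and `to_curva` are made up for this illustration.

```python
# Eva -> Curva pairs copied from the table above (Eva written in lower case).
EVA_TO_CURVA = {
    "q": "H", "o": "O", "d": "D", "y": "Y", "s": "C", "l": "L",
    "ch": "S", "sh": "Z", "t": "T", "p": "P", "k": "K", "f": "F",
    "cth": "TS", "cph": "PS", "ckh": "KS", "cfh": "FS",
    "a": "A", "e": "E", "i": "I",
    "il": "IL", "iil": "NL", "iiil": "ML",
    "ir": "IR", "iir": "NR", "iiir": "MR",
    "n": "I", "in": "N", "iin": "M", "iiin": "NN",
    "m": "J", "im": "IJ", "iim": "NJ", "iiim": "MJ",
    "g": "G", "j": "Q", "x": "X", "v": "V", "r": "R", "b": "B",
    "u": "A", "z": "J",
    "ee": "U", "eee": "UE", "eeee": "UU",
}

# The longest Eva groups in the table ("iiim", "eeee", ...) have four characters.
MAX_LEN = max(len(key) for key in EVA_TO_CURVA)


def to_curva(eva_word: str) -> str:
    """Convert one Eva word to Curva by greedy longest match (an assumption)."""
    out = []
    pos = 0
    while pos < len(eva_word):
        for size in range(MAX_LEN, 0, -1):
            chunk = eva_word[pos:pos + size]
            if chunk in EVA_TO_CURVA:
                out.append(EVA_TO_CURVA[chunk])
                pos += len(chunk)
                break
        else:
            # Symbol not in the table (for example "?"): keep it unchanged.
            out.append(eva_word[pos])
            pos += 1
    return "".join(out)


for word in ("daiin", "chedy", "qokeedy", "cthy"):
    curva = to_curva(word)
    print(word, "->", curva, "(", len(word), "Eva /", len(curva), "Curva characters )")
```

With this mapping the word 'daiin' has five Eva characters but only three Curva characters, which is the kind of difference that matters when counting characters.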

How many characters, how many words?


The most elementary statistics of the Voynich MS text are the number of characters and the number of words in the MS. No authoritative figures for this have been published, but based on the available transcriptions, they will be derived here. Both numbers are uncertain, because of the uncertainty in the definition of the character set in the MS, and the uncertainty about word spaces. Both statistics can be derived from transcription files, the completeness of which has been reported here. The GC and ZL files are the most suitable for this purpose.

How many characters?

The ZL transcription is complete, but the Eva alphabet in which it has been expressed is not suitable for counting characters. We will come back to it below. The GC transcription is 99.6% complete. It lacks only 22 loci, all of type "L" (label or single word). The v101 transcription alphabet it uses has been designed to represent what its designer considered single characters by single transcription characters. Counting the symbol for unknown characters ("?") as a single character, and not counting certain or uncertain spaces, this transcription includes 158,959 characters. For the 22 missing labels we may add an estimated 130 characters, bringing the estimated total to 159,089, which we may round to 159,100.

To count characters in the ZL transcription file, all alternate readings are resolved to the first option, and ligature brackets { } are removed. The character count using Eva is 194,570. In this count, each unreadable character is counted as one, while each unknown sequence of unreadable characters is counted as three. In any case, this number is not particularly useful. To obtain a better count, the text may be converted to the Curva alphabet as introduced above, while deleting all 'quote' symbols. In this case, the count of characters becomes 166,232.
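
The preparation steps described above can be sketched as follows. This is only an illustration: the paragraph does not spell out the file syntax, so the sketch assumes that alternate readings are written as `[a:b]`, that `.` and `,` mark certain and uncertain spaces, and that the 'quote' symbol is the apostrophe. Spaces are left out of the count, as in the GC count. The function name `count_characters` is hypothetical, the conversion to Curva is not included, and comments or locus headers in a real file are not handled.

```python
import re

# Assumed notation (not stated in the text above): [first:second] for alternate readings.
ALTERNATE = re.compile(r"\[([^:\]]*):[^\]]*\]")


def count_characters(text: str, drop_quotes: bool = False) -> int:
    """Count transcription characters in one line of transliterated text."""
    # Resolve every alternate reading to its first option.
    text = ALTERNATE.sub(r"\1", text)
    # Remove ligature brackets but keep their contents.
    text = text.replace("{", "").replace("}", "")
    if drop_quotes:
        # Delete 'quote' symbols (assumed here to be apostrophes).
        text = text.replace("'", "")
    # Certain (.) and uncertain (,) spaces are assumed notation; neither is counted.
    for space in ".,":
        text = text.replace(space, "")
    return len(text)


sample = "d[a:o]{ch}y.qo'l,dy"  # made-up line, not a quotation from the MS
print(count_characters(sample))
print(count_characters(sample, drop_quotes=True))
```

Because every remaining symbol counts once, a single "?" adds one to the total and a sequence written as "???" adds three.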

The difference between the two numbers is rather large: about 7,000, which is just over 4%. This is primarily due to the fact that all pedestalled gallows characters in the GC transcription are counted as one character, while in the ZL transcription they are counted as two or more. Since we do not know which of the two is correct, the range of 160,000 to 165,000 may be considered our present best guess.
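
The arithmetic behind these figures can be checked in a few lines. The sketch below only uses the numbers quoted above; the variable names are made up.

```python
gc_estimate = 158_959 + 130   # GC count plus the estimate for the 22 missing labels
zl_curva = 166_232            # ZL count after conversion to Curva

difference = zl_curva - gc_estimate
print("GC estimate:", gc_estimate)
print("difference: ", difference)
print(f"{100 * difference / gc_estimate:.1f}% of the GC figure")
print(f"{100 * difference / zl_curva:.1f}% of the ZL figure")
```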

How many words?

This question concerns the number of word tokens. That is, if any word occurs 100 times in the MS, it is also counted as 100. The count is made difficult by the uncertainty of word spaces, but since both the GC and ZL files include 'certain' and 'uncertain' spaces, we can obtain two counts for each file, and see how different they are.
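
A minimal sketch of the two ways of counting, assuming that a certain space is written as `.` and an uncertain space as `,` (this notation is an assumption, as is the function name `count_tokens`):

```python
def count_tokens(line: str, uncertain_as_space: bool) -> int:
    """Count word tokens in one line of transliterated text."""
    if uncertain_as_space:
        # Treat an uncertain space like a certain one.
        line = line.replace(",", ".")
    else:
        # Ignore the uncertain space: the neighbouring pieces form one word.
        line = line.replace(",", "")
    # Empty strings (for example from a trailing space marker) are not words.
    return len([word for word in line.split(".") if word])


sample = "daiin.chol,dy.daiin"  # made-up line, not a quotation from the MS
print(count_tokens(sample, uncertain_as_space=True))
print(count_tokens(sample, uncertain_as_space=False))
```

Note that 'daiin' is counted twice in both cases, because tokens rather than distinct words are being counted.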

To count words, we should exclude loci that arguably do not represent words, but rather single characters, which typically appear in sequences. The following table lists the loci that have been excluded from the counts and the reason for each exclusion, for transparency and so that others may repeat these counts.

Fol.  Loci     Nr.   Reason                                                  In GC file?
f1r   all Lx   3     Individual characters in the margin, in a later hand.   No / 0
f11v  5,*L0    1     Two characters in the left margin                       No / 0
f17r  @Lx      1     Marginal writing not counted                            No / 0
f49v  All L0   26    Individual characters                                   Yes / 26
f57v  3,@Cc    1     Individual characters                                   Yes / 1
f57v  5,@Cc    (1)   Only the part that consists of single characters        Yes / (1)
f66r  16-49    34    Single characters                                       Yes / 34
f75v  All L0   6     Single characters                                       Yes / 6
f76r  All L0   9     Single characters                                       Yes / 9

The following table gives the different counts:

Transcription   Counting uncertain spaces   Not counting uncertain spaces
GC              40,530                      38,071
ZL              38,805                      36,072

Again, the difference is rather large. It is not explained by the different transcription alphabets used, but by different interpretations of what constitutes a word space. GC "sees" more spaces, and the complete set of "certain" spaces in the GC transcription is roughly the same as the combination of "certain" and "uncertain" spaces in ZL. Overall, GC has about 5% more words in both cases. As an indicative number, one may say that the Voynich MS includes roughly 37,000 - 39,000 word tokens.
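
The percentages can be reproduced from the table above. The sketch below only restates the four counts; the dictionary layout and key names are made up for illustration.

```python
# Word-token counts from the table above.
tokens = {
    "GC": {"with_uncertain": 40_530, "certain_only": 38_071},
    "ZL": {"with_uncertain": 38_805, "certain_only": 36_072},
}

for case in ("with_uncertain", "certain_only"):
    excess = 100 * (tokens["GC"][case] / tokens["ZL"][case] - 1)
    print(f"{case}: GC has {excess:.1f}% more word tokens than ZL")

# GC with certain spaces only versus ZL with certain and uncertain spaces.
gap = tokens["ZL"]["with_uncertain"] - tokens["GC"]["certain_only"]
print("ZL (all spaces) minus GC (certain spaces only):", gap)
```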

How many different words?

This question concerns the number of word types. That is, if any word occurs 100 times in the MS, it counts as 1. The number is again computed for the above four cases.

Transcription   Counting uncertain spaces   Not counting uncertain spaces
GC              9,814                       10,553
ZL              8,412                       9,467

If one considers the uncertain spaces as word spaces then, of course, there are more word tokens in the transcription than if one counts only the certain ones, as already observed above. Interestingly, however, in that case there are at the same time fewer word types (different words). This rather surprising result is found for both the GC and the ZL transcription. Again, the numbers for the two transcriptions are quite different: GC has well over 10% more word types than ZL (roughly 17% more when uncertain spaces are counted, and 11% more when they are not). This larger difference is caused by the transcription alphabet. As a typical example, GC distinguishes several different forms of the sh character, giving rise to a larger number of different words.
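
A toy example shows that more tokens together with fewer types is at least possible in principle: splitting at an uncertain space can turn a one-off long word into two pieces that already occur elsewhere. Whether this is what actually happens in the MS is not established here. The sketch assumes `.` for a certain space and `,` for an uncertain one, and the sample line is made up.

```python
def split_words(line: str, uncertain_as_space: bool) -> list:
    """Split one line of transliterated text into words."""
    # An uncertain space either becomes a word break or is dropped entirely.
    line = line.replace(",", "." if uncertain_as_space else "")
    return [word for word in line.split(".") if word]


sample = "chol,dy.chol.dy"  # made-up line, not a quotation from the MS

for uncertain_as_space in (True, False):
    found = split_words(sample, uncertain_as_space)
    # Tokens are all words; types are the distinct words among them.
    print(uncertain_as_space, "tokens:", len(found), "types:", len(set(found)))
```

In this made-up line, counting the uncertain space gives four tokens of two types, while ignoring it gives three tokens of three types.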

A representative number of word types may be 9,000 - 10,000.




Copyright René Zandbergen, 2017
Comments, questions, suggestions? Your feedback is welcome.
Latest update: 23/09/2017