Last edited on 1997-12-10 09:31:24 by stolfi
This note describes an intriguing decomposition of Voynichese words into three parts --- prefix, midfix and suffix.
The decomposition is based on a partition of the Voynichese (specifically, EVA) alphabet into two sets,
With these definitions, we find that almost every Voynichese word can be decomposed into a prefix and suffix composed entirely of soft letters, and a midfix or kernel composed entirely of hard letters. For instance, the popular word qoteedy is decomposed into prefix qo-, midfix -tee-, and suffix -dy.
Any of these three elements may be empty. When the midfix is empty (i.e. the word consists entirely of soft letters), the division into prefix and suffix is ambiguous; in that case I will call the whole word a unifix.
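The decomposition can be sketched in a few lines of Python. This is only an illustration, not the procedure from my notebook file. The list of hard units below is an assumption inferred from the examples in this note (the gallows k t p f, the benches ch sh, the platform gallows cth ckh cph cfh, and e); every other letter is treated as soft. The function names are hypothetical.

```python
# ASSUMPTION: this inventory of "hard" units is inferred from the examples
# in the note; it is not quoted from it. Longest units are listed first so
# that "cth" is tried before "ch", and "sh" before a lone soft "s".
HARD_UNITS = ("cth", "ckh", "cph", "cfh", "ch", "sh", "k", "t", "p", "f", "e")


def tokenize(word):
    """Split an EVA word into units, preferring the longest hard unit."""
    units = []
    i = 0
    while i < len(word):
        for unit in HARD_UNITS:
            if word.startswith(unit, i):
                units.append(unit)
                i += len(unit)
                break
        else:
            # Any other single letter (including a stray c or h) counts as soft.
            units.append(word[i])
            i += 1
    return units


def decompose(word):
    """Return (prefix, midfix, suffix).

    The midfix runs from the first hard unit to the last hard unit.
    An all-soft word (a unifix) is returned whole as (word, "", "").
    """
    units = tokenize(word)
    hard = [u in HARD_UNITS for u in units]
    if not any(hard):
        return (word, "", "")
    first = hard.index(True)
    last = len(hard) - 1 - hard[::-1].index(True)
    return ("".join(units[:first]),
            "".join(units[first:last + 1]),
            "".join(units[last + 1:]))


def has_anomalous_midfix(word):
    """True if a soft unit sits between the first and last hard units."""
    units = tokenize(word)
    hard = [u in HARD_UNITS for u in units]
    if not any(hard):
        return False
    first = hard.index(True)
    last = len(hard) - 1 - hard[::-1].index(True)
    return not all(hard[first:last + 1])


print(decompose("qoteedy"))
print(decompose("daiin"))
```

Under these assumptions, qoteedy splits into qo-, -tee-, -dy, and an all-soft word such as daiin comes back whole as a unifix.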
I have analyzed under this paradigm the words in the "biological" section, f75r--f84v. Below are the most common components of each class, and their counts. (The dots are not word spaces, but marks to highlight the fine structure discussed further on):
freq  prefix    freq  midfix    freq  suffix    freq  unifix
----  --------  ----  --------  ----  --------  ----  --------
1859  -          824  -k-       1728  -dy        186  ol
1296  qo-        588  -che-     1239  -y         126  qol
 607  o-         514  -she-      422  -aiin      106  daiin
 255  ol-        387  -kee-      254  -al         71  dal
 209  l-         354  -t-        245  -ol         64  dar
 108  y-         347  -ke-       157  -ar         56  saiin
  75  d-         179  -te-        86  -ain        55  or
  45  r-         121  -ch-        66  -or         50  sol
  36  qol-       113  -tee-       51  -d          48  dy
  29  s-         105  -shee-      36  -s          36  aiin
  23  q-          95  -chee-      28  -dar        28  dol
  21  sol-        83  -sh-        25  -dal        25  oly
  12  dy-         58  -pche-      21  -am         21  lol
   8  sal-        49  -chckh-     20  -           21  sal
   7  so-         38  -kch-       20  -aly        18  ar
   6  dal-        38  -p-         16  -a          18  iin
   6  olo-        33  -tche-      16  -l          18  raiin
   5  a-          31  -sheckh-    13  -oldy       17  sor
   5  dol-        28  -tch-       12  -daiin      15  al
   4  al-         27  -kche-      10  -air        15  sar
   4  lo-         26  -chcth-     10  -ary        14  s
   4  or-         25  -checkh-    10  -r          13  olor
   3  oqo-        25  -shckh-      9  -aldy       12  olol
   3  qod-        24  -shek-       7  -as         12  rol
   2  dl-         22  -kshe-       6  -ady        11  m
   2  do-         20  -ee-         6  -alor       11  ral
   2  lol-        17  -checth-     6  -dol        11  y
   2  olol-       17  -chek-       6  -dor        10  lor
   2  qoqo-       17  -tshe-       6  -oiin       10  oldy
   2  qor-        16  -pch-        6  -sdy        10  r
   2  rol-        15  -cth-        6  -sy          9  dain
   1  alo-        14  -cthe-       5  -o           9  olaiin
   1  aro-        14  -fche-       5  -oly         8  dam
   1  dar-        12  -chckhe-     4  -alol        8  ldy
   1  dor-        12  -shcth-      4  -an          8  ly
   1  ld-         11  -ckhe-       4  -dam         8  ory
   1  od-         11  -keee-       4  -m           8  qor
   1  odd-        10  -shckhe-     4  -ody         7  l
   1  oll-        10  -shecth-     3  -ay          7  orol
   1  oro-         9  -cheek-      3  -ydy         7  qoly
 ...  ...        ...  ...        ...  ...        ...  ...
----  --------  ----  --------  ----  --------  ----  --------
4666  TOTAL     4666  TOTAL     4666  TOTAL     1516  TOTAL
You can get my notebook file with the detailed procedures I used, and the data files mentioned therein. In particular, you can get a file containing all good words of the biological section, already factored as above (63 KB).
An unexpected feature of this decomposition is that surprisingly few prefixes and suffixes occur with significant frequency. As can be seen from the above table, the distribution of any of these components falls off quite abruptly.
Another non-trivial feature of this decomposition is that virtually all words have midfixes that consist entirely of hard letters. The exceptions are quite rare (see below). If the words were random strings, we would expect a substantial number of words with hard-soft-hard sequences.
There are 74 distinct anomalous (soft-containing) midfixes, or 88 if we count repeated occurrences. Here they are:
4  -polche-      1  -eat-         1  -palk-        1  -shocphe-
4  -shok-        1  -eedee-       1  -palshe-      1  -shoe-
3  -kede-        1  -eese-        1  -pdalsh-      1  -shoksh-
3  -polsh-       1  -kalch-       1  -pockh-       1  -shot-
2  -chedche-     1  -keedyqok-    1  -pok-         1  -talshe-
2  -chok-        1  -keeylshe-    1  -poldak-      1  -tchdolt-
2  -polch-       1  -keeyshe-     1  -poldshe-     1  -tchot-
2  -talsh-       1  -keylch-      1  -polk-        1  -teae-
1  -cheak-       1  -kok-         1  -polkee-      1  -tedee-
1  -chedch-      1  -kolch-       1  -polshe-      1  -teyte-
1  -chedyk-      1  -kolche-      1  -poltesh-     1  -tocthe-
1  -cheok-       1  -kolk-        1  -porshe-      1  -tok-
1  -cheolch-     1  -kolsh-       1  -psche-       1  -tolke-
1  -chlchpshee-  1  -kop-         1  -pyke-        1  -torolsh-
1  -cholche-     1  -korch-       1  -shecthedch-  1  -tot-
1  -cholkeee-    1  -kot-         1  -sheok-       1  -tsheokee-
1  -chop-        1  -kych-        1  -sheyk-       1  -tyot-
1  -chot-        1  -kylk-        1  -shockh-      1  -tyqok-
1  -chytee-      1  -palch-
Note that the 24 root occurrences that begin with p are listed here only because I assumed that p was always a "hard" letter. But we have conjectured before that p is a sort of "joker", probably an "ornate capital" that can be used for several distinct letters, much as the "gallows" in Cappelli's illustration.
So the ps above may well be soft letters, perhaps ds or qs. In that case they should have been parsed as part of the prefix, leaving a kosher hard-only midfix.
So we are left with 64 occurrences of truly anomalous words. That is only 1% of the sample words, and seems well within the range of transcription errors.
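The arithmetic behind that estimate can be rechecked as follows. The counts 88 and 24 are taken from the text; the sample size is an assumption, namely that it equals the sum of the two table totals (4666 factored words plus 1516 unifixes).

```python
anomalous_occurrences = 88   # anomalous midfix occurrences, from the text
p_initial_occurrences = 24   # those whose midfix begins with p, from the text

# ASSUMPTION: the sample is the factored words plus the unifixes,
# i.e. the two TOTAL figures of the frequency table.
sample_size = 4666 + 1516

remaining = anomalous_occurrences - p_initial_occurrences
share = 100 * remaining / sample_size

print(remaining, "occurrences,", round(share, 2), "% of", sample_size, "words")
```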
In particular, note that several of them contain embedded qs and ys, which are notoriously word-initial and word-final. Therefore, those exceptions may be the result of lost word breaks. For instance, the -chedyk- root comes from the word "chedykar" which may well be a "chedy" and a "kar" (two fairly common words) run together.
It seems that prefixes, suffixes, and unifixes can be further decomposed into a sequence of EVA letter groups, which themselves are drawn from a limited repertoire. Some common soft-letter groups are
am ar al om or ol ain aiin oin oiin
and there seem to be fairly strong restrictions as to how these groups and other soft letters can be concatenated. For example,
As we all know, q is almost always word-initial, and followed by o.
Similarly, y, m, and n are almost always word-final.
There are several pairs of soft letters that, like qo, behave almost as single letters, e.g. { ar am air dy ol or ... }.
The midfixes too seem to be composed out of a small number of building blocks, where each block is any of the letters
k t sh ch cth ckh
followed by zero, one, or two e characters. Midfixes with three or more consecutive es, or beginning with e, can be explained as mis-transcriptions of other characters, chiefly ch. In fact, such errors may be the source for many of the ee groups seen in the midfix. (Note that -ee- is rare but -e- is rarer still.)
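The block structure just described can be written as a regular expression. The sketch below is one possible encoding, not something taken from my notebook; the names BLOCK, MIDFIX_RE and is_regular_midfix are hypothetical. It uses only the six groups listed above, so midfixes containing p or f (such as -pche-) are reported as irregular.

```python
import re

# One block: a core letter group followed by zero, one, or two e's.
# The three-letter groups are listed before "ch" for readability; the
# regex engine would backtrack to them in any case.
BLOCK = r"(?:cth|ckh|ch|sh|k|t)e{0,2}"
MIDFIX_RE = re.compile(rf"(?:{BLOCK})+")


def is_regular_midfix(midfix):
    """True if the midfix (written without hyphens) is a sequence of blocks."""
    return MIDFIX_RE.fullmatch(midfix) is not None


for m in ["tee", "sheckh", "chckh", "keee", "ee", "pche"]:
    print(m, is_regular_midfix(m))
```

With this pattern, -keee- fails because of its third e, and -ee- fails because it does not start with one of the listed groups, in line with the mis-transcription remark above.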
The VMs is written in cypher. I will leave this hypothesis to the crypto experts.
The Voynich "words" are syllables; the two classes of letters defined above are basically the vowels and consonants.
Which class is which? Note that there are many words made entirely of soft letters, but no words made entirely of hard letters. Also, the empty prefix occurs very often, while the empty midfix and suffix are rare. Thus it seems that
Note that the soft letters may include sounds like "y", "w", "s", "l", "n", "m" which may work as vowel modifiers rather than consonants proper.
Keeping this in mind, the statistics for syllables of each type are:
type  freq  perc
----  ----  ----
V     1516  24 %
CV    1849  30 %
VCV   2797  45 %
VC      20   0 %
C       10   0 %
Here V stands for one or more vowels, C for one or more consonants.
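For illustration, the type of a factored word can be derived mechanically from its three parts. This sketch assumes the reading in which soft letters are the vowels (so prefix and suffix map to V) and hard letters are the consonants (so the midfix maps to C); the function name is hypothetical.

```python
def syllable_type(prefix, midfix, suffix):
    """Map a factored word to V, CV, VCV, VC or C.

    ASSUMPTION: soft parts (prefix, suffix, unifix) count as vowels (V)
    and the hard midfix counts as consonants (C).
    """
    if not midfix:
        # All-soft word (unifix): a run of vowels only.
        return "V"
    return ("V" if prefix else "") + "C" + ("V" if suffix else "")


print(syllable_type("qo", "tee", "dy"))
print(syllable_type("", "k", "dy"))
print(syllable_type("ol", "", ""))
```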
Note that there are about 10-12 significant prefixes, and about 20 significant suffixes, which seems right for many languages, including English (12 vowel sounds, a couple dozen vowel clusters).
The number of consonants seems a bit too high: around 20 "simple" consonants, plus a long tail of consonant pairs.
A problem with this theory is why the author would choose to mark off syllable breaks instead of word breaks.
Voynichese is a tonal language like Chinese or Vietnamese. This is a variant of the "syllable" theory above; the difference is mainly that some of the letters (perhaps the prefix) would have to indicate the tones.
This alternative has the merit that, in Chinese, the syllables are indeed the natural unit of text.
Voynichese is an agglutinative language like Turkish, Nahuatl, Quechua, etc: the "hard" letters are the stem of the word, and the soft letters are modifying affixes.
Voynichese is a Semitic language like Arabic or Hebrew; the prefix, midfix, and suffix correspond to the three basic consonants, and attached vowels.