Hacking at the Voynich manuscript - Side notes
017 OKOKOKO: The fine structure of Voynichese words

Last edited on 1999-02-02 14:01:55 by stolfi

[ A first version of this note was posted around 1998-03-11, to the
  voynich mailing list. This version was extensively revised between
  1998-03-21 and 1998-03-29. The section about word and line breaks
  was added on 1999-02-01. ]

[ If you decide to print this note, be warned that some lines have
  almost 120 characters. ]

The basic QOIXEOIXEO paradigm
-----------------------------

  Let "X" be any set of letters. Any string whatsoever can be broken
  into zero or more "X"s separated by (possibly empty) runs of letters
  that are not "X"s:

    N X N X N X N ... X N

  where "X" represents exactly one letter from the set, and "N" is any
  string (possibly empty) of non-"X" letters.

  Now let's apply this decomposition to the Voynichese words, using
  as "X" the set of letters

    { sh ch ee k ckh ck ikh t cth ct ith f cfh cf ifh p cph cp iph d r l s g m n }

  (I am using the basic EVA alphabet, without capitals).

  It turns out that, for this choice of "X", the intervening "N"
  strings are highly constrained. In fact, most words can be
  decomposed as

    Q O I X E O I X E O I X E O ... I X E O

  where

    Q is empty or "q";
    O is zero or more elements from the set A = { a o y };
    I is empty, or one of { i ii iii };
    E is empty, or "e".

The QOKOKOKO schema
-------------------

  In fact, we can constrain these pieces even more. With very few
  exceptions,

    "E" may be non-empty only after  { sh ch ee k ckh t cth p cph f cfh d }
    "I" may be non-empty only before { r l g m n s d }

  Note that "d" is exceptional in that it may be accompanied by either
  "e" or "i" strings; but the two are mutually exclusive. (In fact the
  letter pairs "id" and "de" are both extremely rare.)

  That is, we can write the generic word as

    Q O K O K O K ... K O K O

  where O is as above, and K is one of the "main elements"

    { k    t    p    f    ke   te   pe   fe
      ckh  cth  cph  cfh  ckhe cthe cphe cfhe
      ikh  ith  iph  ifh
      ck   ct   cf   cp
      sh   ch   ee   she  che  eee
      de
      d    r    l    g    m    n    s
      id   ir   il   ig   im   in   is
      iid  iir  iil  iig  iim  iin  iis
      iiid iiir iiil iiig iiim iiin iiis }

  Note that

    * The letters "p" and "f" are probably ornate versions of other
      letters: most likely "k" and "t", but perhaps others.

    * Various statistics suggest that "k" and "t" may be the same letter.

    * Ditto for "p" and "f".

    * Ditto for "y" and "o".

    * Ditto for "g" and "m".

    * The letter "q" does not seem to be part of the word; it
      may be an abbreviation for "and".

    * The groups { ikh ith iph ifh } may be equivalent to
      { ckh cth cph cfh }, respectively.

    * Instances of { ee eee } may be instances of { ch che } with
      missing ligature.

  Finally, many of the "K" elements are so rare that they are probably
  errors. If we consider only elements with frequency 0.1% or higher,
  and exclude the elements with "i*h", "p", and "f", we are left with
  only 25 "significant" elements:

    K* = { k   ke  ckh ckhe t   te  cth cthe
           ch  che sh  she  ee  eee
           l   m   s   d    n   r
           in  ir  iin iir  iiin }

Parsing ambiguities
-------------------

  Note that the inclusion in "X" of the groups { ikh ith iph ifh }
  does not create any ambiguity with the "I" modifiers, since the
  presence of "h" after a tall letter forces one to parse the
  preceding letter (which must be "i" or "c") as part of the same
  element. Indeed, the elements { ikh ith iph ifh } may be merely
  calligraphic variants of { ckh cth cph cfh }, and are the only
  instances where the letters { k t p f } may be preceded by "i".

  On the other hand, including the string "ee" in the set "X" leads to
  an ambiguity in the parsing of words with three or more consecutive
  "e"s. For example, "okeeedy" could be parsed either as

    Q  O  I  X  E  O  I  X  E  O  I  X  E  O
    -  o  -  k  -  -  -  ee e  -  -  d  -  y

  or as

    Q  O  I  X  E  O  I  X  E  O  I  X  E  O
    -  o  -  k  e  -  -  ee -  -  -  d  -  y

  Several Voynichologists (Rene and Dennis, among others) are unhappy
  about this ambiguity; they favor excluding "ee" from the set "X",
  and perhaps allowing "ee" and "eee" as possible "E" modifiers.

  But there are reasons for including "ee" in "X". For one thing,
  while an isolated "e" is pretty common within words, it practically
  never occurs right after { d r l } or before the first "X"; but "ee"
  and "eee" often occur in those positions. That is, while a single
  "e" must always be attached to a preceding "X", the groups "ee" and
  "eee" can stand on their own, like the other "X" groups.

  (One could argue that the "c" in the elements { ck ct cf cp }, which
  may occur before any other "X" group in some words, is in fact an
  instance of "e". However, in the few cases I have checked, the "c"
  has a noticeable ligature, even though the matching "h" is missing.
  So it seems indeed valid to write those combinations with "c" and
  not with "e".)

  One must keep in mind also that an "ee" group may well be a "ch"
  element whose ligature was omitted (by the scribe or the
  transcriber). Similarly, the very rare occurrences of "se" may well
  be instances of "sh" with missing ligature.

  Conversely, it may be that the `natural' form of the letters
  { ch che sh she } is { ee eee se see }, respectively; and the
  ligatures are optional calligraphic devices added to clarify the
  parsing, almost as an afterthought.

Parsing the text
----------------

  Words that fail the "QOKOKOKO" pattern are quite rare.
  Let's count them in the following files:

    hea-u.wds   a few herbal-A pages, which I carefully transcribed
                from Jacques Guy's images;

    hea-f.wds   herbal-A pages in Friedman's transcription;

    heb-f.wds   herbal-B pages in Friedman's transcription;

    bio-f.wds   biological (language B) pages in Friedman's transcription;

    vdp-z.wds   a list of all words that occur at least twice,
                transcribed by the EVMT team.

  (The "-f" files were created between 97-11-11 and 98-11-12, as
  {hea,heb,bio}-f-gut.wds, from Landini's interlinear converted to EVA.
  The last one was created by expanding a word frequency list posted
  by Rene Zandbergen on march/98; an entry "N W" in that list
  generated "N" copies of word "W" in file "vdp-z.wds".)

    foreach file ( hea-u hea-f heb-f bio-f vdp-z )
      cat ${file}.wds \
        | egrep -v '[*]' \
        | sed -f factor-OK.sed \
        > ${file}.fac
      cat ${file}.fac \
        | egrep -e '[#@%=]' \
        > ${file}-weird.fac
      dicio-wc ${file}.fac ${file}-weird.fac
    end

    --- factor-OK.sed ------------------------
    # Map "sh", "ch", and "ee" to single letters to simplify the parsing.
    # Note that "eee" groups are paired off from left end.
    s/ch/C/g
    s/sh/S/g
    s/ee/E/g
    # Map platformed and half-platformed letters to capitals to simplify the parsing:
    s/ckh/K/g
    s/cth/T/g
    s/cfh/F/g
    s/cph/P/g
    #
    s/ikh/G/g
    s/ith/H/g
    s/ifh/M/g
    s/iph/N/g
    #
    s/ck/U/g
    s/ct/V/g
    s/cf/X/g
    s/cp/Y/g
    # Put down scanning head in "@" state
    s/$/@/
    :x
    # If in "@" state, copy "[aoy]" group, and switch to "#" state:
    s/\([aoy][aoy]*\)@/#\1/
    s/@/#_/
    # If in "#" state, copy next main letter and "e" complements,
    # insert "}" delimiter, and switch to "%" or "=" state depending on
    # whether "i"s are allowed or not:
    s/\([CSEktfpKTFPd]e\)#/=\1}/g
    s/\([CSEktfpKTFPGHMNUVXY]\)#/=\1}/g
    s/\([rlgmnsd]\)#/%\1}/g
    # If in "%" state, attach "i" string to group, go to "=" state:
    s/\(iii\)%/=\1/
    s/\(ii\)%/=\1/
    s/\(i\)%/=\1/
    s/%/=/
    # If in "=" state, insert "{" delimiter, and go back to "@" state:
    s/=/@{/
    tx
    # We should exit the loop only in the "#" state.
    # Split "q" prefix and discard scanning head if done:
    s/^[q]#/{q}/
    s/^#/{_}/
    # Unfold letter folding:
    s/U/ck/g
    s/V/ct/g
    s/X/cf/g
    s/Y/cp/g
    #
    s/G/ikh/g
    s/H/ith/g
    s/M/ifh/g
    s/N/iph/g
    #
    s/K/ckh/g
    s/T/cth/g
    s/P/cph/g
    s/F/cfh/g
    #
    s/C/ch/g
    s/S/sh/g
    s/E/ee/g
    ------------------------------------------

     lines   words     bytes file
    ------ ------- --------- ------------
       803     803     11751 hea-u.fac
         0       0         0 hea-u-weird.fac

     lines   words     bytes file
    ------ ------- --------- ------------
      7812    7812    113448 hea-f.fac
        93      93      1144 hea-f-weird.fac

     lines   words     bytes file
    ------ ------- --------- ------------
      3223    3223     47932 heb-f.fac
        46      46       564 heb-f-weird.fac

     lines   words     bytes file
    ------ ------- --------- ------------
      6182    6182     90650 bio-f.fac
        39      39       474 bio-f-weird.fac

     lines   words     bytes file
    ------ ------- --------- ------------
     28939   28939    420444 vdp-z.fac
       142     142      1339 vdp-z-weird.fac

  So, the exceptions to the QOKOKOKO pattern are less than 1.5% in
  Friedman's transcription, less than 0.5% in Rene's list, and none
  in my own transcription.

  (The last result is not that impressive, of course. Even though I
  did my transcription before I had worked out the structure above, I
  already had some intuition about it, so my reading was not
  impartial.)

The exceptions in Rene's word list
----------------------------------

  Here is a breakdown of the 142 words (counting multiple occurrences)
  in Rene's file that did not fit the QOKOKOKO pattern. (Let's keep in
  mind that Rene's file only includes words that occur at least twice.)

  It seems that some of these exceptions can be explained as
  "mutations" from other letters: scribal errors, calligraphic
  variations, pen running out of ink, vellum defects, spots, fading,
  and of course poor copy quality. Some are harder to explain,
  however, and may require extending the basic schema.

  * Words with groups { ckhh cthh cphh cfhh } (42 cases):

      chckhhy(9) cthhy(4) chcthhy(4) shcthhy(3) qcthhy(3) ckhhy(3)
      chcphhy(3) chcfhhy(3) shocthhy(2) shcphhy(2) qcphhedy(2)
      ockhhy(2) ocfhhy(2)

    These exceptions account for 0.15% of all words. I propose that
    these are calligraphic accidents; that is, "ckhh" is a "ckhe"
    whose ligature was overextended, and similarly for the other
    groups.

  * Words with "oe" (41 cases):

      qoedy(5) qoedaiin(3) oedy(2) qoeol(5) qoear(2) qoeor(2)
      qoekeey(3) oekaiin(3) qoekol(2) oekeey(2) qoekedy(2) oekey(2)
      oekeody(2) choety(2) choeky(2) sheoeky(2)

    These exceptions account for approximately 0.15% of all words.
    The cases with "eke" could be explained as instances of "ckh"
    with missing ligature. The others may be true exceptions to the
    schema. Note that the "oe" occurs only at the beginning of the
    word, or after the initial "q", or after an initial "ch" or "she"
    (which, in language A, seem to behave like "q" to some extent).

  * Words beginning with "e" or "qe" (20 cases):

      ety(6) qekeey(3) qekchdy(3) qety(2) qekor(2) qekaiin(2) etaiin(2)

    These word-initial "e"s could be explained as partly erased
    instances of { a o y }. Note that if we replace the initial "e" by
    "o" or "y" we get fairly common words in all these cases.

  * Words with the special letters "x" and "v" (20 cases):

      x(10) v(8) xar(2)

    Note that these letters (picnic table and caret) occur mostly as
    isolated letters. Therefore, they may be non-phonetic symbols, or
    abbreviations.

  * Words with "e" after "s" (5 cases):

      chsey(3) shese(2)

    These exceptions could be instances of "sh" without the ligature.

  * Isolated "e"s (4 cases):

      e(4)

    These exceptions could be instances of "s" with missing plume.

  * Words with "eeb" (3 cases):

      cheeb(3)

    I propose that "eeb" is merely a calligraphic variation of "an"
    or "iin".

  * Words with "ykh" (3 cases):

      ykhey(3)

    I can't think of a good explanation for these cases.

  * Letter "o" before "q" (2 cases):

      oqokain(2)

    Perhaps the extra "o" is a separate word, or part of the previous
    one?

  * Letter "i" in word-final position (2 cases):

      okai(2)

    These exceptions could be truncated "in" or "ir" groups.

Frequencies for "K" elements
----------------------------

  Here are the statistics for the "K" groups.

    foreach file ( hea-u hea-f heb-f bio-f vdp-z )
      cat ${file}.fac \
        | egrep -v '[@%#=]' \
        | sed \
            -e 's/^[^{}]*{//g' \
            -e 's/}[^{}]*$//g' \
            -e 's/}[^{}]*{/./g' \
        | tr '.' '\012' \
        | egrep -e '.' \
        | sort | uniq -c | expand | sort -b +0 -1nr \
        | compute-freqs | sed -e 's/^ //g' \
        > ${file}-k.frq
      dicio-wc ${file}-k.frq
    end

     lines file
    ------ ------------
        39 hea-u-k.frq
        41 hea-f-k.frq
        36 heb-f-k.frq
        35 bio-f-k.frq
        44 vdp-z-k.frq

    multicol {hea-u,hea-f,heb-f,bio-f,vdp-z}-k.frq

  hea-u              hea-f              heb-f              bio-f              vdp-z
  ----------------   ----------------   ----------------   ----------------   ----------------
    752 0.304 _       7024 0.297 _       2856 0.284 _       4627 0.242 _      24167 0.273 _
    292 0.118 ch      2524 0.107 ch      1459 0.145 d       2512 0.131 d       9928 0.112 d
    216 0.087 d       2194 0.093 d        765 0.076 k       2140 0.112 l       7523 0.085 l
    183 0.074 l       1702 0.072 l        608 0.060 l       1516 0.079 q       6470 0.073 k
    178 0.072 r       1466 0.062 r        600 0.060 r       1422 0.074 k       4855 0.055 r
    119 0.048 t       1257 0.053 k        557 0.055 ch       828 0.043 che     4698 0.053 ch
    102 0.041 k       1177 0.050 t        424 0.042 iin      804 0.042 r       4630 0.052 q
    101 0.041 iin     1090 0.046 iin      366 0.036 che      775 0.041 ee      3641 0.041 t
     93 0.038 sh       832 0.035 sh       353 0.035 t        723 0.038 iin     3545 0.040 iin
     76 0.031 s        695 0.029 q        321 0.032 q        670 0.035 she     3384 0.038 ee
     58 0.023 cth      632 0.027 s        250 0.025 ee       615 0.032 t       3328 0.038 che
     51 0.021 che      464 0.020 che      207 0.021 ke       476 0.025 ch      1663 0.019 she
     51 0.021 q        453 0.019 cth      176 0.017 s        377 0.020 ke      1644 0.019 sh
     36 0.015 m        353 0.015 ee       176 0.017 she      357 0.019 s       1608 0.018 s
     23 0.009 ee       253 0.011 m        175 0.017 sh       316 0.017 sh      1428 0.016 in
     22 0.009 in       216 0.009 p        123 0.012 p        194 0.010 te      1370 0.015 ke
     20 0.008 p        187 0.008 she      113 0.011 m        168 0.009 p        789 0.009 te
     19 0.008 she      186 0.008 in       110 0.011 te       142 0.007 ckh      734 0.008 p
     17 0.007 ckh      176 0.007 ckh       89 0.009 ckh      113 0.006 in       632 0.007 m
     11 0.004 te       130 0.005 ke        74 0.007 f         81 0.004 cth      573 0.006 cth
      8 0.003 ke        78 0.003 cph       67 0.007 in        72 0.004 m        511 0.006 ckh
      7 0.003 cph       75 0.003 f         51 0.005 ir        42 0.002 ckhe     379 0.004 ir
      5 0.002 ir        75 0.003 n         38 0.004 cth       31 0.002 ir       223 0.003 eee
      4 0.002 ct        70 0.003 te        24 0.002 cthe      29 0.002 cthe     177 0.002 ckhe
      4 0.002 iiin      65 0.003 cthe      19 0.002 ckhe      21 0.001 eee      134 0.002 cthe
      4 0.002 n         59 0.002 ir        13 0.001 eee       21 0.001 f        125 0.001 f
      3 0.001 cthe      57 0.002 eee       13 0.001 iir       12 0.001 cphe     116 0.001 iiin
      3 0.001 de        47 0.002 ckhe       9 0.001 iiin      10 0.001 n         95 0.001 iir
      3 0.001 eee       27 0.001 cfh        6 0.001 cphe       7 0.000 cph       82 0.001 cph
      3 0.001 f         24 0.001 iir        5 0.000 cfh        7 0.000 iiin      67 0.001 n
      3 0.001 iir       21 0.001 cphe       5 0.000 cph        5 0.000 de        43 0.000 cphe
      2 0.001 cfh       20 0.001 iiin       5 0.000 de         4 0.000 cfh       26 0.000 g
      2 0.001 ck         8 0.000 de         5 0.000 n          3 0.000 il        21 0.000 im
      1 0.000 cf         6 0.000 iim        4 0.000 cfhe       2 0.000 iir       18 0.000 cfh
      1 0.000 ckhe       3 0.000 cfhe       2 0.000 id         1 0.000 pe        12 0.000 ikh
      1 0.000 cphe       3 0.000 iid        1 0.000 iil                          10 0.000 ct
      1 0.000 iid        3 0.000 iil                                              8 0.000 ith
      1 0.000 iim        2 0.000 id                                               7 0.000 ck
      1 0.000 im         2 0.000 iis                                              7 0.000 iid
                         1 0.000 il                                               7 0.000 il
                         1 0.000 is                                               2 0.000 cfhe
                                                                                  2 0.000 de
                                                                                  2 0.000 iim
                                                                                  2 0.000 iis

  In these tables, the "_" entry represents the empty "Q" slot.
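
  For readers who would rather not trace the sed state machine, the
  factoring and the element tally above can be sketched in Python. This
  is only a minimal sketch under stated assumptions: the input is taken
  to be lower-case basic EVA, the letter folding is copied from
  factor-OK.sed, and words that the sed script would flag as weird come
  back as None. The names `factor` and `element_counts` are
  illustrative and are not part of the original tool set.

```python
import re
from collections import Counter

# Letter folding copied from factor-OK.sed, applied in the same order.
FOLD = [("ch", "C"), ("sh", "S"), ("ee", "E"),
        ("ckh", "K"), ("cth", "T"), ("cfh", "F"), ("cph", "P"),
        ("ikh", "G"), ("ith", "H"), ("ifh", "M"), ("iph", "N"),
        ("ck", "U"), ("ct", "V"), ("cf", "X"), ("cp", "Y")]
UNFOLD = {cap: grp for grp, cap in FOLD}

# One "K" element in folded form: main letter plus "e", bare main letter,
# or up to three "i"s before one of { r l g m n s d }.
ELEMENT = r"(?:[CSEktfpKTFPd]e|[CSEktfpKTFPGHMNUVXY]|i{0,3}[rlgmnsd])"
WORD = re.compile(rf"(q?)([aoy]*)((?:{ELEMENT}[aoy]*)*)")
PAIR = re.compile(rf"({ELEMENT})([aoy]*)")


def unfold(s):
    """Expand the folded capitals back into their EVA letter groups."""
    return "".join(UNFOLD.get(c, c) for c in s)


def factor(word):
    """Return (q, initial_O, [(K, O), ...]), or None if the word does not fit."""
    folded = word
    for grp, cap in FOLD:
        folded = folded.replace(grp, cap)   # "eee" becomes "Ee", as in sed
    m = WORD.fullmatch(folded)
    if m is None:
        return None
    q, o0, rest = m.groups()
    return q, o0, [(unfold(k), o) for k, o in PAIR.findall(rest)]


def element_counts(words):
    """Tally the "K" elements of all words that fit the schema."""
    counts = Counter()
    for w in words:
        parsed = factor(w)
        if parsed is not None:
            counts.update(k for k, _ in parsed[2])
    return counts
```

  For example, factor("okeeedy") gives ("", "o", [("k", ""), ("eee", ""),
  ("d", "y")]), which is the first of the two parses discussed under
  "Parsing ambiguities", while factor("oedy") gives None.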

  Let's extract from those tables the elements that are not in the
  reduced set "K*" and are not simple uses of the `jokers' "p" and "f":

    foreach file ( hea-u hea-f heb-f bio-f vdp-z )
      cat ${file}-k.frq \
        | egrep -v ' (([ktpf]|c[ktpf]h|[cs]h|ee)(e|)|[_qlmsdnr]|i[nr]|ii[nr]|iiin)$' \
        > ${file}-knr.frq
    end

    multicol {hea-u,hea-f,heb-f,bio-f,vdp-z}-knr.frq

  hea-u              hea-f              heb-f              bio-f              vdp-z
  ---------------    ---------------    ---------------    --------------     ---------------
      4 0.002 ct         8 0.000 de         5 0.000 de         5 0.000 de        26 0.000 g
      3 0.001 de         6 0.000 iim        2 0.000 id         3 0.000 il        21 0.000 im
      2 0.001 ck         3 0.000 iid        1 0.000 iil                          12 0.000 ikh
      1 0.000 cf         3 0.000 iil                                             10 0.000 ct
      1 0.000 iid        2 0.000 id                                               8 0.000 ith
      1 0.000 iim        2 0.000 iis                                              7 0.000 ck
      1 0.000 im         1 0.000 il                                               7 0.000 iid
                         1 0.000 is                                               7 0.000 il
                                                                                  2 0.000 de
                                                                                  2 0.000 iim
                                                                                  2 0.000 iis

  Recall that strings with three or more "e"s have ambiguous parsing,
  which affects the statistics of "ee" and all elements with the "e"
  modifier. The factor-OK script arbitrarily pairs the "e"s from the
  left, so that such strings are parsed as zero or more "ee"s
  followed by one "ee" or "eee".

  To assess the implications of this ambiguity, let's check how many
  ambiguous strings we have in each file:

    foreach file ( hea-u hea-f heb-f bio-f vdp-z )
      cat ${file}.wds \
        | egrep -v '[*]' \
        | sed -e 's/[^e]/./g' \
        | tr '.' '\012' \
        | egrep '.' \
        | sort | uniq -c | expand | sort +0 -1nr \
        | compute-freqs | sed -e 's/^ //g' \
        > ${file}-eee.frq
      dicio-wc ${file}-eee.frq
    end

    multicol {hea-u,hea-f,heb-f,bio-f,vdp-z}-eee.frq

  hea-u              hea-f              heb-f              bio-f              vdp-z
  ---------------    ----------------   ---------------    ---------------    ---------------
     97 0.789 e       1069 0.721 e        952 0.782 e       2187 0.732 e       7593 0.677 e
     23 0.187 ee       355 0.239 ee       253 0.208 ee       779 0.261 ee      3395 0.303 ee
      3 0.024 eee       57 0.038 eee       13 0.011 eee       21 0.007 eee      223 0.020 eee
                         2 0.001 eeee

  Note that, surprisingly, there are practically no words with four or
  more "e"s in a row.
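
  The run-length tally above can also be reproduced without the local
  helpers (compute-freqs, multicol). The following is a minimal sketch,
  assuming the words are already available as a list of lower-case EVA
  strings; the function name `e_run_frequencies` is hypothetical.

```python
import re
from collections import Counter


def e_run_frequencies(words):
    """Tally maximal runs of "e" and return {run: (count, fraction)}."""
    runs = Counter(
        run
        for w in words
        if "*" not in w                 # skip unreadable words, like egrep -v '[*]'
        for run in re.findall(r"e+", w)
    )
    total = sum(runs.values())
    return {run: (n, n / total) for run, n in runs.most_common()}
```

  Each key of the result is one maximal run ("e", "ee", "eee", ...), so
  a key of length four or more would correspond to the "eeee" row of
  the table.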

  My factoring script will parse the "eee" strings as one "eee"
  element. In all files, the frequency of the "eee" element does not
  exceed 0.003 (i.e. 0.3% of the total "K" elements). Therefore, if I
  had used the other parsing ("e" + "ee"), the frequencies of "ee" and
  all other "e"-modified elements would increase by at most 0.003 in
  total.

  By the way, the low frequency of "eee" probably means that its
  ambiguity would be no big problem for the intended readers.

  In fact, the absence of "eeee"s could be explained by the following
  theory: the letters "ch" and "sh" are officially written "ee" and
  "se"; since that would lead to ambiguities, the scribe routinely
  (but not invariably) adds ligatures to indicate the intended
  grouping.

Frequencies of "K" elements in languages A and B
------------------------------------------------

  In the "K" frequency tables above we can already see a marked
  difference between languages A and B. Looking only at the reduced
  element subset K*, plus "q" and "_" (meaning no "q"):

    foreach file ( hea-f heb-f )
      cat ${file}-k.frq \
        | egrep ' (([kt]|c[kt]h|[cs]h|ee)(e|)|[_qlmsdnr]|i[nr]|ii[nr]|iiin)$' \
        > ${file}-kr.frq
    end

    multicol {hea-f,heb-f}-kr.frq

  hea-f              heb-f
  ----------------   ----------------
   7024 0.297 _       2856 0.284 _
   2524 0.107 ch      1459 0.145 d
   2194 0.093 d        765 0.076 k
   1702 0.072 l        608 0.060 l
   1466 0.062 r        600 0.060 r
   1257 0.053 k        557 0.055 ch
   1177 0.050 t        424 0.042 iin
   1090 0.046 iin      366 0.036 che
    832 0.035 sh       353 0.035 t
    695 0.029 q        321 0.032 q
    632 0.027 s        250 0.025 ee
    464 0.020 che      207 0.021 ke
    453 0.019 cth      176 0.017 s
    353 0.015 ee       176 0.017 she
    253 0.011 m        175 0.017 sh
    187 0.008 she      113 0.011 m
    186 0.008 in       110 0.011 te
    176 0.007 ckh       89 0.009 ckh
    130 0.005 ke        67 0.007 in
     75 0.003 n         51 0.005 ir
     70 0.003 te        38 0.004 cth
     65 0.003 cthe      24 0.002 cthe
     59 0.002 ir        19 0.002 ckhe
     57 0.002 eee       13 0.001 eee
     47 0.002 ckhe      13 0.001 iir
     24 0.001 iir        9 0.001 iiin
     20 0.001 iiin       5 0.000 n

  There is also a less marked but still significant difference between
  herbal-B and bio-B:

    foreach file ( heb-f bio-f )
      cat ${file}-k.frq \
        | egrep ' (([kt]|c[kt]h|[cs]h|ee)(e|)|[_qlmsdnr]|i[nr]|ii[nr]|iiin)$' \
        > ${file}-kr.frq
    end

    multicol {heb-f,bio-f}-kr.frq

  heb-f              bio-f
  ----------------   ----------------
   2856 0.284 _       4627 0.242 _
   1459 0.145 d       2512 0.131 d
    765 0.076 k       2140 0.112 l
    608 0.060 l       1516 0.079 q
    600 0.060 r       1422 0.074 k
    557 0.055 ch       828 0.043 che
    424 0.042 iin      804 0.042 r
    366 0.036 che      775 0.041 ee
    353 0.035 t        723 0.038 iin
    321 0.032 q        670 0.035 she
    250 0.025 ee       615 0.032 t
    207 0.021 ke       476 0.025 ch
    176 0.017 s        377 0.020 ke
    176 0.017 she      357 0.019 s
    175 0.017 sh       316 0.017 sh
    113 0.011 m        194 0.010 te
    110 0.011 te       142 0.007 ckh
     89 0.009 ckh      113 0.006 in
     67 0.007 in        81 0.004 cth
     51 0.005 ir        72 0.004 m
     38 0.004 cth       42 0.002 ckhe
     24 0.002 cthe      31 0.002 ir
     19 0.002 ckhe      29 0.002 cthe
     13 0.001 eee       21 0.001 eee
     13 0.001 iir       10 0.001 n
      9 0.001 iiin       7 0.000 iiin
      5 0.000 n          2 0.000 iir

  However, most of that difference disappears if we:

    (1) identify the letters { k t p f }, which we have good reasons
        to believe are the same letter;

    (2) omit the letter "q", which is believed to be a symbol for
        "and", and hence might be correlated with subject matter;

    (3) identify "ee" with "ch".

    foreach file ( hea-u hea-f heb-f bio-f vdp-z )
      cat ${file}.fac \
        | egrep -v '[@%#=]' \
        | sed \
            -e 's/^[^{}]*{//g' \
            -e 's/}[^{}]*$//g' \
            -e 's/}[^{}]*{/./g' \
            -e 's/[ktpf]/k/g' \
            -e 's/ee/ch/g' \
            -e 's/q//g' \
        | tr '.' '\012' \
        | egrep -e '.' \
        | sort | uniq -c | expand | sort -b +0 -1nr \
        | compute-freqs | sed -e 's/^ //g' \
        | egrep ' (([kt]|c[kt]h|[cs]h|ee)(e|)|[_qlmsdnr]|i[nr]|ii[nr]|iiin)$' \
        > ${file}-krr.frq
      dicio-wc ${file}-krr.frq
    end

    multicol {hea-u,hea-f,heb-f,bio-f,vdp-z}-krr.frq

  hea-u              hea-f              heb-f              bio-f              vdp-z
  ----------------   ----------------   ----------------   ----------------   ----------------
    752 0.310 _       7024 0.306 _       2856 0.293 _       4627 0.263 _      24167 0.288 _
    315 0.130 ch      2877 0.125 ch      1459 0.150 d       2512 0.143 d      10970 0.131 k
    244 0.101 k       2725 0.119 k       1315 0.135 k       2226 0.126 k       9928 0.118 d
    216 0.089 d       2194 0.096 d        807 0.083 ch      2140 0.122 l       8082 0.096 ch
    183 0.075 l       1702 0.074 l        608 0.062 l       1251 0.071 ch      7523 0.089 l
    178 0.073 r       1466 0.064 r        600 0.062 r        849 0.048 che     4855 0.058 r
    101 0.042 iin     1090 0.047 iin      424 0.043 iin      804 0.046 r       3551 0.042 che
     93 0.038 sh       832 0.036 sh       379 0.039 che      723 0.041 iin     3545 0.042 iin
     84 0.035 ckh      734 0.032 ckh      317 0.033 ke       670 0.038 she     2159 0.026 ke
     76 0.031 s        632 0.028 s        176 0.018 s        572 0.032 ke      1663 0.020 she
     54 0.022 che      521 0.023 che      176 0.018 she      357 0.020 s       1644 0.020 sh
     36 0.015 m        253 0.011 m        175 0.018 sh       316 0.018 sh      1608 0.019 s
     22 0.009 in       200 0.009 ke       137 0.014 ckh      234 0.013 ckh     1428 0.017 in
     19 0.008 ke       187 0.008 she      113 0.012 m        113 0.006 in      1184 0.014 ckh
     19 0.008 she      186 0.008 in        67 0.007 in        83 0.005 ckhe     632 0.008 m
      5 0.002 ckhe     136 0.006 ckhe      53 0.005 ckhe      72 0.004 m        379 0.005 ir
      5 0.002 ir        75 0.003 n         51 0.005 ir        31 0.002 ir       356 0.004 ckhe
      4 0.002 iiin      59 0.003 ir        13 0.001 iir       10 0.001 n        116 0.001 iiin
      4 0.002 n         24 0.001 iir        9 0.001 iiin       7 0.000 iiin      95 0.001 iir
      3 0.001 iir       20 0.001 iiin       5 0.001 n          2 0.000 iir       67 0.001 n

Statistics of "O" strings
-------------------------

  Now, what do we do with the "O" strings? Let's look at their
  statistics:

    foreach file ( hea-u hea-f heb-f bio-f vdp-z )
      cat ${file}.fac \
        | egrep -v '[@%#=]' \
        | sed -e 's/{[^{}]*}/./g' \
        | tr '.' '\012' \
        | egrep -e '.' \
        | sort | uniq -c | expand | sort -b +0 -1nr \
        | compute-freqs | sed -e 's/^ //g' \
        > ${file}-ooo.frq
      dicio-wc ${file}-ooo.frq
    end

     lines file
    ------ ------------
         9 hea-u-ooo.frq
        15 hea-f-ooo.frq
        11 heb-f-ooo.frq
        12 bio-f-ooo.frq
        11 vdp-z-ooo.frq

    multicol {hea-u,hea-f,heb-f,bio-f,vdp-z}-ooo.frq

  hea-u              hea-f              heb-f              bio-f              vdp-z
  --------------     ---------------    --------------     ---------------    --------------
   1364 0.551 _      12782 0.540 _       5371 0.533 _      10295 0.538 _      45585 0.514 _
    595 0.240 o       5444 0.230 o       1712 0.170 y       3558 0.186 o      18671 0.211 o
    262 0.106 y       3069 0.130 y       1616 0.160 o       3413 0.178 y      13615 0.154 y
    234 0.094 a       2188 0.092 a       1325 0.132 a       1835 0.096 a      10544 0.119 a
     11 0.004 oa        70 0.003 oa        16 0.002 oa         6 0.000 yo       171 0.002 oa
      6 0.002 oy        59 0.002 oy        11 0.001 oy         4 0.000 oy        51 0.001 oy
      2 0.001 ao        16 0.001 oo         8 0.001 oo         3 0.000 ay        23 0.000 oo
      2 0.001 oo        14 0.001 yo         6 0.001 yo         3 0.000 oa        12 0.000 yo
      1 0.000 yo         5 0.000 ay         2 0.000 ay         2 0.000 ao         6 0.000 ay
                         4 0.000 ya         1 0.000 ao         2 0.000 ya         6 0.000 ya
                         2 0.000 ao         1 0.000 ya         1 0.000 aoy        2 0.000 yy
                         2 0.000 yoa                           1 0.000 oaa
                         1 0.000 aa
                         1 0.000 oao
                         1 0.000 yay

  Thus, the only common alternatives are empty, "y", "a", and "o". In
  fact, as we know, the alternative "y" is common only in initial and
  final positions; and in those positions it seems to be equivalent
  to "o".

  Note that about half of the "O" slots are filled (i.e. the ratio
  K:O is about 2:1). Therefore, if the "K" elements were randomly
  mixed with "O" letters, the "O" slots should be about 67% empty, 22%
  single-letter, 7% double-letter, and 2% triple-letter. Instead we
  see about 50% empty, 50% single-letter, <1% double-letter, and <0.1%
  triple-letter.

  In fact, triple-letter "O"s are so rare that they can be assumed to
  be errors. In Rene's good-quality word list (vdp-z.wds) there are
  no triple-letter "O"s at all.

Statistics of "K" strings
-------------------------

  Let's now look at the clusters of "K" elements between consecutive
  non-empty "O"s.
  To reduce the size of the output, let's map the letters
  { k t p f } to "k", and "ch" to "ee":

    foreach file ( hea-f heb-f bio-f vdp-z )
      cat ${file}.fac \
        | egrep -v '[@%#=]' \
        | sed \
            -e 's/^{[q_]}//g' \
            -e 's/^_//g' \
            -e 's/_$//g' \
            -e 's/[oay]/./g' \
            -e 's/[{}]//g' \
            -e 's/[ktpf]/k/g' \
            -e 's/ch/ee/g' \
        | tr '.' '\012' \
        | egrep -e '.' \
        | sort | uniq -c | expand | sort -b +0 -1nr \
        | compute-freqs | sed -e 's/^ //g' \
        > ${file}-kkk.frq
      dicio-wc ${file}-kkk.frq
    end

     lines file
    ------ ------------
       257 hea-f-kkk.frq
       213 heb-f-kkk.frq
       265 bio-f-kkk.frq
       232 vdp-z-kkk.frq

    multicol {hea-f,heb-f,bio-f,vdp-z}-kkk.frq > multi-kkk.frq

  hea-f                      heb-f                      bio-f                      vdp-z
  ------------------------   ------------------------   ------------------------   ------------------------
   1733 0.131 d                 663 0.136 k                1424 0.165 l               5345 0.119 k
   1441 0.109 l                 568 0.116 r                1148 0.133 k               5296 0.118 l
   1380 0.104 r                 550 0.113 d                 729 0.084 r               4714 0.105 r
   1237 0.093 k                 419 0.086 iin               720 0.083 iin             4146 0.093 d
   1164 0.088 ee                408 0.084 l                 464 0.054 d               3534 0.079 iin
   1078 0.081 iin               172 0.035 ke_d              379 0.044 ke_d            1994 0.045 k_ee
    931 0.070 k_ee              161 0.033 k_ee_d            331 0.038 k_ee_d          1861 0.042 ee
    592 0.045 ckh               114 0.023 s                 260 0.030 s               1424 0.032 in
    553 0.042 sh                107 0.022 m                 258 0.030 she_d           1232 0.028 s
    426 0.032 s                 105 0.022 k_ee              230 0.027 eee_d           1068 0.024 k_ee_d
    235 0.018 m                  94 0.019 ee                175 0.020 k_ee            1052 0.023 eee
    229 0.017 eee                92 0.019 ke                148 0.017 eee             1036 0.023 ke
    179 0.014 ke                 87 0.018 ee_d              147 0.017 she              868 0.019 ke_d
    174 0.013 in                 77 0.016 eee_d             112 0.013 in               813 0.018 sh
    149 0.011 k_eee              70 0.014 eee               111 0.013 l_k              643 0.014 she
    133 0.010 she                63 0.013 in                104 0.012 ke               632 0.014 m
    114 0.009 d_ee               60 0.012 sh                 99 0.011 l_eee_d          631 0.014 eee_d
    112 0.008 ckhe               55 0.011 eee_k              87 0.010 k_eee_d          622 0.014 ckh
    110 0.008 k_sh               52 0.011 ee_ckh             67 0.008 m                459 0.010 she_d
    106 0.008 l_d                51 0.010 l_d                66 0.008 l_d              428 0.010 k_eee
     64 0.005 n                  50 0.010 ir                 65 0.008 sh               406 0.009 l_k
     58 0.004 ee_ckh             49 0.010 she                63 0.007 ee_ckh           398 0.009 k_eee_d
   .... ..... ..........       .... ..... ..........       .... ..... ..........      .... ..... ..........
      1 0.000 ckh_s_ee_s          1 0.000 ee_sh_d             2 0.000 l_l                6 0.000 she_ke
      1 0.000 ckh_sh              1 0.000 eee_ckh_d           2 0.000 l_sh_ee_s          5 0.000 d_sh_ee_d
      1 0.000 ckhe_iin            1 0.000 eee_ckhe            2 0.000 l_she_ckh          5 0.000 il
      1 0.000 ckhe_k_k_k_l        1 0.000 eee_ckhe_d          2 0.000 l_she_k            5 0.000 sh_ee_k_ee
      1 0.000 d_ee_ee_ckhe        1 0.000 eee_ee              2 0.000 r_ee_r             4 0.000 d_ee_ee_d
      1 0.000 d_ee_ee_s           1 0.000 eee_eee             2 0.000 r_eee_k            4 0.000 d_sh_d
      1 0.000 d_ee_eee            1 0.000 eee_k_ee_ee         2 0.000 r_k                4 0.000 ee_ee_k_ee
   .... ..... ..........       .... ..... ..........       .... ..... ..........      .... ..... ..........

  Obviously, groups of two or more consecutive "K" elements are quite
  common. Here is the frequency for each repeat count:

    foreach file ( hea-f heb-f bio-f vdp-z )
      cat ${file}.fac \
        | egrep -v '[@%#=]' \
        | sed \
            -e 's/^{[q_]}//g' \
            -e 's/^_//g' \
            -e 's/_$//g' \
            -e 's/[oay]/./g' \
            -e 's/[{}]//g' \
            -e 's/[a-z][a-z]*/x/g' \
        | tr '.' '\012' \
        | egrep -e '.' \
        | sort | uniq -c | expand | sort -b +0 -1nr \
        | compute-freqs | sed -e 's/^ //g' \
        > ${file}-kn.frq
      dicio-wc ${file}-kn.frq
    end

     lines file
    ------ ------------
         5 hea-f-kn.frq
         6 heb-f-kn.frq
         6 bio-f-kn.frq
         4 vdp-z-kn.frq

    multicol {hea-f,heb-f,bio-f,vdp-z}-kn.frq

  hea-f                     heb-f                     bio-f                     vdp-z
  ---------------------     -----------------------   -----------------------   -------------------
  10849 0.819 x              3387 0.694 x              5527 0.640 x             33290 0.744 x
   2149 0.162 x_x            1034 0.212 x_x            1966 0.228 x_x            8124 0.181 x_x
    229 0.017 x_x_x           416 0.085 x_x_x          1038 0.120 x_x_x          3077 0.069 x_x_x
     20 0.002 x_x_x_x          38 0.008 x_x_x_x          99 0.011 x_x_x_x         280 0.006 x_x_x_x
      5 0.000 x_x_x_x_x         5 0.001 x_x_x_x_x         1 0.000 x_x_x_x_x
                                2 0.000 x_x_x_x_x_x       1 0.000 x_x_x_x_x_x

  So strings of 3 consecutive "K" elements are relatively common,
  strings of 4 are rare, and no word that occurs twice has 5 or more
  "K"s in a row.

  Recall that about 50% of the "O" slots are empty, and about 50%
  consist of one letter only.
  If the "O" slots were filled or empty at random, then we would
  expect the following statistics:

    0.500 x
    0.250 x_x
    0.125 x_x_x
    0.063 x_x_x_x
    0.031 x_x_x_x_x
    0.015 x_x_x_x_x_x
    0.007 x_x_x_x_x_x_x

  So the statistics above suggest that in language A the distribution
  of "O"s is more uniform than would be expected from chance. (The
  case is not clear because the presence of short words would bias the
  statistics towards entries with few consecutive "K"s.)

  Note the significant difference in K-repeat frequencies for language
  A and language B. The frequencies for language B are closer to the
  "random" model.

Analysis of "K" and "O" statistics
----------------------------------

  What can we conclude from these numbers? Let's consider the
  alternatives:

  (1) The EVA letters { a o y } are different Voynichese letters.

      This theory does not look very promising: if they were
      different letters, they should belong to the same class (vowel,
      consonant, whatever); but then we would expect to see a fair
      number of diphthongs (double-letter "O" strings), which we
      don't see.

  (2) The EVA letters { a o y } are the same Voynichese letter.

      This theory could explain why there are so few double-letter
      "O" slots: namely, because the Voynichese letter "o/a/y" cannot
      occur twice in a row (a common restriction in natural
      languages).

  (3) Each "O" string is a modifier of (i.e. a part of) the next "K"
      element; except for the final "O" string, which stands on its
      own.

  (4) Each "O" string is a modifier of the preceding "K" element;
      except for the initial "O" string, which stands on its own.

  (5) Some "K" elements may admit "O" letters as post-modifiers, some
      may admit them as pre-modifiers, some may admit both.

      After a quick look, I would guess that

        { sh ch ee she che eee }
          admit "a/o/y" only as post-modifiers;

        { r l m n ir iir in iin iiin }
          admit "a/o/y" only as pre-modifiers;

        { s d k t cth ckh ke te cthe ckhe }
          admit "a/o/y" in both positions.

      But this hunch needs to be confirmed...

  (6) None of the above.
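
  The two "random mixing" baselines quoted in the statistics sections
  (67% / 22% / 7% / 2% for the length of an "O" slot, and 0.500, 0.250,
  0.125, ... for runs of consecutive "K" elements) are both geometric
  distributions. Here is a minimal sketch of that arithmetic; the
  function name `run_length_model` is hypothetical, and the only inputs
  are the two ratios stated in the text (K:O about 2:1, and about half
  of the "O" slots empty).

```python
def run_length_model(p_more, first=0, terms=4):
    """P(length = first + k) = (1 - p_more) * p_more**k, for k = 0 .. terms-1.

    p_more is the probability that the run is extended by one more item.
    """
    return {first + k: (1 - p_more) * p_more ** k for k in range(terms)}


# "O" slots: with K:O = 2:1 in a random mixture, each successive token is
# an "O" letter with probability 1/3, so slot lengths start at 0.
o_slot = run_length_model(1 / 3, first=0, terms=4)

# "K" runs: each "O" slot is empty with probability 1/2, and every empty
# slot extends the run by one more "K", so run lengths start at 1.
k_run = run_length_model(1 / 2, first=1, terms=7)

for length, p in o_slot.items():
    print(f"O slot of {length} letters: {p:.3f}")
for length, p in k_run.items():
    print(f"run of {length} K elements: {p:.4f}")
```

  Comparing these values with the -ooo.frq and -kn.frq tables shows the
  same contrast as in the text: observed "O" slots are almost never
  longer than one letter, and long "K" runs are rarer than the random
  model predicts, especially in language A.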

Appendix: A more flexible factoring script
------------------------------------------

  The logic of factor-OK.sed was rewritten in AWK as a
  "factor-field-OK" script that allows one to factor a selected field
  of a multifield file.

  Checking consistency of the two scripts:

    foreach file ( hea-u hea-f heb-f bio-f vdp-z )
      echo ${file}-old
      cat ${file}.wds \
        | sed -f factor-OK.sed \
        > .${file}-old.fac
      echo ${file}-new
      cat ${file}.wds \
        | factor-field-OK \
        | gawk '/./{ print $1; }' \
        > .${file}-new.fac
      dicio-wc .${file}-{old,new}.fac
      diff .${file}-{old,new}.fac
      /bin/rm .${file}-{old,new}.fac
    end

  The differences are confined to words that factor-OK can't parse.
  The new script will forcibly factor those words into elements {i+X},
  {X[eh]}, or {X} where {X} is a character other than [aoy].

Word and line breaks in the OKOKOKO model
-----------------------------------------

  [1999-02-01]

  It is instructive to analyze the immediate contexts of definite word
  spaces (std), breaks due to figures in the text (fig),
  intra-paragraph line breaks (lin), and intra-word letter pairs
  (non), in terms of the K/E/O classification.

  For this study we will use the majority-vote transcription, which
  includes Takeshi's new full transcription.

  For simplicity, let's discard all data containing weirdos, extra
  plumes, unreadable characters, or the rare letters [buvxz]. Let's
  also map the upper case EVA letters [SCIKTPF] to their lower case
  variants, since the capitalization carries no information in those
  cases.

    cat ../045/only-m.evt \
      | egrep -e '^<[^<>]*;A>' \
      | tr 'SCIKTPF' 'sciktpf' \
      | tr -d '\!' \
      | sed \
          -e 's/^<[^<>]*> *//g' \
          -e 's/[{][^{}]*[}]//g' \
          -e 's/[&][0-9*?][0-9*?]*[;]\?/*/g' \
          -e 's/[buxvz]/*/g' \
          -e 's/[.,]*-[-.,]*/-/g' \
          -e 's/[,]*[.][,.]*/./g' \
          -e 's/[,][,]*/,/g' \
          -e 's/.['"'"'"]/?/g' \
          -e 's/[^-,./= ]*[%?*][^-,./= ]*/?/g' \
      > base.txt

  Let's reduce the alphabet to letter classes as follows:

    O = [aoy]
    I = [i]+
    Q = [q]
    E = unattached [eh]
    R = [djmg] and [rlsn]
    X = <ee>, <ch>, <sh>, <ih>, [ci][ktpf][h], [c][ktpf], [ktpf]

  The following hack should do it:

    cat base.txt \
      | sed \
          -e 's/ee/X/g' \
          -e 's/[csi][h]/X/g' \
          -e 's/[ci][ktpf][h]/X/g' \
          -e 's/[c][ktpf]/X/g' \
          -e 's/[ktpf]/X/g' \
          -e 's/[rlsn]/R/g' \
          -e 's/[mdgj]/R/g' \
          -e 's/[aoy]/O/g' \
          -e 's/[q]/Q/g' \
          -e 's/[i][i]*/I/g' \
          -e 's/[ceh]/E/g' \
      > base.clt

    egrep '[^-.,=/?XEQROI]' base.clt > .bugs
    head -10 .bugs

  Now let's count the pairs:

    cat base.clt \
      | sed \
          -e 's/-\(.\)-/-\1\1-/g' \
          -e 's/\(.\)-\(.\)/@\1-\2@/g' \
      | tr '@' '\012' \
      | egrep -e '^.-.$' \
      | egrep -v -e '[?%*]' \
      | sort | uniq -c | expand \
      | sort +0 -1nr \
      | compute-freqs \
      > fig.breaks

    cat base.clt \
      | sed \
          -e 's/[.]\(.\)[.]/.\1\1./g' \
          -e 's/\(.\)[.]\(.\)/@\1.\2@/g' \
      | tr '@' '\012' \
      | egrep -e '^.[.].$' \
      | egrep -v -e '[?%*]' \
      | sort | uniq -c | expand \
      | sort +0 -1nr \
      | compute-freqs \
      > std.breaks

    cat base.clt \
      | sed \
          -e 's/^[-=., ]*\([^-=., ]\)[-=., ]*$/\1\1/g' \
          -e 's/^[-=., ]*\([^-=., ]\)/\1@/g' \
          -e 's/\([^-=., ]\)[-., ]*$/@\1\//g' \
      | tr -d '\012' \
      | tr '@' '\012' \
      | egrep -e '^.[/].$' \
      | egrep -v -e '[?%*]' \
      | sort | uniq -c | expand \
      | sort +0 -1nr \
      | compute-freqs \
      > lin.breaks

    cat base.clt \
      | sed \
          -e 's/\([^-=., ]\)/\1@\1\!/g' \
          -e 's/[\!:]*[-=:., ][-\!=:., ]*/@/g' \
      | tr '@' '\012' \
      | egrep -e '^.\!.$' \
      | egrep -v -e '[?%*]' \
      | sort | uniq -c | expand \
      | sort +0 -1nr \
      | compute-freqs \
      > non.breaks

    multicol -v titles="non std fig lin" {non,std,fig,lin}.breaks

  Here is the result:

  non                  std                  fig                  lin
  ------------------   ------------------   ------------------   ------------------
  18340 0.1577 O!R      6363 0.2281 R.X       168 0.2205 R-O       636 0.2048 R/O
  16712 0.1437 X!O      5797 0.2078 R.O       134 0.1759 O-O       594 0.1913 R/R
  14897 0.1281 R!O      3703 0.1328 O.X       107 0.1404 O-R       415 0.1337 O/R
  12804 0.1101 O!X      3287 0.1178 O.Q       107 0.1404 R-X       377 0.1214 O/O
   9634 0.0829 X!E      2786 0.0999 O.O       100 0.1312 O-X       334 0.1076 R/X
   8495 0.0731 X!X      2748 0.0985 O.R        98 0.1286 R-R       262 0.0844 R/Q
   6322 0.0544 O!I      1882 0.0675 R.R        16 0.0210 O-Q       248 0.0799 O/X
   6298 0.0542 I!R      1068 0.0383 R.Q         9 0.0118 X-X       199 0.0641 O/Q
   5160 0.0444 E!O        83 0.0030 X.X         9 0.0118 R-Q        18 0.0058 X/O
   5048 0.0434 Q!O        57 0.0020 X.O         6 0.0079 X-R         5 0.0016 X/R
   3792 0.0326 E!R        22 0.0008 R.E         4 0.0052 X-O         5 0.0016 O/E
   3071 0.0264 R!X        18 0.0006 E.O         3 0.0039 O-E         2 0.0006 X/X
   2866 0.0247 X!R        18 0.0006 X.R         1 0.0013 E-R         2 0.0006 X/Q
    939 0.0081 E!X        17 0.0006 E.X                              2 0.0006 R/E
    905 0.0078 R!R        11 0.0004 O.E                              1 0.0003 E/O
    531 0.0046 O!O        10 0.0004 X.Q                              1 0.0003 E/R
    152 0.0013 O!E         9 0.0003 E.R                              1 0.0003 I/O
     93 0.0008 R!E         5 0.0002 E.Q                              1 0.0003 I/Q
     52 0.0004 Q!X         2 0.0001 I.O                              1 0.0003 X/E
     37 0.0003 I!X         2 0.0001 I.R                              1 0.0003 Q/R
     33 0.0003 Q!E         2 0.0001 R.I
     28 0.0002 O!Q         1 0.0000 I.X
     15 0.0001 R!I         1 0.0000 X.E
     13 0.0001 X!I
     11 0.0001 I!O
      4 0.0000 E!E
      4 0.0000 Q!R
      3 0.0000 R!Q
      1 0.0000 E!I

  So we can say that

  (1) Line breaks occur almost only between { R O } and { R O X Q }
      (with frequencies ranging from 6% to 20% of all line breaks);
      rarely between X and { R O X Q } (less than 0.9% of all line
      breaks); and essentially NEVER after { Q I E } or before { I E }
      (about 0.4% of all line breaks).
  (2) Ordinary word breaks follow the same pattern: the pairs between
      { R O } and { R O X Q } have frequencies between 3.8% and 23%;
      pairs between X and { R O X Q } have total frequency of 0.6%;
      and all the remaining pairs account for only 0.3% of the word
      breaks.
(3) Figure breaks too follow almost the same pattern: the pairs
    between { R O } and { X R O } have frequencies ranging from 12%
    to 22%, but the pairs R-Q and O-Q are much rarer than their
    counterparts at line breaks and ordinary spaces, about 1-2%
    each. Breaks between X and { R O X Q } are slightly more common
    (2.5% total), and all other pairs are almost absent (about 0.5%).

(4) The relative frequencies of { R O X Q } are approximately
    2:2:1:1 after a line break, and 6:5:5:1 after a figure break,
    roughly independently of the character before the break.

(5) The relative frequencies of { R O X Q } after ordinary breaks
    seem to depend on the preceding letter, but they are still of
    the same order of magnitude.

(6) Inside words, the valid pairs are
    { QO, OX, OI, IX, XX, XE, XO, EX, EO }, with frequencies ranging
    from 4.1% to 27%. These figures agree with the table when R is
    counted together with X (for example, OX = O!X + O!R =
    11.0% + 15.8%). The remaining pairs have much lower frequencies
    (OO accounts for 0.46% of all pairs, and OE for only 0.13%).

These observations seem to imply that the "word spaces", line breaks,
and figure breaks are fairly similar to each other and very distinct
from random inter-character pairs. Their similarity, and the relative
independence of the letter after the break from the letter before it,
strongly suggest that they are indeed word boundaries. In that case we
conclude that Voynichese words may end in O or R (40-45% and 50-60%,
respectively) or rarely X; and may begin only with X, O, R, or Q.

(A more detailed analysis would show that the O at the end of words is
almost always <y>. Also, the last R in a line is most often EVA <m>,
which is only rarely seen at the other kinds of word breaks.)

Point (3) shows that figure breaks are more like word breaks than
random inter-character breaks. The depressed frequencies of R-Q and
O-Q call for an explanation, though.

Point (6) is a partial restatement of the QOKOKOKO paradigm. Note that
the pairs { QO OI IX XE EX EO }, which are fairly common inside words,
are not legal places for word spaces, line breaks, or figure breaks.

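
For readers who want to try tallies like these on a small sample,
here is a minimal sketch in POSIX sh. The names classify and
break_pairs are made up for this illustration and are not among the
scripts used for this note. classify applies the same substitutions,
in the same order, as the sed hack above. break_pairs is an assumed
awk reformulation of the pair counting: it looks at the two characters
adjacent to each break, so it needs no doubling trick for one-letter
words, and it prints raw counts only (no relative frequencies).

```shell
#!/bin/sh

# Reduce EVA text on stdin to the letter classes Q O I X R E
# (same substitutions and order as the sed hack in this note).
classify() {
  sed \
    -e 's/ee/X/g' \
    -e 's/[csi][h]/X/g' \
    -e 's/[ci][ktpf][h]/X/g' \
    -e 's/[c][ktpf]/X/g' \
    -e 's/[ktpf]/X/g' \
    -e 's/[rlsn]/R/g' \
    -e 's/[mdgj]/R/g' \
    -e 's/[aoy]/O/g' \
    -e 's/[q]/Q/g' \
    -e 's/[i][i]*/I/g' \
    -e 's/[ceh]/E/g'
}

# Tally the classes on either side of one break character
# (first argument, default "."), reading class-reduced text on stdin.
# Prints "count pair" lines, most frequent first.
# Pairs touching anything other than Q O I X R E are skipped.
break_pairs() {
  awk -v brk="${1:-.}" '
    {
      n = length($0)
      for (i = 2; i < n; i++)
        if (substr($0, i, 1) == brk) {
          a = substr($0, i - 1, 1)
          b = substr($0, i + 1, 1)
          if (index("QOIXRE", a) && index("QOIXRE", b))
            count[a brk b]++
        }
    }
    END { for (p in count) print count[p], p }
  ' | sort -k1,1nr -k2,2
}

# Example on a made-up one-line sample: tally the ordinary word breaks.
printf '%s\n' 'qokeedy.daiin.shol-chedy' | classify | break_pairs .
```

On this sample the class-reduced text is QOXXRO.ROIR.XOR-XERO, and the
tally has one O.R pair and one R.X pair. Calling break_pairs with "-"
instead of "." tallies the figure breaks in the same way.
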
These data do not shed much light on whether each O and E is attached
to the preceding X, the following X, or sometimes both, or neither.
Unfortunately there are (practically) no figure breaks adjacent to
an E.

Word pattern frequencies
------------------------

It is also instructive to analyze the frequency of each word pattern,
that is, the result of collapsing the letters of a word into the
classes { Q O X I R E } or { Q O K } defined above.

First, the { Q O X I R E } patterns:

    cat base.clt \
      | tr '., =/-' '\012\012\012\012\012\012' \
      | egrep '.' \
      | egrep -v '[?%*]' \
      | sort | uniq -c | expand \
      | sort -k1,1nr \
      > QOIXER.frq

The result is a long-tailed distribution that begins

     freq pattern
    ----- ----------
     1832 XOR
     1725 OR
     1649 ROIR
     1413 ROR
     1209 OXOR
     1084 XERO
      940 XXO
      903 XEO
      817 OXOIR
      786 QOXOR
      745 XEOR
      718 OIR
      716 XO
      703 OXXO
      660 QOXOIR
      560 QOXXO
      487 R
      480 QOXXRO
      404 RO
      382 OXXRO
      379 XXOR
      376 QOXERO
      375 OXO
      372 XOIR
      370 OXERO
      325 OROR
      316 OXEOR
      312 OXXOR
      309 XXRO
      307 OROIR
      ... ...

Let's now collapse the elements { X XE R IR } to a single class K, and
absorb the Q into the following O:

    cat base.clt \
      | tr '., =/-' '\012\012\012\012\012\012' \
      | egrep '.' \
      | egrep -v '[?%*]' \
      | sed \
          -e 's/XE/K/g' \
          -e 's/X/K/g' \
          -e 's/IR/K/g' \
          -e 's/R/K/g' \
          -e 's/QO/O/g' \
      | sort | uniq -c | expand \
      | sort -k1,1nr \
      > QOK.frq

The result is still a relatively long-tailed distribution:

     freq pattern
    ----- ----------
     6061 KOK
     4690 OKOK
     3075 KKO
     2704 OK
     2646 OKKO
     2023 KO
     1531 OKKKO
     1365 OKO
     1346 KKOK
     1236 KKKO
     1052 KOKO
      951 OKKOK
      861 KOKOK
      578 K
      561 OKOKO
      374 KOKKO
      324 KK
      309 OKOKOK
      265 O
      233 KKKOK
      219 KKOKO
      202 KKKKO
      189 KKK
      177 OKKOKO
      175 OKKK
      169 OKK
      160 OOK
      152 OKOKKO
      142 KOKKOK
      139 OKKKKO
      ... ...

Conversely, we can analyze the patterns of X and R, ignoring the
{ Q E I O } complements:

    cat base.clt \
      | tr '., =/-' '\012\012\012\012\012\012' \
      | egrep '.' \
      | egrep -v '[?%*]' \
      | tr -d 'QEIO' \
      | sort | uniq -c | expand \
      | sort -k1,1nr \
      > XR.frq

The result is still a fairly broad distribution. (The entry with an
empty pattern counts the words that contain no X or R at all.)

     freq pattern
    ----- ----------
    10441 XR
     4319 RR
     4006 R
     3768 XXR
     3682 XX
     2999 X
     1461 XRR
     1279 RXR
      538 RX
      480 XXX
      463 RRR
      409 XXRR
      366 RXXR
      346 RXX
      302
      230 XXXR
      151 XRXR
      132 XRX
      116 XRRR
       90 RXRR
       89 RRXR
       59 RRRR
       56 XXXX
       50 RRX
       31 XXRX
       30 XXRRR
       24 XRXX
       23 RXXRR
       23 RXXX
       22 XRXXR
      ... ...