In 1997 the French linguist Jacques Guy posted a number of translations of papers written by the Russian B.V. Sukhotin to the Voynich MS mailing list
(1).
This page presents these contributions from Jacques Guy without any significant changes. Jacques included his own comments in these articles surrounded by:
(* *)
Here, they will eventually be further highlighted to clearly distinguish them from the translated text. One paper is omitted, because it was also published in Cryptologia, and the copyrights belong to this journal
(2).
I was about to throw away a boxful of old 5.25-inch disks, and useless curiosity impelled me to check what was on them. Some were unreadable to MS-DOS, which could have meant that they were corrupt, or that they were antiques from the days of my Kaypro II. And they were. One held a number of files suh*.* , which turned out to contain commented translations of most of Sukhotin's articles. I had completely forgotten that I had set about translating his works and done so much! When did I move from a Kaypro II to an NEC APC III? It must have been about 14 years ago. Here is the first file, original typing mistakes included.
by B.V. Sukhotin
translated and adapted by Jacques Guy
"Our knowledge Of language X" is twofold.
It is the knowledge we use to translate the signs of X into their respective referents in the physical and ideal worlds.
It is also the knowledge we use to identify some properties of the signs of X without resorting to their referents. It is for instance the knowledge of English by which, presented with the sentence "I was charded by its unflemming thurn", we identify "charded" as the past participle of "to chard", as the opposite of "flemming", itself derived from "to flem", and "thurn" as a substantive. It is also the knowledge by which we recognize three syllables and ten letters in "unflemming". In general terms, it is the knowledge by which the phenomena of X are identified.
Our knowledge of language X in either acceptation, however, is useless for dealing with language Y, unless X and Y have at least some features in common. Thus a knowledge of Dutch is useful to understand Hawaiian only insofar as they have some, if few, features in common: both use the Roman alphabet, and common letters are pronounced rather alike. On the other hand, a knowledge of English is utterly useless for Chinese: armed with it we can tell what "goes on" in an English sentence, but nothing in a Chinese sentence, not even where words end and start (there are no spaces to indicate breaks between words), not even if it is language at all.
An algorithm which identifies a linguistic phenomenon provides a convenient definition of that phenomenon in the general form:
"Phenomenon P is the one identified by algorithm A when applied to any text in any language".
We shall call such an algorithm a "decipherment algorithm".
Definitions of linguistic phenomena by their decipherment algorithms have very desirable properties:
The importance of decipherment algorithms for linguistic theory, then, can hardly be overemphasized - if they exist.
If decipherment algorithms do exist, some of their properties can be inferred:
If an algorithm B needs information produced by another algorithm A, it is dependent upon A, and the phenomenon P(B) which it defines cannot be identified until the phenomenon P(A) defined by A is identified.
For a set of decipherment algorithms to be efficacious, its elements must be efficacious.
If an algorithm A depends for all its information upon another algorithm B, which in turn needs all its information from A, then the phenomena which they define cannot be identified. Therefore A and B are not efficacious and must be rejected.
For a set of decipherment algorithms to be efficacious, it must be at least effective, and therefore finite.
Therefore there must exist at least one algorithm in the set which needs no input from any other.
One can imagine the following hierarchy of decipherment algorithms in an efficacious set applying to the written word:
and so on.
Although the feasibility of algorithm (1) cannot be doubted, none has been proposed to date.
The feasibility of algorithms (3) to (5), on the other hand, can be doubted, and this is why some algorithms will be elaborated and discussed here, which should go some way towards dispelling doubts about the feasibility of high order decipherment algorithms.
It has been seen that the definition of a linguistic phenomenon and the decipherment algorithm by which it is identified are one and the same.
The list of the distinctive features of a linguistic phenomenon (that is, the properties of this phenomenon which enable us to identify its occurrences in a text) is also a definition of this phenomenon. But whereas the decipherment algorithm that identifies a given phenomenon is not necessarily unique, the list of its distinctive features is. In this respect, a definition of a linguistic phenomenon consisting of the list of its distinctive features is more general than one consisting of its decipherment algorithm (or one of its decipherment algorithms). Lists of distinctive features, however, generally do not have the desirable properties of decipherment algorithms, least of all efficacy.
There are two kinds of distinctive features: text-dependent and text-independent.
A list of text-independent features specifies a set of possible interpretations of a text, ANY text. Those possible interpretations constitute the set of ACCEPTABLE SOLUTIONS to the problem of understanding the text.
Text-independent features are therefore definable without computation on any text.
How well each of those acceptable solutions fits any particular text is determined by properties of this text, that is, by text-dependent features. "How well" is then an objective function on the set of acceptable solutions.
The set of acceptable solutions and the objective function must be defined so that the phenomenon which they identify should correspond to an element of the set of acceptable solutions for which the objective function is minimum or maximum. The objective function is computed from the text analyzed.
Definitions of linguistic phenomena by distinctive features therefore, at this stage, comprise two parts:
The two parts must provide enough information for devising a procedure which identifies the acceptable solution(s) for which the objective function reaches an extreme. This identification procedure is the decipherment algorithm which constitutes the third part of the definition of the phenomenon.
Given a set of acceptable solutions and an objective function, a large number of algorithms may be devised which find an acceptable solution for which the objective function reaches an extreme. Such algorithms are MATHEMATICALLY EQUIVALENT.
Two objective functions that reach the same extreme for the same element of a given set of acceptable solutions are LINGUISTICALLY EQUIVALENT.
A definition of a linguistic phenomenon is MATHEMATICALLY CORRECT if its decipherment algorithm yields the absolute extreme for its objective function calculated for its set of acceptable solutions.
A definition of a linguistic phenomenon is LINGUISTICALLY CORRECT if its set of acceptable solutions and its objective function are such that the (intuitively) right phenomenon is effectively identified.
Those distinctions provide a basis for collaboration between mathematicians and linguists.
It must be emphasized that almost all the algorithms to which we have had access are mathematically incorrect: they cannot be guaranteed to find the extremes of objective functions.
The results obtained are not linguistically perfect, which should be expected given the mathematical imperfection of the algorithms, even though some are of surprisingly high quality.
Translation and comments by Jacques Guy.
In a number of writing systems a single symbol corresponds, not to a single sound, but to a string of sounds (e.g. Linear B and Japanese).
The following algorithm is designed for such writing systems, where a single symbol may represent a consonant followed by a vowel, or a vowel alone, or a consonant alone (it also has interesting morphological applications, which shall be discussed later).
The idea behind the algorithm is as follows:
Let S be the set of the symbols of the syllabary: S = { s[i] }, e.g. S = { pa, pe, pi, po, pu, ta, te, ...}. Two equivalence relations in S, E1 and E2, must be found. E1 is the relation of symbols containing the same vowel (e.g. pa, ta, ka, da, ... ), E2 the relation of symbols containing the same consonant (e.g. da, de, di, do, ... ).
Let the equivalence classes of E1 be represented by [a1], [a2], ..., [an], and those of E2 by [b1], [b2], ..., [bn]. An ordered pair ([bi],[aj]) can now be associated with each symbol s[k] of the syllabary S, and the set of the symbols [a1], [a2], ..., [an], [b1], [b2], ..., [bn] can be considered to represent a phonemic alphabet A corresponding to the syllabary S. The alphabet A may contain symbols corresponding to no consonant (e.g. a) or no vowel (e.g. p).
E1 and E2 cannot be separately constructed; the pair E1, E2 must have a certain property. Any pair of relations having this property constitutes an acceptable solution.
If a relation E1 has m classes, no class of the other relation E2 can have more than m elements.
Suppose that E1 had only two classes: the class of symbols having "a" for vowel and the class of symbols having "i". Consider the class of the symbols of E2 having "p" for consonant: it can contain only symbols corresponding to "pa" and "pi".
Note that a new assumption has just been made: that the syllabary contains no homophonous symbols. Consequently, if two symbols belong to the same class of E1, they necessarily belong to different classes of E2, and vice versa.
Consider the matrix T0 = f(s[i],s[j]) of the absolute frequencies of occurrence of ordered pairs (s[i],s[j]). Row i shows which symbols of S occur following s[i], and in what proportions. Suppose that s[i] has been deciphered as ([bk], [al]). It is reasonable to expect the contents of row i to depend more on the occurrences of [al] than on those of [bk]; rows corresponding to symbols representing strings of sounds ending with [al] should therefore be similar. Conversely, columns which correspond to strings of sounds starting with the same consonant must be similar.
The similarity, or rather, dissimilarity, of rows and columns can be estimated very naturally by the distance between points in an n dimensional space, leading to the following formulas for rows (d1) and columns (d2):
d1(s[i],s[j]) = Sum[k=1..n] Abs( f(s[i],s[k]) - f(s[j],s[k]) )    (1)

d2(s[i],s[j]) = Sum[k=1..n] Abs( f(s[k],s[i]) - f(s[k],s[j]) )    (2)
Let T1 and T2 be the matrices of distances thus calculated, i.e. T1 = d1(s[i],s[j]) , T2 = d2(s[i],s[j]).
Under our hypothesis the distance between symbols of the same class must be small. Therefore, for a classification (i.e. an ordered pair of equivalence relations in the syllabary S) of syllabic symbols to be "good", the sums of the main diagonals of T1 and T2 must not be too high. Call those sums Sum1 and Sum2. Now, for any "good" classification even the worse element of the pair cannot be too bad. This observation leads to the following estimator of the quality of a classification (E1,E2):
Q(E1,E2) = max(Sum1,Sum2) (3)
Formula (3) suggests that the best classification (E1,E2) is the acceptable solution for which Q(E1,E2) is minimum.
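As a concrete illustration, here is a minimal Python sketch of formulas (1)-(3). It assumes the bigram frequencies of the text are held in a dictionary f keyed by ordered pairs of symbols, and that a candidate classification is a pair of lists of classes (E1 grouping symbols by presumed vowel, E2 by presumed consonant); reading "the sums of the main diagonals" as the sums of within-class distances is an interpretation, and all names are illustrative rather than Sukhotin's.

    from itertools import product

    def distance_matrices(symbols, f):
        """T1 and T2 of formulas (1) and (2): distances between rows and
        between columns of the bigram frequency matrix f[(si, sj)]."""
        d1, d2 = {}, {}
        for si, sj in product(symbols, repeat=2):
            d1[si, sj] = sum(abs(f.get((si, sk), 0) - f.get((sj, sk), 0)) for sk in symbols)
            d2[si, sj] = sum(abs(f.get((sk, si), 0) - f.get((sk, sj), 0)) for sk in symbols)
        return d1, d2

    def quality(E1, E2, d1, d2):
        """Q(E1, E2) = max(Sum1, Sum2), formula (3): Sum1 adds the row distances
        between symbols placed in the same vowel class, Sum2 the column
        distances between symbols placed in the same consonant class."""
        sum1 = sum(d1[a, b] for cls in E1 for a in cls for b in cls if a != b)
        sum2 = sum(d2[a, b] for cls in E2 for a in cls for b in cls if a != b)
        return max(sum1, sum2)

The best classification would then be the acceptable solution for which quality() is smallest.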
Let us now turn to another property of acceptable solutions.
If a syllabary consisted of all possible combinations of a consonant followed by a vowel, its cardinality would be the product of the cardinalities of El and E2. Since syllabaries rarely consist of all possible combinations this property is almost never encountered in texts.
A consequence of this situation is that the product of the cardinalities of the best classification less the cardinality of the syllabary is equal to the number of "non occurring" symbols. Obviously, the smaller the number of such symbols, the better the classification. Therefore card(E1)·card(E2) - card(S), or, which is equivalent, card(E1)+card(E2), must be minimum.
(* The description of the decipherment algorithm itself is so obscure as to be completely incomprehensible, at least in the French translation. Nevertheless, from the two criteria for the specification of the objective function above, one can infer that it aims at minimizing both Q(E1,E2) and card(El)+card(E2) using what seems to be linear programming. Sukhotin concludes: *)
It should be noted that this is a rather complex algorithm. It was tested manually on restricted abstract examples but was not tried on a computer.
The algorithm has an interesting morphological interpretation.
Each word of a text can be considered to belong to one class of some "grammatical classification" and to one class of some "lexical classification". Membership in a grammatical class is sometimes, but not always nor necessarily, marked by some grammatical affix. When it is not, a zero affix may be posited. Membership in a lexical class is marked by the presence of a root. Note that the presence of a root is not obligatory either (it can be doubtful, as in Latin trans-i-re). This situation is parallel to that found in syllabaries, each syllabic symbol belonging to one class of the symbols having the same vowel and to one class of symbols having the same consonant. We could say, by analogy, that syllabic symbols are inflected, vowels corresponding to flections, consonants to roots, and that a class of E2 is the paradigm of its consonant.
Matrix T0 is then the matrix of conditional probabilities of the occurrence of a word v[j] in a sentence containing word v[i]. Words which are lexically similar should all be expected to condition other words in the same ways and words which are grammatically similar to be conditioned in the same ways by other words.
The interest of such a morphological analysis model lies in the fact that it allows the discovery of zero morphemes, which the algorithm for partitioning a text into morphemes could not do. This observation suggests the following strategy in deciphering a text:
First, submit it to the partitioning algorithm. Non-zero morphemes will thus be identified.
Second, submit the partitioned text to this algorithm, to determine the classes to which they belong, in order to identify zero morphemes.
Thirdly, submit the result to an algorithm which determines which combinations of morphemes (other than zero) constitute words.
Finally, submit the result to this algorithm again, to obtain the final morphological classification.
It must be acknowledged, however, that this algorithm seems to be far too expensive computationally to be practical for long texts, and would have to be revised.
by B. V. Sukhotin
translation and comments by Jacques Guy
(* The algorithm effectively partitions a text into its morph components, not morphemes, but it will turn out to be far more powerful than that: it can be expected to do a morphological analysis of an unknown text. "Morph" should be read for "morpheme" whenever the latter occurs in the text *)
We will assume that the text to be partitioned has no spaces to separate words and we also assume that the spelling used is phonemic.
(* The first assumption makes the problem more general and somewhat more difficult. There is no need for the other: the algorithm will apply to strings of ideograms as well as to strings of phonemes equally successfully *)
As in the case of the consonant/vowel algorithm, we start by defining the set of acceptable solutions.
A partition of a text into disjoint subsets is an acceptable solution.
A few definitions need to be introduced before going further.
A text is a mapping of an alphabet A (a set of letters) onto a set of locations L. In other words, a text is a set of 2 tuples of the type (a,p) where a is a letter and p its position in the text.
A string Str(s,t) (where s and t are integer numbers and s>=0) is a set of 2 tuples of the type (a[i],p[j]) in which s < p[j] <= t. A text T is a string Str(0,p), p being its length.
(* The notation is somewhat unclear; a[i] is presumably the ith letter of the alphabet, p[j] the position in T of the jth letter of Str(s,t). That definition of a string allows for disconnected strings (eg. "dfton" is a string of the text "definition": {('d',1), ('f',3), ('t',7), ('o',9), ('n',10)} ). Whether Sukhotin actually intended it to be so is unclear. The definition of a text as a string Str(0,n) of length n is counterintuitive; Str(1,n) would seem more appropriate. It does allow for a null text Str(0,0), yet there is no point in partitioning a null text *)
A partition P of a text Str(0,n) is a set of strings {Str[i](s,t)} such that to each string Str[i](s,t) for which t <> n one can associate a string Str[i+1](t,v). Str(s,t) and Str(v,w) have a non-empty intersection if s < v < t < w or if v < s < w < t. (Since strings are sets, we can use the terminology of set theory and say that a text is included in another or that two texts have a non-empty intersection.)
Suppose that some relation of similarity on the set of strings of text T has been defined.
A morpheme m is the set which contains some string Str(s,t) of T and all the strings of T similar to Str(s,t).
Let M be the set of the morphemes of T. Given an arbitrary partition P of T, some strings of m are also strings of P, and some are not. We say that the strings of m which are also strings of P are true occurrences of the morpheme m relatively to P, and those which are not are false occurrences of m.
If the partition P does not contradict our intuition we can say that those occurrences are true or false absolutely rather than relatively to a partition.
For instance, "pint" is a true occurrence of the morpheme "pint in 1) below, but it is not in 2), and "ilk" is a false occurrence of the morphem "ilk" in 1):
A morpheme is true if there is at least one absolutely true occurrence of it in T.
A definition of "acceptable solution" now emerges: an acceptable solution consists of:
Let us now turn to the problem of defining a similarity relation.
A possible definition is:
Two strings Str(s,t) and Str(v,w) are similar when one can
be obtained from the other by translation, that is, if
Str(s,t) = { (a[x],p[j]), (a[y],p[j+1]), ... } and
Str(v,w) = { (a[x],p[j+c]), (a[y],p[j+1+c]), ... } in which
c is an integer constant.
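To make the notation concrete: if the text is held as an ordinary (0-based) Python string, the connected string Str(s,t) corresponds to the slice text[s:t], and similarity by translation reduces to equality of slices of the same length. A minimal sketch, with illustrative names:

    def similar(text: str, s: int, t: int, v: int, w: int) -> bool:
        """Str(s,t) and Str(v,w) are similar if one is the translation of the
        other by some constant c = v - s: same length, same letters."""
        return (t - s) == (w - v) and text[s:t] == text[v:w]

    # e.g. in "innumerabilis" the two adjacent occurrences of "n",
    # Str(1,2) and Str(2,3), are similar with c = 1.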
A morpheme m is then conveniently represented by the string Strm(0,n) of length n, similar to any string of m and consisting of 2 tuples (a[x],p[y]) in which 0 < p[y] <= n. A morpheme m[i] alphabetically contains another m[j] if the string Strm[i] (* which is a set *) contains the string Strm[j].
Given the notion of similarity as introduced above, a partition uniquely defines a set of morphemes and the set of acceptable solutions is therefore the set of partitions.
Having defined the set of acceptable solutions, we still have to find an adequate objective function, then an optimal solution. But we are now confronted with an interesting phenomenon. Consider the following examples:
T:  innumerabilis
P1: |in|numer|abil|is|
P2: |in|numerabil|is|
P3: |in|numerabilis|
It is apparent that the three partitions are all, in a way, "right"; P1 merely seems finer than P2, and P2 finer than P3. Nor can this partition of a longer text be considered wrong:
P: |innumerabilis|annorum|series|et|fuga|temporum|
This is all the more remarkable in that the strings which compose P2, P3, and P are linguistically meaningful objects: they are roots, words, or more simply, combinations of morphemes which behave like isolated morphemes. This is the reason why partition P, for instance, is not merely an approximation of P1; it has its own significance because the strings which it evidences are words. It is obvious too that P is more meaningful than P4 below, even though P4 is also composed of strings which are combinations of simple morphemes:
P4: |innumer|abilisannor|umserie|set|fugatempor|um|
We obtain thus a model of morphological analysis consisting of a series of coarser and coarser partitions, each one of which is right in its own way, and can be considered to be the approximation of other, finer, ones.
We say that Pi is coarser than Pj if any string of Pj is included in a string of Pi, but not conversely.
An acceptable solution, which is at the same time a morphological analysis, is then a series of partitions P0, P1, ... Pj, ... Pn, each one of which (P0 excepted) is finer than the previous one.
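With partitions represented, for instance, as lists of (start, end) spans, the coarser/finer relation just defined can be sketched in a few lines of Python; the representation and names are illustrative assumptions:

    def is_coarser(Pi, Pj):
        """Pi is coarser than Pj if every string of Pj lies inside some string
        of Pi, but not conversely (partitions given as lists of (start, end))."""
        def inside(a, b):                       # span a contained in span b
            return b[0] <= a[0] and a[1] <= b[1]
        every_j_in_i = all(any(inside(a, b) for b in Pi) for a in Pj)
        every_i_in_j = all(any(inside(a, b) for b in Pj) for a in Pi)
        return every_j_in_i and not every_i_in_j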
Such a morphological analysis is conveniently represented by a tree (similar to that of immediate constituent analysis), or by bracketing, eg.:
1) [tree diagram over |in| |numer| |abil| |is|, equivalent to the bracketing in 2)]

2) ((((in)))(((numer)(abil))((is))))
A morphological analysis is of course an operation more complex than a partition. It corresponds, however, to a more simple definition of a morpheme:
A morpheme is a set of similar strings the intersection of which is empty.
It follows from this definition that any morphological analysis includes the finest partition Pn, into isolated letters, since there is no one-letter string which does not have a non-empty intersection with another string. Likewise, it includes the coarsest partition P0, which is the whole of the text.
This new definition of an acceptable solution does not entail a particularly more complex recognition procedure.
Note, however, that the definition of similarity is clearly wanting. It prevents the algorithm from coping with
This is nothing disastrous, however. The first point at least (* sandhi-type morphemic alternations *) can be fixed with a finer definition of the similarity relation.
(* Probably not. It is difficult to conceive how a similarity relation allowing for sandhi could be other than language specific. But if we read "morph" instead of morpheme, the problem disappears : the algorithm partitions a text into its morphs; another algorithm must be devised to group them into morphemes *)
Before we turn to the objective function, we must generalize some notions already mentioned and introduce new ones.
A string Str which is a member of a morpheme m is a true occurrence of m if it is also a member of at least one partition of the morphological analysis A (or in other words this string is true for that particular analysis). Consider now a partition of this string. This partition is immediate if it consists of true occurrences the number of which is minimum.
Since the number of different possible morphological analyses of any text of reasonable length is so large, the definition of morpheme elaborated above is wanting. It lacks a criterion of quality.
A morpheme is a set of similar strings which occur REPEATEDLY in a text. Its members must then be, in some way, stable (this, however, is only true of those strings which are true occurrences of the morpheme).
Suppose a measure of stability defined. The value of any given morphological analysis can then be estimated by the sum of the stabilities of the morphemes identified in the process. If the measure of stability is good, the higher the sum, the better the analysis, with the best analysis yielding a maximum sum.
Let us consider some possible measurements of stability:
(* Point b) possibly inspired from Zelig Harris' s "From phoneme to morpheme" in Language, v.21 No.3, 1945 *)
Call the first property (a) internal stability, the second (b) external stability.
Consider the internal stability of a string Str(0,6) "passer".
Suppose that the immediate partition of this string is of the form:
(p) (a) (s) (s) (e) (r)
List all the possible partitions of this string into two connected substrings:
P1 (p)|(a) (s) (s) (e) (r)
P2 (p) (a)|(s) (s) (e) (r)
P3 (p) (a) (s)|(s) (e) (r)
P4 (p) (a) (s) (s)|(e) (r)
P5 (p) (a) (s) (s) (e)|(r)
It is clear that wherever "passe" occurs in a Latin text, "r" is likely to occur next, and that wherever "asser" is found, "p" is likely to precede.
Represent the string "passer" onto which partition Pi has been effected by Si,S'i in which S is its left part and S'i its right part.
Let the degree of attraction of S'i by Si be measured by the fraction f(Si,S'i)/f(Si) and that of Si by S'i by f(Si,S'i)/f(S'i), f being here either the relative or the absolute frequency.
(* That is, the degree of attraction of the right part by the left part is equal to the number of occurrences of left and right parts together divided by the total number of occurrences of the left part, with or without the right part *)
The highest degree of attraction is 1 and the lowest is 0.
Let the internal stability of the string be the mean attraction over all its partitions; strings of length 1 (single letters) have a stability of 0 by definition.
External stability can be estimated roughly on a binary scale: complete stability and complete instability. Now, the absolute frequency of a morpheme m represented by a string S which alphabetically includes another, shorter string S' itself representing a morpheme m' cannot be greater than that of m'. Hence, if m and m' occur with exactly the same frequency in the text, m' is, as it were, "immersed" in m. By definition, the external stability of a morpheme is 0 if it is "immersed" in another, otherwise it is 1.
Let the stability of a morpheme be the product of its internal and external stabilities, and the stability of a morphological analysis the sum of the stabilities of its morphemes.
(* It follows from those definitions that the stability of the finest partition (single letters) is zero. This is intuitively agreeable, for this partition is trivial *)
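The three measures can be sketched in a few lines of Python, assuming the text is a plain string, that a morpheme is represented by one of its occurrences (a substring), and that the candidate morphemes against which "immersion" is tested are supplied as a collection of strings; the function names are illustrative.

    def count(text: str, s: str) -> int:
        """Absolute frequency of s in text, counting overlapping occurrences."""
        return sum(1 for i in range(len(text) - len(s) + 1) if text.startswith(s, i))

    def internal_stability(text: str, s: str) -> float:
        """Mean degree of attraction over all splits of s into left and right
        parts: f(left+right)/f(left) and f(left+right)/f(right)."""
        if len(s) < 2:
            return 0.0                 # single letters have stability 0 by definition
        whole = count(text, s)
        if whole == 0:
            return 0.0
        attractions = []
        for i in range(1, len(s)):
            left, right = s[:i], s[i:]
            attractions.append(whole / count(text, left))
            attractions.append(whole / count(text, right))
        return sum(attractions) / len(attractions)

    def external_stability(text: str, s: str, candidates) -> int:
        """0 if s is "immersed" in some longer candidate (same frequency), else 1."""
        for longer in candidates:
            if len(longer) > len(s) and s in longer and count(text, longer) == count(text, s):
                return 0
        return 1

    def stability(text: str, s: str, candidates) -> float:
        return internal_stability(text, s) * external_stability(text, s, candidates)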
The optimal morphological analysis of the text and the list of its true morphemes are obtained simultaneously.
A list of all the morphemes of a text is a gross approximation of the list of its true morphemes (it is much longer and contains many morphemes most of whose occurrences are false). Call this first approximation L0.
A better approximation, L1, is obtained by eliminating morphemes of stability 0.
L1 is much shorter than L0. Unfortunately, at this stage L1 lacks true morphemes of zero stability but re-entering them later is quite easy. L1 is convenient, for it contains all the morphemes the frequency of which is necessary to calculate the stability of any morpheme.
The algorithm consists of five steps.
Build a list L2 of all the morphemes of absolute frequency greater than 1, with each entry in the list consisting of the string representing the morpheme, and of its absolute frequency.
Since any morpheme with an absolute frequency of one has a stability of zero, L2 is shorter than L0 and longer than L1 as far as morphemes with a stability of 0 and a frequency greater than 1 (* eg. one-letter words *) are concerned.
L2 can be obtained by the following procedure:
a) List the alphabet of the text, calculate the absolute frequency of each letter. Remove entries with a frequency of one. This list is the first fragment, fragment of rank 1, of L2. Represent it by L2(1).
b) Given a fragment of rank n, the fragment of rank n+1 is obtained as follows: build the strings of the form Str(0,n)+Str(n,n+1) in which Str(0,n) is a string of L2(n) and Str(n,n+1) a string of L2(1) transposed n positions to the right; calculate their frequencies and eliminate the strings with frequencies not greater than 1.
The procedure terminates when we obtain an L2(n) for which L2(n+1) is empty. The list L2 is the sum of these fragments.
(* In far simpler terms: given a text of length n, list all the substrings of it from length 1 to n which occur more than once *)
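A minimal Python sketch of this first step, following the fragment-growing procedure above: start from the single letters of frequency greater than 1 and repeatedly extend to the right, keeping only strings that still occur more than once. Names are illustrative.

    def count(text: str, s: str) -> int:
        return sum(1 for i in range(len(text) - len(s) + 1) if text.startswith(s, i))

    def build_L2(text: str) -> dict:
        """All substrings of the text occurring more than once, with their
        absolute frequencies, built fragment by fragment as in a) and b)."""
        L2 = {}
        fragment = {c: count(text, c) for c in set(text)}
        fragment = {s: n for s, n in fragment.items() if n > 1}   # fragment of rank 1
        letters = list(fragment)
        while fragment:                       # stop when the next fragment is empty
            L2.update(fragment)
            longer = {}
            for s in fragment:                # extend each string by one letter
                for c in letters:
                    n = count(text, s + c)
                    if n > 1:
                        longer[s + c] = n
            fragment = longer
        return L2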
Calculate the stability of each string of L2.
(But remember that at this stage only strings of length 1 can be considered to constitute true occurrences of morphemes.)
This step is based on the assumption that all the occurrences of the morpheme with the highest stability (greater than zero) are true. Granted this assumption, no string which has a non-empty intersection with an occurrence of this maximum stability morpheme can be a true occurrence of a morpheme.
Therefore, select the morpheme of L2 with the highest stability, and bracket its occurrences in the text (* thus carrying out morpheme recognition and morphological analysis concurrently *)
The frequency of every morpheme of L2 with a non-empty intersection with N occurrences of the selected morpheme is reduced by N. Some false occurrences are thus discarded. When during this process the frequency of any given morpheme falls below 2, it is eliminated from the list.
Since the status (true or false occurrence) and the frequency of strings vary during the procedure, the stability of the remaining morphemes is then calculated anew.
The process is repeated until only morphemes with zero stabilities are left.
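Steps 2 and 3 can be sketched as a greedy loop, reusing the stability() and build_L2() helpers sketched earlier. This is only a rough illustration of the idea: it brackets non-overlapping occurrences of the currently most stable string and discards candidates whose free occurrences fall below 2, but it glosses over the exact recomputation of frequencies that Sukhotin prescribes.

    def occurrences(text: str, s: str):
        return [i for i in range(len(text) - len(s) + 1) if text.startswith(s, i)]

    def greedy_bracketing(text: str):
        candidates = set(build_L2(text))
        taken = [False] * len(text)            # positions already inside brackets
        brackets = []                          # (start, end) of bracketed occurrences
        while candidates:
            scored = {s: stability(text, s, candidates) for s in candidates}
            best = max(scored, key=scored.get)
            if scored[best] <= 0:              # only zero-stability morphemes are left
                break
            for i in occurrences(text, best):
                if not any(taken[i:i + len(best)]):
                    brackets.append((i, i + len(best)))
                    for j in range(i, i + len(best)):
                        taken[j] = True
            candidates.discard(best)
            # keep only candidates that still have at least 2 free occurrences
            candidates = {s for s in candidates
                          if sum(not any(taken[i:i + len(s)]) for i in occurrences(text, s)) > 1}
        return sorted(brackets)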
We now have a partition of the text expressed by bracketing. This partition, however, does not yet represent a morphological analysis.
But first a few definitions.
A pair of brackets which contains no other brackets is of rank 1. A pair of brackets which contains other brackets, the highest rank of which is i-1, is itself of rank i. A segment of text which is not between brackets of rank 1, does not contain brackets, and is between brackets of any rank not necessarily forming a pair, is considered to be between brackets of rank 0 (not represented in the text). A segment between two brackets of rank i is a segment of rank i. A pair of brackets is AUTONOMOUS for rank r if the lowest rank of the brackets by which it is contained is r.
Find the maximum bracket rank of the text. Call it r. Consider the text to be between (invisible) brackets of rank r+1. Enclose all brackets pairs autonomous for rank r+1 with new bracket pairs until the rank of the enclosing brackets is equal to r.
Repeat for all segments of rank r down to 2.
The morphological analysis -- represented by bracketing segments of the text -- is now complete.
The segments which appeared in brackets for the first time at step 4 are true occurrences of zero-stability morphemes (eg. morphemes having only one string as member) and are added to L2.
by B.V. Sukhotin
translated and adapted by Jacques Guy
Let us call INFORMATIVE those texts which are meant to be deciphered.
Note that this concept is different from that of "meaning" proper, for a text which carries a meaning is not necessarily meant to be deciphered without a key. Consequently, meaningless and undecipherable texts, by definition, cannot be informative.
Since encryption is relatively rare, a characteristic feature of texts in a natural language will be their high degree of INFORMATIVENESS. I shall not attempt to formalize this concept any further, as I only wish to suggest directions for further research.
Take for instance a problem very similar to the partitioning of a text into its constituent morphemes. Imagine a text encoded using n-tuples of integers, so that each letter corresponds to one n-tuple. Suppose n known. The problem is: find the beginning of the n-tuples.
It is obvious that (supposing the text circular or its beginning and end missing, so that the solution is not trivial) there are only n classes of starting positions, since each starting point is equivalent to the one obtained moving n·m positions left or right (m being an integer). Thus the set of acceptable solutions is defined: it is the set of partitions of the text into n-tuples, and this set has n elements.
How can the quality of such a partition be judged?
A random text can be considered to be a sequence of symbols produced by randomly drawing symbols from a box. The probabilities of the different symbols must be considered equal and so must be those of the corresponding n-tuples (if they were not, there would be less reason to consider the text to be completely random). It seems obvious that an informative text must differ considerably from a random text. Several functions can be proposed to evaluate its degree of randomness, for instance, the sum of the absolute deviations from the mean probability, the sum of their squares, or the first-order entropy of the text, or its infinite-order entropy.
Choosing between those functions is difficult, even though infinite-order entropy seems best. At any rate, since they are probably all roughly equivalent for the use to which we are about to put them, the simplest one will be used here.
Let us encode the Roman alphabet into ternary 3-tuples, i.e.: a=000, b=001, c=002, d=010, e=011, f=012, g=020, etc...
The short Latin text "Donec eris felix, multos numerabis amicos, tempora si fuerint nubila, solus eris" becomes:
010111110011002011121022122012011101022210102201101200111122 110201102011121000001022122000102022002111122200011102112111 121000122022012201011121022110200110201001022101000122111101 20112201121022122
We have:
Number of 3-tuples occurring N times in partition

     N     P1    P2    P3
     0     10     7     6
     1      3     6     5
     2      3     5     5
     3      1     3     4
     4      4     0     0
     5      3     2     2
     6      0     1     2
     7      1     2     1
     8      2     0     1
     9      0     0     0
    10      0     0     0
    11      0     1     0
The objective function, here the sum of the absolute deviations from the mean frequency, should be very small for a truly random text. Now, since an erroneous partition should resemble a random text more than a correct partition should, the maximum value of the objective function should correspond to the correct partition.
In fact, the sum of the absolute deviations from the mean frequency is 60 for partition P1, 53 for partition P2, and 49 for partition P3, and the maximum sum, 60, does correspond to the correct partition, P1.
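A minimal Python sketch of this computation, assuming the plain a-z encoding given above (the worked example appears to use a slightly reduced Latin alphabet, so the resulting digit string and deviation sums may differ somewhat from the figures just quoted); partition Pk is taken here to start at offset k-1, and all names are illustrative.

    from collections import Counter
    from itertools import product

    def encode(text: str) -> str:
        """Encode each letter as a ternary 3-tuple: a=000, b=001, c=002, ..."""
        digits = ""
        for c in text.lower():
            if c.isalpha():
                v = ord(c) - ord('a')
                digits += f"{v // 9}{(v // 3) % 3}{v % 3}"
        return digits

    def deviation_sum(digits: str, offset: int, n: int = 3) -> float:
        """Sum of absolute deviations of the n-tuple frequencies from their
        mean over all possible n-tuples, for the partition starting at offset."""
        tuples = [digits[i:i + n] for i in range(offset, len(digits) - n + 1, n)]
        counts = Counter(tuples)
        cells = ["".join(p) for p in product("012", repeat=n)]    # the 27 possible 3-tuples
        mean = len(tuples) / len(cells)
        return sum(abs(counts.get(cell, 0) - mean) for cell in cells)

    digits = encode("Donec eris felix, multos numerabis amicos, "
                    "tempora si fuerint nubila, solus eris")
    scores = {f"P{k + 1}": deviation_sum(digits, k) for k in range(3)}
    best = max(scores, key=scores.get)     # the correct partition maximizes the sum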
But the four possible objective functions suggested above take an extreme value when symbols are cyclically repeated throughout the text. Such texts are clearly not what our intuition sees as informative. Therefore, none of these functions provides a measure of the informativeness of texts.
Assume now that a mostly random text is given. Whichever two possible partitions are considered, the corresponding values of the objective function (whether it is the sum of the absolute deviations, of their square, or the entropy of any order) can be expected to be very close.
If, on the other hand, a mostly non-random text is given, the values of these same functions calculated on any two partitions will be just as close as for a random text.
The above observations suggest that, when an informative text is encoded as in our example, the difference of quality between the best and the worst partition can only be infinitely small.
Consider now a text somehow encoded.
The resulting encoded text can be considered to be one particular interpretation of the original.
Assume that some objective function has been defined by which the quality of each possible interpretation can be evaluated.
Then it is possible to produce a "very good interpretation" of an informative text, as well as a "very bad" one.
Now, it seems intuitively true that an informative text can be either "well" understood or "badly" understood. On the other hand, a random text can only be "badly" understood, and an obvious text (a periodic sequence of symbols, for instance) can only be "understood". Here, "understanding" means being able to predict what follows on the evidence of some part of the text.
If so, it would seem natural to try and evaluate to what degree a text is informative from the difference of quality between its best and its worst interpretation:
Q(Informativeness) = Q(Best interpretation) - Q(Worst interpretation)
If "partial interpretation" is defined as "acceptable solution of a decipherment algorithm", it is now possible to answer this question: what is the property of a text which maximizes its partial informativeness? (by which is meant the difference between the quality of the best and the worst acceptable solution). The solution to this problem must lie in the computation of the distributions of the n-tuples of symbols for a class of interpretations which maximize partial informativeness, or in the computation of some statistical properties of those interpretations, such as moments or redundancy. Those are the properties by which it should be possible to ascertain to what degree a text is possessed with the property of being informative.