LETTER SERIAL CORRELATION POINTS TO THE COMMON DESCENT OF NATURAL LANGUAGES
By Mark Perakh
This essay was prompted by Steve Reuland’s post to the Panda’s Thumb weblog titled “What good is half an underlying language structure?” ( www.pandasthumb.org/pt-archives/000853.html ), which refers to Carl Zimmer’s posts to Loom (http://www.corante.com/loom/archives/2005/02/25/building_gab_part_one.php and http://www.corante.com/loom/archives/2005/03/01/building_gab_part_two.php ). One point touched upon in passing by Zimmer and by some commenters was the question of whether or not natural languages have all evolved from the same proto-language.
Such an idea was, in particular, strongly pushed by Academician Nikolay Marr in the USSR. For some 30 years Marr had been acclaimed in the USSR as the greatest linguist of all time, whose teachings were supposedly in full agreement with Marxism-Leninism. Then, suddenly, in 1950, Stalin changed his mind, and millions of copies of a thin booklet were published whose author was claimed to be Stalin himself. It explained what “genuine Marxist linguistics” was. In this booklet Marr was declared a pseudo-scientist and his theory denounced as anti-Marxist.
As I understand it, the notion of a single proto-language is shared by many linguists.
I would like to briefly report on some data which, I believe, provide strong empirical support for the notion of the intrinsic unity of all natural languages, specifically evident in their written form. The experimental data in question were obtained in work conducted a few years ago by Brendan McKay of the Australian National University (Canberra) and myself.
We developed a new method for the statistical analysis of texts, dubbed Letter Serial Correlation (LSC). Although we have conducted hundreds of measurements on many texts in 12 languages, as well as on a number of gibberish strings created in various ways (and also on the famous Voynich manuscript, often referred to as the “most mysterious manuscript in the world”), so far our results have been reported only in a series of articles on my personal website ( http://members.cox.net/marperak/Texts). Twice in recent years we prepared a paper for an international journal on computational linguistics with a concise presentation of our method and the results obtained, but in both cases we opted to postpone publication because each time some new modifications improving the method came to mind. Besides, both Brendan and I have been busy with other projects and did not devote as much time to LSC as it, perhaps, deserved.
The data obtained by the LSC method demonstrated the intrinsic unity of the structure of all studied languages (Hebrew, Aramaic, Greek, Latin, English, German, Italian, Spanish, Russian, Czech, Finnish, and Yiddish). Most of the biblical texts (both in Hebrew and in translations) were studied, as well as such diverse texts as Moby Dick, The Song of Hiawatha, Macbeth, the UN convention on Sea Trade, Tolstoy’s War and Peace (in the Russian original and in translations), the full text of a Russian newspaper, and many others.
I believe our data vividly show that all meaningful texts, regardless of language, authorship, etc., have the same intrinsic structure, reflected in particular in the existence in all meaningful texts of what we called the Average Domain of Minimal Letter Variability (ADMLV). Gibberish strings, both highly ordered and highly disordered, do not possess this feature.
Briefly, the LSC method is as follows (I’ll describe the latest version, which differs slightly from that reported in the articles posted to my site). Our computer program performs several actions on a text stored on a disk: (1) It counts the total number Mi of occurrences of each letter i in the entire text. (2) It chooses a “window” in the text which is n letters long, where n is an even number varying from 2 to L/2 if L is an even number, or to (L-1)/2 if L is an odd number (L is the total length of the text expressed in the number of letters). Each window is divided into two equal “panes,” 1 and 2, of length m = n/2 each. For each value of n the program counts the numbers Xi1 and Xi2 of occurrences of each letter i in panes 1 and 2, respectively. The window is moved along the text, and for each position of the window the program calculates the expression (Xi1 - Xi2)^2 for each letter. Then the program calculates the sum Sm of all such expressions over all positions of the window and over all letters of the alphabet.
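The steps just described can be sketched in a few lines of Python. This is only an illustration, not our actual program: the function names are made up for the example, and it assumes that the window advances one letter at a time and that only alphabetic characters, lowercased, count as letters. Neither detail is specified above.

```python
from collections import Counter

def letters_only(text):
    """Keep alphabetic characters only, lowercased (an assumed preprocessing step)."""
    return [ch.lower() for ch in text if ch.isalpha()]

def measured_sum(letters, n, step=1):
    """Sum of (Xi1 - Xi2)^2 over all letters i and all window positions.

    The window of n letters is split into two panes of m = n // 2 letters.
    step=1 (window moved one letter at a time) is an assumption.
    """
    if n < 2 or n % 2 != 0:
        raise ValueError("n must be an even number >= 2")
    m = n // 2
    total = 0
    for start in range(0, len(letters) - n + 1, step):
        pane1 = Counter(letters[start:start + m])
        pane2 = Counter(letters[start + m:start + n])
        # A Counter returns 0 for a letter that is absent from a pane.
        for letter in set(pane1) | set(pane2):
            total += (pane1[letter] - pane2[letter]) ** 2
    return total

def lsc_table(text, step=1):
    """Map each even window length n (from 2 up to half the text) to Sm."""
    letters = letters_only(text)
    max_n = len(letters) // 2  # L/2 for even L, (L-1)/2 for odd L
    return {n: measured_sum(letters, n, step) for n in range(2, max_n + 1, 2)}
```

Calling lsc_table on a text returns the values of Sm keyed by n, which is the table that is plotted as the “Sm vs. n” curve. Recounting both panes from scratch at every position is slow for long texts; a practical program would update the counts incrementally as the window moves.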
The program generates a table in which the values of Sm, the Measured Serial Correlation Sum, are listed for all values of n. Finally, the program plots the graph of Sm vs. n. Simultaneously, the program computes the Expected Serial Correlation Sum (Se) as a function of n, using the theoretical formula we derived for a random distribution of letters.
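The formula for Se is not reproduced in this essay. As an illustration only, here is one way such an expectation can be obtained under a specific assumption: that the text is a uniformly random permutation of its own letters. Under that assumption the pane counts are hypergeometric, and for a letter occurring Mi times in a text of L letters the expected value of (Xi1 - Xi2)^2 in a single window works out to 2 m Mi (L - Mi) / (L (L - 1)). The sketch below sums this over letters and window positions and checks it against shuffled copies of the text. The function names and the one-letter step are assumptions of the example, and the formula should not be taken as the one used in our program.

```python
import random
from collections import Counter

def measured_sum(letters, n, step=1):
    """Sum of (Xi1 - Xi2)^2 over all letters and all window positions."""
    m = n // 2
    total = 0
    for start in range(0, len(letters) - n + 1, step):
        pane1 = Counter(letters[start:start + m])
        pane2 = Counter(letters[start + m:start + n])
        for letter in set(pane1) | set(pane2):
            total += (pane1[letter] - pane2[letter]) ** 2
    return total

def expected_sum(letters, n, step=1):
    """Expected sum if the letters were randomly permuted (assumed null model)."""
    L = len(letters)
    if L < 2:
        return 0.0
    m = n // 2
    positions = len(range(0, L - n + 1, step))
    per_window = sum(
        2 * m * M * (L - M) / (L * (L - 1))  # E[(Xi1 - Xi2)^2] for one letter
        for M in Counter(letters).values()
    )
    return positions * per_window

def shuffled_average(letters, n, trials=200, seed=0, step=1):
    """Average measured sum over shuffled copies, as a numerical cross-check."""
    rng = random.Random(seed)
    pool = list(letters)
    total = 0
    for _ in range(trials):
        rng.shuffle(pool)
        total += measured_sum(pool, n, step)
    return total / trials
```

For a real text, comparing measured_sum with expected_sum at each n shows where the letter distribution departs from what random placement of the same letters would give.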
The results shown on my site were obtained by an earlier version of the method, in which the window was not moved along the text; instead, the program divided the text into k equal “chunks” and measured the sums for each pair of adjacent chunks. The results obtained by the two versions differ only in secondary details: the newer version removes a certain inconvenience in the original method and generates a smoother curve, but does not generate fundamentally different “Sm vs. n” curves.
The “Sm vs. n” curves for all meaningful texts in all studied languages had quite a distinctive shape, with a number of characteristic points which were absent in the graphs for gibberish texts. Many such graphs can be seen on my site at http://members.cox.net/marperak/Texts .
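For comparison, the earlier chunk-based variant can be sketched as follows. Again this is only an illustration under stated assumptions: the function name is hypothetical, the chunk length is passed in directly rather than the number of chunks k, and any letters left over after the last full chunk are simply dropped.

```python
from collections import Counter

def chunked_sum(letters, chunk_len):
    """Sum of squared letter-count differences over adjacent equal chunks.

    The text is cut into k = len(letters) // chunk_len consecutive chunks;
    trailing letters that do not fill a chunk are ignored (an assumption).
    """
    if chunk_len < 1:
        raise ValueError("chunk_len must be at least 1")
    k = len(letters) // chunk_len
    counts = [
        Counter(letters[j * chunk_len:(j + 1) * chunk_len]) for j in range(k)
    ]
    total = 0
    # Compare each chunk with its right-hand neighbour.
    for left, right in zip(counts, counts[1:]):
        for letter in set(left) | set(right):
            total += (left[letter] - right[letter]) ** 2
    return total
```

Because the chunk boundaries are fixed, each chunk length yields far fewer pairs than a sliding window does, which is one plausible reason the moving-window version gives a smoother curve.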
One of the characteristic points seems to be of special interest. It is a distinctive deep minimum on the “Sm vs. n” graph which is present on all such curves for meaningful texts regardless of language, authorship, etc., but does not exist on the curves for gibberish texts (and, as expected, does not exist on “Se vs. n” curves).
This minimum testifies to the existence in meaningful texts of a distinctive Average Domain of Minimal Letter Variability. This is a length of text within which the distribution of letter frequencies is characterized by a maximal frequency of occurrence of the same subset of letters. Within a stretch of text either shorter or longer than the ADMLV, the variability of letter occurrences is larger than within the length of the ADMLV. Details of the measurements, calculations, and interpretation of the data can be seen on my site.
The length of ADMLV differs depending on language but varies only in a narrow range for different texts in the same language. For example, for all Hebrew and Aramaic texts, both biblical and secular, the length of ADMLV is invariably between 42 and 46 letters. In English texts the length of ADMLV varies between 60 and 140 letters, which corresponds to a certain extent to the difference between these two writing systems – in Hebrew there are no letters for vowels so the text’s portion in Hebrew containing a certain amount of a message necessarily comprises fewer letters than a corresponding segment in English.
The natural interpretation of the ADMLV is that it represents the average length of texts wherein a specific topic or notion is the subject of the narrative and this predetermines a relatively high frequency of repeated occurrences of the same letters.
The existence of ADMLV, which finds its empirical reflection in the minimum on the LSC curves, seems to be an ineliminable feature of all meaningful texts, regardless of language. It testifies to the deep unity of various languages and supports the notion of all languages’ evolution from the same proto-language via descent with modification.
There seems to be an analogy between biological evolution and that of languages. The evolution of languages is a fact: for example, today’s English is so different from that of Chaucer that nobody in his right mind could deny such an evolution. I guess the creationists would say this is “microevolution,” as Chaucer’s English and today’s English are both still English. And what about, say, Latin and its descendants: Italian, French, Spanish, Portuguese, Romanian, etc.?
While the fossil record, for obvious reasons, necessarily is incomplete and has many gaps, the evolution of a language is often well recorded in all of its stages because of the preservation of written texts.
There is no difference in principle between the evolution of a language from Chaucer’s stage to today’s stage and evolution resulting in the emergence of a new language: Italian from Latin, or Russian and Ukrainian from Old Slavic (two different languages stemming from the same “progenitor,” which separated around the 11th to 12th centuries), or Czech, Polish, Bulgarian, Serbo-Croatian, and Macedonian from an even earlier proto-Slavic. The difference is one of degree, so that evolution of a language can naturally grade into evolution to a new language, no longer understandable to the speakers of the original language, provided the two groups of speakers are geographically separated. Likewise, there is no reason why evolution within a species cannot extend to the loss of interbreeding ability between two geographically separated subspecies, thus resulting in the appearance of a new species, i.e. in “macroevolution.”