An article appearing in the June 2014 issue of Language calls into question results previously published in several prestigious international journals, including Science. The Language article, “A statistical comparison of written language and non-linguistic symbol systems,” was authored by Richard Sproat, a Research Scientist at Google. The LSA posted a news release raising awareness of the Sproat article earlier this week.

Sproat’s analysis comes in response to a number of papers which argue that statistical analyses of symbol combinations can provide insights into the origins of written language. One paper which appeared in the journal Science in 2009 argued that a particular statistical measure — bigram conditional entropy — showed that the Indus Valley symbols behave more like those in linguistic texts than those of non-linguistic systems. Another paper published in Proceedings of the Royal Society claimed that a more sophisticated set of entropic measures put Pictish symbols in the same category as linguistic texts.
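For readers unfamiliar with the measure, bigram conditional entropy asks how uncertain the next symbol is once the current symbol is known: H(Y|X) = -Σ p(x, y) log2 p(y | x), summed over adjacent symbol pairs. The following is a minimal sketch in Python that estimates it from raw counts. The function name and the input format (a list of symbol sequences) are assumptions made for illustration, and the sketch omits the smoothing and corpus-size corrections that the published analyses may have applied.

```python
import math
from collections import Counter


def bigram_conditional_entropy(texts):
    """Estimate H(next symbol | current symbol) in bits from raw bigram counts.

    `texts` is an iterable of symbol sequences (strings or lists of tokens).
    Hypothetical helper for illustration; no smoothing is applied.
    """
    pair_counts = Counter()   # counts of (current, next) pairs
    first_counts = Counter()  # counts of `current` as the first item of a pair
    for seq in texts:
        for current, following in zip(seq, seq[1:]):
            pair_counts[(current, following)] += 1
            first_counts[current] += 1

    total_pairs = sum(pair_counts.values())
    if total_pairs == 0:
        return 0.0

    entropy = 0.0
    for (current, _following), n in pair_counts.items():
        p_pair = n / total_pairs          # joint probability p(x, y)
        p_cond = n / first_counts[current]  # conditional probability p(y | x)
        entropy -= p_pair * math.log2(p_cond)
    return entropy
```

A rigidly ordered system, in which each symbol fully determines the next, scores 0 bits; a system in which each symbol is followed by one of two others with equal probability scores 1 bit. The claim in the earlier papers was that linguistic texts fall in a characteristic band between such extremes.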

As part of his work on whether symbol systems such as these exemplify written language, Sproat developed large, structured corpora from a variety of non-linguistic systems, both ancient and modern, including Mesopotamian deity symbols (Babylonia), Pennsylvania barn stars ("hex signs"), weather forecast icon sequences from www.wunderground.com, and Unicode characters for Asian emoticons. He compared these to corpora developed from fourteen languages representing a variety of ancient and modern writing-system types.

From the point of view of the measures that had been proposed in the previous literature, all of the non-linguistic symbol systems in Sproat’s collection of corpora behaved the same as the linguistic systems. However, he also found that a novel measure of the amount of local repetition and a version of previously examined entropic measures could accurately distinguish two different categories of symbol systems. Moreover, his statistical procedure, unlike the earlier ones, classifies both the Pictish and Indus Valley symbols as non-linguistic.
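To give a sense of what a local-repetition statistic might look like, the sketch below computes, for a set of texts, the share of repeated symbols whose repetition is immediately adjacent to the previous occurrence. This is an illustrative assumption, not necessarily the definition used in the article: the function name, the input format, and the exact counting rule are all hypothetical.

```python
def local_repetition_ratio(texts):
    """Fraction of within-text repetitions that are immediately adjacent.

    `texts` is an iterable of symbol sequences (strings or lists of tokens).
    Hypothetical measure for illustration only; the counting rule is assumed.
    """
    adjacent = 0  # repeats that directly follow the same symbol
    total = 0     # all repeats of a symbol already seen in the same text
    for seq in texts:
        seen = set()
        previous = None
        for symbol in seq:
            if symbol in seen:
                total += 1
                if symbol == previous:
                    adjacent += 1
            seen.add(symbol)
            previous = symbol
    return adjacent / total if total else 0.0
```

Under this counting rule, a text such as "aab" yields 1.0 (its only repetition is adjacent), while "aba" yields 0.0 (the repeated symbol is separated from its earlier occurrence). The intuition is that the two kinds of system may differ in how often symbols are repeated back to back.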

Despite these promising results, Sproat cautions against relying too heavily on statistical measures to analyze ancient symbol systems that have not been deciphered. He argues that a truly reliable demonstration that a collection of symbols exemplifies written language requires supporting empirical evidence, such as a credible decipherment or independent archeological evidence of a related culture of active literacy. What is clear, however, is that the previously proposed statistical methods do not work for the intended purpose.

A pre-print version of the article is available for review on the LSA website. Read our news release for more information.