Do we speak languages, or do languages speak us? It seems that the more we look into the regular patterns of natural languages, the more we bump up against surprising connections, almost as if human tongues were not purely arbitrary conventions but the handiwork of an intelligent designer.
Researchers at the Massachusetts Institute of Technology have recently discovered a principle governing the length of words used in 10 Indo-European languages. Steven T. Piantadosi, Harry Tily, and Edward Gibson demonstrated that a word’s information content (that is, how predictable the word is in context) is a better predictor of its length than how frequently it is used.
The law is not altogether unprecedented, as it is actually a major improvement on the 75-year-old theory of linguist George Kingsley Zipf. Zipf’s influential theory states that the words used most frequently in a language tend to be short.
A summary at the National Science Foundation website of the study’s findings explains simply:
The goal of the research was to compare Zipf’s word frequency theory to Piantadosi and colleagues’ word predictability theory–the idea that the average amount of information a word conveys in context–its predictability–determines word length.
Using an Internet database, the researchers studied how often all possible sequences of two, three or four word combinations occur together in order to estimate how predictable any word is when it’s typically written.
By knowing this, they could determine whether context and predictability were better determinants of word length than frequency of use.
“For instance, in a context like ‘Monday night ____’ the word ‘football’ is very predictable and therefore conveys very little information,” said Piantadosi, a cognitive scientist in the Ph.D. program at MIT and lead author of the study. “But, in a context like ‘I ate ____,’ the missing word is very unpredictable, but conveys a lot of information.”
The hypothesis was that average information contained in two, three or four word sequences should in part determine the length of words, either in letters or syllables, since that’s how an optimal code would behave. In this example, “football” and the two words preceding it demonstrated the effect.
“The only way these effects can get into the lexicon is if our linguistic systems, and the mechanisms of language change, are sensitive to communicative pressures,” said Piantadosi.
The sequences of words that people use are coded–their letters, syllables, sounds, etc.–for efficient communication and are better predictors of word length than frequency alone, he said.
“This means word sequences provide efficient codes for the meanings they convey, relative to the statistical regularities in language,” he said. “That’s our claim.”
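To make the summary's idea of "information in context" concrete, here is a minimal sketch of how such a figure could be estimated. It is not the researchers' code: it uses a tiny made-up corpus instead of an Internet database, looks only at two-word sequences rather than two, three and four, and the names `average_information` and `corpus` are my own inventions for illustration. The information a word carries in one context is taken to be the negative base-2 logarithm of its probability given the preceding word, and that figure is averaged over every occurrence of the word.

```python
import math
from collections import Counter, defaultdict


def average_information(tokens):
    """Mean information (in bits) of each word given the one word before it.

    Hypothetical helper for illustration: bigram contexts only, no smoothing.
    """
    # How often each word appears as a context (the last token has no follower).
    context_counts = Counter(tokens[:-1])
    # How often each (previous word, word) pair occurs.
    bigram_counts = Counter(zip(tokens, tokens[1:]))

    totals = defaultdict(float)  # summed information per word
    occurrences = Counter()      # number of in-context occurrences per word
    for (prev, word), n in bigram_counts.items():
        p = n / context_counts[prev]        # estimate of P(word | prev)
        totals[word] += n * -math.log2(p)   # n occurrences, each -log2(p) bits
        occurrences[word] += n

    return {word: totals[word] / occurrences[word] for word in totals}


# Toy corpus (an assumption; the study used a large Internet database).
corpus = (
    "monday night football monday night football "
    "i ate pizza i ate soup i ate rice"
).split()

info = average_information(corpus)
for word in ("football", "pizza"):
    print(f"{word}: {info[word]:.2f} bits, {len(word)} letters")
```

In this toy corpus "football" always follows "night," so it scores 0 bits, while "pizza" is one of three equally likely words after "ate" and scores log2(3), about 1.58 bits. The study's question is then whether figures like these track word length more closely than plain frequency of use does.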
Languages — or human lexicons, as the MIT scientists call them — are efficient forms of structured communication that are designed by “communicative pressures,” taking into consideration the statistical interdependencies among words when establishing word lengths. Every word is just the right length, selected as if by “natural law” to aid communication.
The researchers hypothesized that language works like an “optimal code” and their inquiry supports the notion that it does. It’s as if lexicons are the fruit of thousands of years of evolutionary selections in which words of the optimal length in the context of the entire lexicon are selected for survival whereas 15-letter words for “of” become extinct. Unsurprisingly, the words for intoxicants and drugs are mercifully short: beer, ale, dope, crack, smack, etc.
The study supports, I think, an attitude of admiration toward language and its hidden, enormously complex and interlaced beauty. No team of language engineers working in a clean room could have invented the sort of lexical complexity already present in natural languages (at least the 10 studied). Also, the fact that so many languages show such similar patterns of predictability should caution theorists too eager to emphasize radical differences among languages rather than their universal mechanisms of operation, no matter how dimly those mechanisms are understood.
(Hat tip: Languagehat.com for the link.)