
Why do we call a cat a “cat” and a pen a “pen”? It seems like it could just as easily be the other way around. More generally: why do any of the concepts to which we refer have the labels that they do?
For many years, the standard account in linguistics was that the relationship between a word’s form (how it’s spoken, written, or signed) and its meaning is entirely arbitrary. This view is sometimes called the arbitrariness of the sign, and it is most closely associated with the Swiss linguist Ferdinand de Saussure. That’s not to say there’s no explanation at all: a historical linguist could trace a detailed etymology of the word “cat” from its roots, accounting for the patterns of borrowing and sound change that sculpt a word’s journey through time.
Rather, the point is that there’s no a priori reason to think there’s anything especially cat-like about the word “cat” (or pen-like about the word “pen”). A similar principle applies across languages: ignoring the obvious cases like cognates, we generally wouldn’t expect words for the same concept to be the same—or even particularly similar—across unrelated languages. This principle of arbitrariness is generally thought to be so fundamental that it’s often considered a “design feature” of human language.
In recent decades, however, there’s been a resurgence of interest in the idea that some form-meaning mappings are not entirely arbitrary. Some of this interest has come from a broader appreciation of linguistic diversity, and some has come from applying computational methods to discover new systematic relationships. (I say “resurgence” because—as many scholarly papers on this topic make clear—classical work on language often assumed some kind of systematicity.[1]) Below, I give some examples of non-arbitrariness in language, and briefly describe some of those new methods for characterizing exactly how arbitrary or non-arbitrary the lexicon is.
The obvious exception: onomatopoeia
Most people will likely be familiar with one exception to this rule: “onomatopoeic” words are those that directly imitate the sound to which they refer. Animal noises like “meow” or “moo” are probably the best-known examples, but words like “beep” and “hiccup” are also onomatopoeic.
Of course, humans might imitate lots of sounds, but not every act of imitation constitutes an onomatopoeic word. Two things make onomatopoeic words interesting: first, they are indeed words (i.e., we can use them in a phrase like “the car beeped”); and second, they’re at least partly conventionalized. A cat’s meow doesn’t sound exactly like “meow”, but English speakers have converged upon a relatively stable set of sounds to communicate that concept across contexts (and cats).
Notably, onomatopoeic words aren’t exactly the same across languages. The sound of drinking might be conveyed by “glug glug” (English), “glu glu” (Italian), “kliuk kliuk” (Lithuanian), or some other sequence of similar phonemes. That’s because onomatopoeic words conform to other properties of a language, like its inventory of sounds and the rules about how those sounds combine. At the same time, it’s obvious that onomatopoeic words for the same concept are more similar across languages than you’d expect by chance.
Beyond onomatopoeia: iconicity and phonaesthemes
Non-arbitrariness doesn’t always have to manifest directly as onomatopoeia. The term iconicity refers to the phenomenon whereby the form of a word bears some structural or analogical relationship to its meaning. Iconicity includes onomatopoeia, but it’s a broader concept and is thus a bit harder to define in an intuitive way. This 2015 paper by Mark Dingemanse and others provides the following definition:
A prominent form of non-arbitrariness is iconicity, in which aspects of the form and meaning of words are related by means of perceptuomotor analogies.
In my experience, the concept is easier to illustrate with examples: often, you know iconicity when you see it.
For example, many languages have “ideophones”: words that depict sensory images or sensations, and whose sounds (“phones”) correspond in some way to their meaning. Examples include vowel contrasts (e.g., “i” vs. “a”) that generally correspond to magnitude (small vs. large); reduplication of word segments corresponding to the repetition or multiplicity of an object or event; vowel lengthening corresponding to a “longer” event; and much more (the same 2015 paper includes a tabular summary).
One particularly illustrative example of iconicity is the so-called “bouba/kiki effect”. Asked to match a nonsense word like “bouba” or “kiki” to a shape, humans show systematic preferences: words like “bouba” are matched to round, “softer” shapes, while words like “kiki” are matched to sharper, “spikier” shapes. Variants of this effect have been replicated across languages with different writing systems, for participants of different ages, and with a number of different stimuli (e.g., “maluma” vs. “takete”).

Another form of non-arbitrariness is systematicity: here, there’s no intrinsic reason to expect a particular form-meaning relationship, but the relationship is still more common than you’d expect under a purely arbitrary system. A good example is the phonaestheme: a cluster of sounds that consistently pairs with a particular meaning. English contains a number of phonaesthemes: words for light or vision often start with “gl-” (glimmer, glitter, glisten, glare, etc.), words about movement often start with “fl-” (fly, flutter, flicker), words about the nose or mouth often start with “sn-” (snout, snarl, sniffle, snort, snot), and much more.
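To make “more common than you’d expect” concrete, here is one simple way to test it: a permutation test over a labeled lexicon. This is a minimal sketch in Python, using an invented handful of words and light/vision labels rather than any real dataset; an actual analysis would need a dictionary-scale lexicon to have statistical power.

```python
# Sketch of a permutation test for a phonaestheme: do "gl-" words carry
# light/vision meanings more often than chance? The lexicon is invented.
import random

lexicon = {
    "glimmer": True, "glitter": True, "glisten": True, "glare": True,
    "glove": False, "globe": False, "shine": True, "spark": True,
    "lamp": True, "beam": True, "table": False, "run": False,
    "chair": False, "door": False,
}

def gl_light_count(pairs):
    """Count words that start with 'gl' AND have a light/vision meaning."""
    return sum(1 for word, is_light in pairs if word.startswith("gl") and is_light)

observed = gl_light_count(lexicon.items())

# Shuffle the meaning labels over the same word forms, destroying any
# form-meaning link, and see how often chance matches the observed count.
words, labels = list(lexicon), list(lexicon.values())
n_extreme, n_perm = 0, 10_000
for _ in range(n_perm):
    random.shuffle(labels)
    if gl_light_count(zip(words, labels)) >= observed:
        n_extreme += 1

print(f"observed = {observed}, p ≈ {n_extreme / n_perm:.3f}")
```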
How do researchers measure iconicity?
Non-arbitrariness has come back into style for a couple of reasons. First, there’s greater appreciation of linguistic diversity, and thus more recognition of phenomena like ideophones that happen to be relatively rare in English. Second, the availability of language resources (like text corpora or lexicons), the development of online recruiting platforms like Prolific, and improvements in computational methods have made it possible to quantify sources of non-arbitrariness at a scale that simply wasn’t feasible before.
One strategy for quantifying non-arbitrariness is to ask humans directly how iconic different words are. A 2017 study led by Bodo Winter did exactly that, focusing on a set of ~3,000 English words. While many words were rated as relatively arbitrary (i.e., no relationship between how they sound and what they mean), there were exceptions: interjections like “ouch” or “shh” were rated as very iconic, and words related to sensory experience were rated as more iconic on average. A 2012 study on British Sign Language (BSL) used a similar strategy, asking participants to rate the degree to which a particular sign resembled the meaning it conveyed; the researchers found that more iconic signs tended to be learned earlier than more arbitrary ones, suggesting that iconicity may be related to the learnability of a word.[2] More recently, another study led by Bodo Winter collected and released ratings for ~14,000 English words.[3]
Another strategy is to use computational methods to infer properties like iconicity. Here, the basic logic is simple (even if the implementation isn’t always): you need some way to represent the meaning of a word, as well as some way to represent its form; then, you ask whether different formal properties (e.g., certain sounds) reliably correlate with different semantic properties (e.g., certain concepts). In practice, representing the form of a word tends to be much easier than representing its meaning: there are well-established conventions for describing the sounds of a word (e.g., the International Phonetic Alphabet), as well as methods for quantifying how different two word forms are (e.g., Levenshtein distance).
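To illustrate the form side, here is a minimal sketch of Levenshtein distance. Note that studies of this kind typically compute distances over phonetic transcriptions rather than spelling; the orthographic version below is just for illustration.

```python
# Minimal dynamic-programming implementation of Levenshtein distance:
# the number of single-character insertions, deletions, and substitutions
# needed to turn one string into another.
def levenshtein(a: str, b: str) -> int:
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty string
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        prev = curr
    return prev[-1]

print(levenshtein("cat", "mat"))  # 1: one substitution
print(levenshtein("cat", "pen"))  # 3: all three characters differ
```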
But “meaning” is notoriously hard to formalize. Researchers have to rely either on hand-coded databases of “concepts” or on computationally derived (and imperfect) proxies for meaning like word vectors.
An example of the former is a 2016 study led by Damián Blasi. The researchers used a Swadesh list—a list of concepts alleged to be universal or near-universal across languages—to narrow down a set of about 40 core concepts (e.g., bone, star) that appear across thousands of languages. The authors then asked whether words for those concepts tended to use similar sounds across over 6,000 languages. They found that certain sounds did reliably co-vary with certain concepts, more than you’d expect by chance or from etymological similarities alone: for instance, the sound /n/ was positively associated with words for “nose”. That doesn’t mean every language has a word for “nose” that contains the sound /n/, but it does suggest there’s something about that sound that may make it particularly “well-suited” (to borrow an argument from Plato’s Cratylus) to words of that nature.
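The logic of that kind of test can be sketched in a few lines of Python. The word lists below are invented for illustration (they are not the study’s data), and the real analysis additionally controls for language families and geographic contact, which this sketch omits.

```python
# Sketch: does /n/ appear in words for "nose" more often than in a
# baseline set of words for other concepts? Word lists are invented;
# a real study would draw on cross-linguistic phonetic databases.
import random

nose_words = ["nose", "nez", "nariz", "naso", "anwa", "hana", "pua"]
other_words = ["hand", "main", "mano", "te", "ruka", "kamay", "lima",
               "star", "etoile", "stella", "hoshi", "whetu", "tara"]

def contains_n(words):
    return sum("n" in w for w in words)

observed_rate = contains_n(nose_words) / len(nose_words)

# Permutation test: pool all words, repeatedly draw random "nose"-sized
# samples, and ask how often chance matches the observed /n/ rate.
pool = nose_words + other_words
n_extreme, n_perm = 0, 10_000
for _ in range(n_perm):
    sample = random.sample(pool, len(nose_words))
    if contains_n(sample) / len(sample) >= observed_rate:
        n_extreme += 1

print(f"observed /n/ rate = {observed_rate:.2f}, p ≈ {n_extreme / n_perm:.3f}")
```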
The latter approach relies on the same technology underlying large language models: namely, words can be represented as vectors reflecting their distributional patterns in a large corpus of text. The relative position of these vectors often corresponds to human judgments about word similarity. As such, researchers can use the proximity of two vectors as a proxy for how similar in meaning the corresponding words are. Combined with some measure of similarity between word forms—e.g., the number of edits required to transform one word (“cat”) into another (“mat”)—researchers can then ask whether words that are more similar in form also tend to be more similar in meaning. This was the approach taken in a 2017 study led by Isabella Dautriche, which found a reliable correlation between formal similarity (i.e., how many edits were required to transform one word into another) and semantic similarity (i.e., the proximity of two word vectors in vector space) across 100 languages. The average correlation itself was quite low (r ≈ 0.04) but reliably non-zero. This is consistent with the view that the lexicon is mostly arbitrary, but not entirely so.[4]
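Here is a toy version of that pipeline. The three-dimensional “word vectors” are invented stand-ins for corpus-derived embeddings, and with only five words the resulting correlation is pure noise; the point is the shape of the analysis, which at scale yields the small positive correlation described above.

```python
# Toy form-meaning correlation: compare form similarity (negative edit
# distance) with meaning similarity (cosine similarity of word vectors)
# across all word pairs. The vectors are invented for illustration.
from functools import lru_cache
from itertools import combinations
from math import sqrt

vectors = {
    "cat": [0.9, 0.1, 0.1], "rat": [0.8, 0.2, 0.1],
    "mat": [0.1, 0.9, 0.2], "rug": [0.1, 0.8, 0.3],
    "pen": [0.2, 0.1, 0.9],
}

@lru_cache(maxsize=None)
def levenshtein(a: str, b: str) -> int:
    """Edit distance, written compactly via recursion with memoization."""
    if not a or not b:
        return len(a) or len(b)
    return min(levenshtein(a[1:], b) + 1,
               levenshtein(a, b[1:]) + 1,
               levenshtein(a[1:], b[1:]) + (a[0] != b[0]))

def cosine(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (sqrt(sum(x * x for x in u)) * sqrt(sum(y * y for y in v)))

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

form_sims, meaning_sims = [], []
for w1, w2 in combinations(vectors, 2):
    form_sims.append(-levenshtein(w1, w2))
    meaning_sims.append(cosine(vectors[w1], vectors[w2]))

print(f"r = {pearson(form_sims, meaning_sims):.3f}")
```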
Why iconicity?
All this raises the question of why these iconic words exist. Does iconicity serve any kind of communicative or cognitive function? If it does, then we should expect iconicity to be non-randomly distributed throughout the lexicon. In particular, we might expect certain words to be more iconic than others.
There’s a growing body of evidence that words learned earlier tend to be less arbitrary (i.e., more iconic or systematic). Researchers have identified consistent correlations between age of acquisition and various measures of non-arbitrariness across a number of languages, including British Sign Language and English.
There’s also some experimental evidence: for instance, one 2011 study found that children were better able to learn the actual mappings between Japanese sound-symbolic verbs and their meanings than randomly swapped mappings. That is, preserving the real sound-symbolic relationships in the Japanese lexicon helped children learn new verbs, whereas random (i.e., arbitrary) mappings resulted in slower and more error-prone learning.
Neither source of evidence is perfect on its own, of course: correlational studies of age of acquisition and non-arbitrariness don’t establish causation, and lab experiments often lack ecological validity. But together, they paint a compelling picture that non-arbitrariness could indeed scaffold word learning. The argument also makes sense at face value: if you have to learn a new set of symbols for a new set of meanings, it stands to reason that you’ll have an easier time if those symbols contain some cue that reliably predicts which meaning they’re paired with.
In fact, the argument makes so much sense that one might wonder why language isn’t even more iconic. After all, if arbitrary mappings are harder to learn, why bother with them at all? There are a few potential answers, which I’ll explore in an upcoming post.
[1] For more background, I recommend checking out Umberto Eco’s The Search for the Perfect Language.
[2] More on that below.
[3] I used that dataset in a 2024 paper investigating whether large language models (LLMs) can reproduce those iconicity judgments.
[4] Other work has used more sophisticated statistical techniques, such as kernel regression, to identify systematicity in specific “pockets” of the lexicon.