Language Family Information for the Numbers List

This file supplements the [Numbers Index] with some interesting facts about the world's language families, individual languages, or their numeric systems.

The figure following each language name is its number of native speakers, taken from Lyovin's book.


Indo-European is the best-studied and most widely spoken of the world's language families. Similarities between I-E languages were noted even in ancient times, but the key realization that they derived from an extinct protolanguage, and the important connection to the Indo-Iranian languages, were first clearly stated by William Jones in 1786. Within a century scholars had produced the first reconstruction of Proto-Indo-European.

One of the striking features about PIE is its reliance on vowel changes in conjugation; some of the rare survivals of this in English are verb paradigms such as sing/sang/sung. PIE had a rich system of inflections, including three numbers (singular/dual/plural) and three genders.

A readily available reference on Proto-Indo-European is the back of the American Heritage Dictionary, which is quite interesting to anyone interested in etymology. Why be satisfied with a derivation from Latin or Germanic when you can trace a word back to PIE?

Germanic. The earliest Germanic texts we have are a 4C Gothic translation of the Bible. The earliest English texts are from the 7C. English does not derive from German; rather, both derive from proto-Germanic.

Italic. In ancient times Latin was only one of several Italic languages spoken in Italy; others included Oscan, Umbrian, and Faliscan. Some of these survived into the 1C, but all the modern Romance languages are derived from Latin. The earliest texts in Romance languages are French, from the 9C.

We have an enormous corpus of ancient Latin; the earliest inscriptions date back to about 500 BC. For an introduction to Latin you couldn't do better than Humez & Humez's Latin for People, which contains such delightful sample sentences as Venimus ad Galliam sed non currimus, "We're coming to Gaul but we're not running", or Dulce et decorum est pro patria mori. Amarum et indecorum est a Vesuvio interfici, "It is a sweet and seemly thing to die for one's country. It is a bitter and unseemly thing to be buried by Vesuvius."

Celtic. Irish is an official language of Ireland, and public institutions are named in Irish. The earliest records of any Celtic languages are 1C inscriptions in Gaulish.

Celtic numbers are preserved in counting sets called scores, used in counting sheep, counting stitches, and in children's games. Here's a set from the North Country: yan, tan, tethera, pethera, pimp, sethera, lethera, hovera, covera, dik.

Hellenic. Mycenaean Greek is the language of Linear B, dating to the 14C BCE, and proven to be Greek by Michael Ventris in 1952. Linear B has nothing to do with the Greek alphabet, which was invented centuries later; it was written using a syllabary.

Tocharian A and B are a pair of extinct languages once spoken in Xinjiang, whose existence came to light only in the 1890s.

Albanian was one of the later languages to be assigned to Indo-European; it has replaced a substantial portion of the IE vocabulary.



Slavic. The earliest Slavic inscriptions date back to the 9C.

Anatolian. The texts in Hittite, dating to the 17C BCE, are the oldest Indo-European texts we have, but were discovered only about a century ago. They provided the most spectacular confirmation of a historical-linguistic prediction-- namely Saussure's postulation of coefficients sonantiques, the so-called laryngeals, in Proto-Indo-European, not directly attested in any then-known IE language, but some of which actually turned up in Hittite. On the other hand, Hittite turned out to be more different from the other IE languages than was expected, which has led to some re-evaluation of the protolanguage. Some people consider Hittite and Indo-European to be branches off an earlier "Indo-Hittite"; but my Indo-Europeanist consultant considers this a ploy to avoid having to integrate information from Hittite into IE.

Indo-Iranian. We have Old Persian inscriptions dating to the 6C BCE, and Sanskrit texts dating back to about 1000 BCE. In the 18C, European scholars newly familiar with Sanskrit recognized that it was related to Greek and Latin, and began a philological joyride that ended in the reconstruction of Proto-Indo-European (chauvinistically called Indogermanisch by the mostly German scholars involved). Early on Sanskrit was assumed to be particularly close to the protolanguage, but it has since been realized that this is not the case. Linguists retain a reverence for the accuracy of the ancient Sanskrit grammars, such as those of Panini (4C BCE).

Ardhamagadhi, one of the post-Sanskrit dialects or Prakrits, is the language of the Jain scriptures.


Spoken in southwestern Persia in ancient times; the earliest inscriptions date to the 25C BCE. No generally accepted affiliation, though Ruhlen, following McAlpin, links it to Dravidian.


Dravidian languages are found largely in the southern third of India, but there are pockets further north, notably Brahui, in Pakistan. It's likely that Dravidian once extended over all of India, and was displaced by the Aryan (Indo-European) invaders three millennia ago. Dravidian features such as retroflex consonants have spread to the Indic languages, while Sanskrit has had an enormous influence on Dravidian.


The genetic affiliation of Nahali is controversial. About 40% of the lexicon is cognate to Munda languages, and some linguists therefore put it in that group. Among the numbers, 2-4 are borrowed from Dravidian, and 5-10 from Indic.


A language isolate, spoken in the Pakistani part of Kashmir in a very remote area. It's been linked to Caucasian languages because of its four-gender system (masculine, feminine, animate, other), and to Basque because it's ergative and SOV; but merely typological similarities are linguistically pretty lame.



The Semitic languages are notable for inflections that consist of vowel changes applied to a triconsonantal root. For instance, the Arabic root KTB produces verbal forms like kataba 'he wrote', katabat 'she wrote', taktubu 'you write', taka:taba 'correspond with each other', yukattibu 'cause to write', and nominal forms such as kita:b 'book', kutubi: 'bookseller', ka:tib 'writer', maktaba 'library', and so on.

Semitic languages also have a long written history, starting with Akkadian around 3000 BCE. We have Canaanite inscriptions going back to the 20C BCE. The Tanakh, the Hebrew Bible, was written over a period of a millennium (1200-200 BCE).

The earliest Arabic inscriptions date to the 4C CE, but of course its classic text is the 7C Qur'a:n. Arab regions are noted for diglossia, in which the spoken and written languages are highly divergent. Throughout the Arab world the standard written language (also used for formal speech) is Classical Arabic, which no one speaks as a native language-- it must be learned in school. The spoken language has diverged greatly from this standard, and varies widely between countries as well; uneducated Arabs from different ends of the Arab world cannot communicate with each other.

The Egyptian family boasts some of the oldest written records (from 3000 BCE), as well as spanning the longest time, 4500 years-- Chinese won't equal the record of Ancient Egyptian until about 2700 CE. Modern Egyptian does not descend from Ancient Egyptian but from Arabic. The modern descendant of the pharaohs' language is Coptic, still used as a liturgical language by Egyptian Christians.

Nimbia, a dialect of Gwandara in the Chadic family, is notable for having a duodecimal number system. 12, not shown on the Numbers page, is tùni; 13 is tùni m`bé da '12 + 1', 30 is gùme bi nì shídé '24 + 6', etc.
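The duodecimal glosses given for Nimbia ('12 + 1', '24 + 6') can be reproduced with a small sketch. The function name is invented for illustration, and the code only does the arithmetic; it makes no attempt to generate Nimbia words.

```python
def duodecimal_parts(n):
    """Split n into (largest multiple of 12 not exceeding n, remainder)."""
    dozens, units = divmod(n, 12)
    return dozens * 12, units


# The two examples glossed in the text: 13 = 12 + 1 and 30 = 24 + 6.
for n in (13, 30):
    base_part, rest = duodecimal_parts(n)
    print(f"{n} = {base_part} + {rest}")
```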





#2 is all that's attested. Meroitic was the language of Meroe, an ancient kingdom south of Egypt.



Caucasian languages (which many scholars divide into two to four unrelated families) tend to have SOV word order and ergative case systems-- the same can be said of Basque, which has led to plenty of speculation but no solid proof of relationship. They also tend to have rather baroque consonantal systems-- Ubykh, for instance, has 82 consonant phonemes.



The odd characters in the Khoisan languages (spoken in southwestern Africa) represent clicks, which are used as phonemes only in this group and some neighboring Bantu languages. !Xu~, from this family, has the distinction of having the largest known inventory of phonemes: 141. Most languages have between 20 and 40.


Generally grouped with Niger-Congo as Niger-Kordofanian. I've kept them separate mainly so the classifications in Niger-Congo don't all have to move down a level.

Niger-Congo cannot be considered a well-established family (though some of its subfamilies, such as Bantu, are). There is no reconstruction of Proto-Niger-Congo on a par with IE, Semitic, Austronesian, Algonquian, etc.

An interesting tidbit about Krongo: the numerals are verbs. (This is true of a few Amerind languages as well.)


Most of the languages of Africa (from roughly the southern edge of the Sahara on south) belong to this huge family. The Roman alphabet really breaks down here: not only do most of these languages distinguish open and closed e and o (given distinct symbols in the list), but they're tonal as well. In some languages there are words with a "floating tone", which is not associated with any syllable in the word, but is realized in a following word!

Niger-Congo numeric systems are generally based primarily on fives. The numbers 6-9, for example, are often 5 + 1-4. Sometimes the derivations have become obscured through sound change (compare Spanish once = 10 + 1) or through borrowing (e.g. Swahili has borrowed 6-9 from Arabic). Other derivations are possible as well. Sometimes there's a special word for 8 (itself perhaps derived from 'two fours'), and 9 = 8 + 1; there may likewise be a word for 6 used to derive 7. 9 and sometimes 8 may be expressed as '10 minus 1 (or 2)'.

For higher numbers, the Bantu languages tend to be organized by tens, the western languages by twenties.

The Yoruba number system is notable for its reliance on subtraction: e.g. 19 ookan din logun = 20 - 1, 46 = 60 - 10 - 4, 315 orin din nirinwo odin marun = 400 - (20*4) - 5.
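The subtractive examples can be checked mechanically. In this minimal sketch each number is written as a starting value followed by the amounts taken away, exactly as glossed above; the helper name is made up for illustration.

```python
# Each entry: the number, then the terms from the gloss (first term minus the rest).
EXAMPLES = {
    19: [20, 1],
    46: [60, 10, 4],
    315: [400, 20 * 4, 5],
}


def subtractive_value(terms):
    """Return the first term minus all of the following terms."""
    return terms[0] - sum(terms[1:])


for target, terms in EXAMPLES.items():
    # Confirm that the gloss really adds up to the number it names.
    assert subtractive_value(terms) == target
    print(target, "=", " - ".join(str(t) for t in terms))
```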

The word for 7 in Kimbundu (a Bantu language), sambuari, derives from 6 + 2-- this is a euphemism, replacing the original word for 7, which is taboo. If that seems strange, there are rumors of a major North American civilization in which buildings are built without a 13th floor.

As can be seen by comparing Johnston 1919 with the 1970s Tanzanian Language Survey, compound numbers for 6-9 are being replaced in many languages with the Swahili numbers (themselves from Arabic).


The existence of the Uralic family was recognized in the 18C, before William Jones' famous speech. The earliest attestation of Finnic is a Karelian inscription from the 13C; in Ugric, Hungarian inscriptions from about 1200. A connection with the Altaic languages has been posited, but is due mostly to typological similarities, which are not very convincing.


It isn't at all certain that Altaic is a valid genetic grouping; complicating the question is the fact that these languages have existed in mutual contact for millennia, so that it is not easy to separate borrowing from genetic relationship.


Korean is not closely related to any language. It may be distantly related to Japanese and to Altaic.



The Sinitic languages are (as is well known) tonal languages; and so are the Tai languages and Hmong-- but these are not closely related to Chinese, while the related Tibeto-Burman languages generally aren't tonal. Chinese texts go back to the 17C BCE; Tibetan, to the 7C CE; Burmese, to the 12C CE.

Qiangic. Information on this branch of Tibeto-Burman has only very recently come to the attention of Western scholars, thanks to Chinese research of the '80s and '90s. The extinct Tangut or Xixia language, which is amply attested in a logographic script from the 11C, is now thought to belong to this family.



Tai languages were once spoken much more extensively in southern China, up to the Yangtze River. Tai-Kadai and Chinese have influenced each other, such that it isn't easy to piece together who borrowed what from where. Earlier classifiers thought Tai and Chinese were related, but this is no longer thought to be the case; the resemblances are due to borrowing.


Yumbri is the first language I've run into that is said to have no numbers at all. The words given are glossed 'little' and 'much'. But note that neremoy, at least, looks a lot like 'one' in other Austro-Asiatic languages, e.g. Rengao mói'?



Austronesian is the largest language family in the world, with about 1000 separate languages. It's also well established, with proto-Austronesian partly reconstructed.

People often think that linguists classify languages into families based on similar-sounding words. In fact the basis is regular sound correspondences between languages, whether the words sound the same or not. A neat example comes from the East Santo group: Sakao iedh and Shark Bay tharr don't sound at all alike, nor anything like proto-Vanuatu *vati. But they are in fact all cognates, and help demonstrate that these languages are related.

Linguist Jacques Guy has reconstructed the course of events in this way. Both languages changed bilabials to dentals before front vowels, and lost final vowels; thus *vati --> *thati --> *that.

In Sakao, there was furthermore a complex vowel shift; and then almost all consonants were lenited (weakened), voiceless stops to voiced fricatives, fricatives to approximants: *that --> *thet --> *yedh.

Finally, in Shark Bay, final -t changes to a trill: *that --> *tharr. QED.

...Bam, a Sepik-Madang language, is curious for being a 4-based system. 10 is 'four-two and two', 12 is kiki tuol 'four-three', and so on. Curiously 20 kiki lim uses the usual Austronesian morpheme for 5, but 5 itself doesn't: 5 is kiki be kubua 'four and one'.






Indo-Pacific is not a well-established language family, but a geographical collection of the 60 or so small language families of New Guinea. Their genetic connections, if any, cannot be securely determined until we have a better grasp on the wide-scale grammatical and lexical diffusion that has occurred.

The Kewa numbers represent just the beginning of a 24-member counting sequence. The first five numbers name the little finger through the thumb; but instead of continuing with the other hand the Kewa keeps indicating points a few inches along on the body: 9 = 'forearm', 15 = 'shoulder', 20 = 'ear', 24 rikaa = 'between eyes'.

The Bugilai numbers are etymologically body parts; e.g. 1 tarangesa 'left hand little finger', 5 manda 'thumb', 10 dala 'right breast'.

Kanum and Kimaghana seem to be base 6 systems.

Andaman. 3/4/5 in Aka-Bea-da etc. actually mean 'one more', 'some more', 'all'.


Small families of Australian languages have been identified, but assembling them into larger families has proved frustratingly difficult. R.M.W. Dixon believes that the family tree model is not very useful in Australia; rather, hundreds of languages existed in a dynamic equilibrium, grammatical features and lexemes diffusing across different regions or the whole continent.

Many of the Australian languages have a limited set of numbers. (That doesn't mean they're simple languages-- they tend to be quite complex.) Some number words, as shown, represent not a single number but a range.

I have to wonder when some languages, like Yir Yoront, have a full set of numbers, but we're told that most Australian languages stop at 2, 3, or 4. As in many languages, the number words in Yir Yoront refer directly to the process of counting on the hands: 5 = "whole hand", 7 = "hand entire, fingers two", 10 = "hand-two". It makes me wonder if most fieldworkers are asking the wrong questions.

Amerindian languages

In Indo-European languages we are used to unanalyzable roots for the numbers; but in other families number names can be derivations, often related to the process of counting on fingers and toes-- e.g. Choctaw 5 = talhlhaapih 'the first (hand) finished'; Bororo 7 ikéra metúya pogédu 'my hand and another with a partner'; Klamath 8 ndan-ksahpta 'three I have bent over'; Unalit 11 atkahakhtok 'it goes down (to the feet)'; Zuñi 10 astemthla 'all the fingers'; Shasta 20 tsec 'man' (considered as having 20 countable appendages).


Navajo is the Amerindian language with the greatest number of speakers in the United States-- about 100,000.

Greenberg groups all the Amerindian languages below (that is, excluding Eskimo-Aleut and Na-Dené) into a single family, Amerind. His conclusions are based only on "mass comparison", not the comparative method, and are not accepted by Amerindianists.

The North American languages are well studied, and many families here are well established, often with reconstructed proto-languages. The same cannot be said for South America. Check back in fifty years.



Cree is the Amerindian language with the greatest number of speakers in Canada, eh, with about 80,000.




Nahuatl (Aztec) is famously a vigesimal system-- e.g. 37 is cempoalli oncaxtolli omome '20 + 17', and there is a special word for 400, tzontli (literally 'hair', figuratively 'an abundance'). The numbers from 1 to 19 group into fives (e.g. 17 caxtolli omome '15 and 2'), so the system may be more precisely called a "5-20" system.
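A "5-20" system can be sketched as two successive divisions: first by 20, then by 5. The function below is only an arithmetic illustration with an invented name; it does not produce Nahuatl words, and it stops below 400, where the text says a separate word takes over.

```python
def five_twenty_parts(n):
    """Split n (0 <= n < 400) into its twenties, fives, and units values."""
    if not 0 <= n < 400:
        raise ValueError("this sketch covers 0..399 only")
    twenties, rest = divmod(n, 20)
    fives, units = divmod(rest, 5)
    return twenties * 20, fives * 5, units


# 37 is glossed as '20 + 17', and 17 in turn as '15 and 2'.
print(five_twenty_parts(37))  # (20, 15, 2)
```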


Northern Pame is interesting for being a consistent base-8 system.


Many Mexican, Central American, and Californian languages have number systems based not on 10's but on 20's. This is not always evident from the numbers from 11 to 19, some of which may be compounds as in a decimal system; but it becomes clear from higher numbers-- e.g. 100 is expressed as 'five twenties', and there are special words for powers of 20-- e.g. in Yucatec 20^1 through 20^6 are kal, bak, pic, calab, kinchil, alau.
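To see what the Yucatec words stand for, here is a small sketch that pairs the six names with successive powers of 20 (assuming they run from the first power to the sixth) and breaks a number into base-20 digits. The function and variable names are invented for illustration.

```python
# Names for successive powers of 20, in the order the text lists them.
YUCATEC_POWERS = ["kal", "bak", "pic", "calab", "kinchil", "alau"]


def power_table():
    """Map each name to its value, assuming the list starts at 20 to the 1st."""
    return {name: 20 ** exp for exp, name in enumerate(YUCATEC_POWERS, start=1)}


def vigesimal_digits(n):
    """Return the base-20 digits of a non-negative n, most significant first."""
    if n < 0:
        raise ValueError("n must be non-negative")
    digits = []
    while True:
        n, digit = divmod(n, 20)
        digits.append(digit)
        if n == 0:
            break
    return digits[::-1]


print(power_table()["bak"])   # 400
print(vigesimal_digits(100))  # [5, 0], i.e. 'five twenties'
```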

The Mayan languages are notable for having a fully developed writing system, deciphered only in the 20th century, and for having a symbol for zero. For the story of the decipherment, see Michael Coe's Breaking the Maya Code.


Some Amazonian languages, like Yanomami, have number roots only for 1 to 3. This doesn't at all mean (as hasty observers conclude) that the people can't count past 3. They have fingers and toes and know how to use them; and if a Yanomami leaves 20 arrows by you, and there aren't 20 when he returns, woe to you. A lack of roots just limits the numbers that can be named out loud-- or at least named out loud the same way every time, since speakers may be able to come up with ad hoc names.


Quechua is the Amerindian language with the greatest number of speakers-- over 7 million. It was of course the language of the Inca Empire; but was also spread by Spanish missionary work.

The Incas exchanged accounting information by means of kipus (literally 'knots'), bundles of knotted strings. Each string recorded one or more numbers, and strings were grouped into color-coded bunches, sometimes with totals attached, as in a spreadsheet. The numerical code was decimal; each digit was represented by 0 to 9 knots; the units were made with a different sort of knot so that more than one number could be coded on one string.
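The decimal code described here, with 0 to 9 knots per digit, amounts to ordinary positional notation. The sketch below models only that part; knot types, colors, and string grouping are left out, and the function names are hypothetical.

```python
def knots_per_digit(n):
    """Return the knot count for each decimal digit of n, highest digit first."""
    if n < 0:
        raise ValueError("n must be non-negative")
    return [int(ch) for ch in str(n)]


def from_knots(knots):
    """Rebuild the number from a list of per-digit knot counts."""
    value = 0
    for count in knots:
        if not 0 <= count <= 9:
            raise ValueError("each digit is coded by 0 to 9 knots")
        value = value * 10 + count
    return value


print(knots_per_digit(305))   # [3, 0, 5]
print(from_knots([3, 0, 5]))  # 305
```

A zero digit is simply a position with no knots, which is why the round trip above works without any separate zero marker.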

Urarina (which Ruhlen puts in this group, but others consider an isolate) boasts two very unusual features among the world's languages: it has no /p/ sound (note that Quechua pusaq '8' was borrowed as fusa-); and it is consistently OVS.


Guaraní may be considered the most successful modern Amerindian language. It's spoken by the majority (88%) of the population of Paraguay-- most of which is mestizo, not pure Amerind-- and has a secure place in Paraguayan society. Where in many places you might switch from a formal to an informal pronoun when you get to know someone well, in Paraguay you may switch from Spanish to Guaraní.


The Bakairi have a binary system; numbers above 2 (ahage) are formed by combinations of the words for 1 and 2 (though they stop at 6 and after that count by repeating mera 'this one'). The computer-savvy may object that a binary system should just have words for 0 and 1; but note that that's not how our own decimal numbers work: we have a word for ten.

The Cherente word for 2 (ponhuane) analyzes as 'deer track' (since a deer hoofprint has two separate parts).

Pidgins and creoles

Although the languages in this section are almost all based on West European languages, there are pidgins and creoles based on non-IE languages. Two are listed with the Amerindian languages: Chinook Jargon and the Mobile Trade Language. Other examples are Pidgin Hamer (based on the Omotic language Hamer), Hiri Motu (based on the Austronesian Motu), Kituba (based on the Kongo languages), and Fanagalo (another Bantu pidgin). For more on pidgins and creoles, see Sarah Grey Thomason & Terrence Kaufman, Language Contact, Creolization, and Genetic Linguistics, 1988.

Michif is hard to figure. Oversimplifying, the nouns, pronouns, and numerals (except #1; cf. Cree peyak) are French, and the verbs are Cree-- fairly complex verbs, too. It can't really be considered a pidgin; most likely it developed among bilinguals.

Constructed languages

A priori languages are not based on existing languages; they're often an attempt to create a more logical or more organized way of looking at the world. (The lexicon of Loglan and Lojban is not technically a priori; but as 'logical languages' they certainly fit into this category.)

Many projects have 1 = ba or something like it-- almost inevitable if you work out the numbers in alphabetical order. E.g. Leibniz uses consonants for the digits, in alphabetical order; vowels for the powers of ten, also alphabetical, so 1679 = bohilena. But it's Letellier who wins the prize for conciseness, with one letter per digit: e.g. 1679 = ba:co: (the colons represent macrons).
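Leibniz's scheme as described can be sketched in a few lines. The consonant and vowel inventories below are assumptions chosen to be alphabetical and to reproduce the one example given (1679 = bohilena); the treatment of zero digits is likewise a guess, since the text doesn't say.

```python
CONSONANTS = "bcdfghlmn"  # assumed digits 1..9, in alphabetical order
VOWELS = "aeiou"          # assumed powers of ten: 1, 10, 100, 1000, 10000


def leibniz_name(n):
    """Spell n as consonant+vowel syllables, one per non-zero digit."""
    if not 0 < n < 100_000:
        raise ValueError("this sketch covers 1..99999 only")
    digits = str(n)
    syllables = []
    for position, ch in enumerate(digits):
        digit = int(ch)
        if digit == 0:
            continue  # assumption: zero digits are simply skipped
        power = len(digits) - 1 - position
        syllables.append(CONSONANTS[digit - 1] + VOWELS[power])
    return "".join(syllables)


print(leibniz_name(1679))  # bohilena
```

Because each syllable carries its own power of ten, the syllables could in principle come in any order without changing the value.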

Hilbe has a showoffy trick for representing higher numbers: rXr is one million to the X power-- e.g. rar = 10^6, rer is 10^12-- up to a million to the millionth power, which has its own name, qar = 10^6000000. And qar to the qarth power is xar. Beats a googolplex to hell any day.
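Since a million is ten to the sixth, a million to the X power is ten to the 6X. A tiny sketch of that exponent arithmetic, with an invented function name, confirms the figures for rar, rer, and qar:

```python
MILLION_EXPONENT = 6  # one million is 10 to the 6th


def power_of_ten_for(x):
    """Return e such that (one million) to the x equals 10 to the e."""
    return MILLION_EXPONENT * x


print(power_of_ten_for(1))        # 6, for rar
print(power_of_ten_for(2))        # 12, for rer
print(power_of_ten_for(10 ** 6))  # 6000000, for qar
```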

Some of the language names are unwieldy or repetitive, and they're represented on the numbers page by the creator's name.

A posteriori languages are based on existing natural languages (or are developments of previous a posteriori languages, e.g. Ido).

Artlangs are languages developed for personal or artistic reasons alone.

Tepa is worth checking out; it's based on Amerindian models, and designed by an expert in Numic.

Marnen calls DiLingo "possibly the funniest conlang"; I have to agree. It's hard to read some sentences in it without laughing.

Maktalu is a duodecimal language; 11 and 12 are ushi and fani; the others are siblings or ancestors.

Cispa is intended for eight-fingered aliens; it has both octal and decimal variants. Jarrda is spoken by raccoons.

The words for 11-18 in Draseléq include quirky etymologies like 12 = "the divisible one" and 17 = "the imperfect".

My Wedei is a base-6 system; Methaiun is a 5-10-18 system (18 is oranda), since Almeans have five fingers but four toes.
