The numbers list: Notes and warnings

Numbers 2.0!

As of September 2016, the numbers page has been entirely redone. On my end, the source file has also been upgraded, which makes updating far easier. (The input is actually a raw text file; the display HTML is built on the fly in JavaScript.)

How do I use it?

If it's not clear: push the List button to get a list of numbers! You must have JavaScript enabled.

You can limit the list by using the other controls, allowing you to concentrate on just the data you want.

The page makes very full use of Unicode. It looks great on my Mac. If you get a lot of square boxes instead of characters, you probably have bad Unicode support. Try installing the Linux Libertine or Gentium fonts. If that's not possible, I've kept around the old non-Unicode pages; see below.

Symbols and conventions

The colored headings indicate language families. A name in brackets (e.g. [Andean]) is speculative.

Languages with over a million speakers are named in boldface. The data (from David Crystal) is a couple of decades old but it still indicates the major languages.

If the name starts with +, the language is extinct. (That means no native speakers; of course people often learn ancient languages for scholarship or other reasons.) Languages are disappearing at a truly alarming rate, and my sources on this are getting old, so probably many languages are listed as alive when they're really not.

Names in italics are dialects or other variants. Don't take this as very important; see below.

As should be obvious, if you see a notation like [5] po [1], that means to substitute the names for 5 and 1 into the expression.

Less obviously: if you see … [2], that means that the number is formed just like the number on its left, only using [2] rather than [1]. E.g. if you see that 6 is [5] so ɣitne [1] and 7 is … [2], that's equivalent to [5] so ɣitne [2]. It saves a lot of space.
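
For the programmatically inclined, here is a minimal sketch of how these two conventions could be expanded mechanically. It is only an illustration, not the page's actual code: the function names expandEllipses and substitute are made up, ONE and FIVE are placeholders rather than real number words, and the sketch assumes that the reference which changes is the last bracketed one in the pattern to the left.

```javascript
// Hypothetical helpers; names and data layout are assumptions for illustration.

// Expand "… [n]" entries: reuse the pattern of the entry to the left,
// swapping its final bracketed reference for [n].
function expandEllipses(row) {
  const out = [];
  for (const entry of row) {
    const m = /^…\s*\[(\d+)\]$/.exec(entry.trim());
    if (m && out.length > 0) {
      const prev = out[out.length - 1];
      // Assumption: the reference that varies is the last one in the pattern.
      out.push(prev.replace(/\[\d+\](?!.*\[\d+\])/, `[${m[1]}]`));
    } else {
      out.push(entry);
    }
  }
  return out;
}

// Replace each [n] with the name for n; unknown references are left as-is.
function substitute(pattern, names) {
  return pattern.replace(/\[(\d+)\]/g, (whole, n) => names[n] ?? whole);
}

// Usage with placeholder names (not real words from any language).
const names = { 1: "ONE", 5: "FIVE" };
const expanded = expandEllipses(["[5] so ɣitne [1]", "… [2]"]);
console.log(expanded[1]);                     // [5] so ɣitne [2]
console.log(substitute("[5] po [1]", names)); // FIVE po ONE
```

Keeping the two steps separate means the ellipsis can be expanded even when the names for the referenced numbers aren't known yet.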

A number preceded by * is a reconstructed form.
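
If you're processing the list yourself, the + and * markers are easy to pick up. The following is a rough sketch under my own assumptions: parseLanguageName and parseForm are hypothetical names, the inputs are placeholders, and nothing here reflects how the page itself stores its data.

```javascript
// Hypothetical helpers illustrating the "+" (extinct) and "*" (reconstructed) markers.

// A language name starting with "+" marks an extinct language.
function parseLanguageName(raw) {
  const s = raw.trim();
  const extinct = s.startsWith("+");
  return { name: extinct ? s.slice(1).trim() : s, extinct };
}

// A number word preceded by "*" is a reconstructed form.
function parseForm(raw) {
  const s = raw.trim();
  const reconstructed = s.startsWith("*");
  return { form: reconstructed ? s.slice(1) : s, reconstructed };
}

// Placeholder inputs, not actual entries from the list.
console.log(parseLanguageName("+SomeLanguage")); // { name: 'SomeLanguage', extinct: true }
console.log(parseForm("*form"));                 // { form: 'form', reconstructed: true }
```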


The Sources Page gives the sources for each language (and also lists languages I don't have, and connects the languages to other wide-scale classifications: Ruhlen, Voegelin & Voegelin, Campbell, and the Ethnologue).

I dearly appreciate everyone who's sent me numbers; but I want to particularly salute those whose kindness and hard work have been extraordinary: Jarel Deaton of Ohio, who is single-handedly responsible for more than a quarter of the numbers seen here; Eugene S.L. Chan of Hong Kong, who sent me his entire Austronesian database; and Carl Masthay of St. Louis and Pavel Petrov of Kaliningrad, who sent me their enormous, worldwide collection of numbers.

Special thanks to Claudia Griffith and the staff of the SIL Library in Duncanville, Texas, whose wonderful hospitality made a week of research in the summer of 2004 both pleasant and productive.

Some caveats

There are often complications (e.g. declension of numbers, or different series of numbers for different purposes), and I haven't had room for them here.

If you want to trace relationships, numbers may be misleading, as they are easily borrowed. Conversely, related languages may have numbers that aren't cognate; they may have innovated the names in different ways.

The standard orthography or standard dialect may have changed since my source on a language was published.

Hundreds of millions of English speakers agree that the numbers are one, two, three, etc. But not all languages are standardized in this way. For unwritten languages, different linguists' word lists may be strikingly different. Their ears may not be attuned to the language; or there may be dialectal variation, or even sound change. Here are a couple of examples, one from Asia, one from Africa:

Bru muəj ba:r paj po:n sə:ng təpat təpu:l təkual tikeas məncit
Bru muoi bar pái poun sau'ng tapoât tapul takual takêh muoi chít
Gurma yèn.dó lyé lwọ̈bà lèle: pà:nì pyêgà
Gurma n lè nlé nta nna nmu nluoba n lele nni n-ya ka piga

I use standard orthographies, where there is one, rather than phonetic transcriptions. This makes comparison a bit more difficult; but I prefer it, for two reasons. First, it reduces errors; even if I can correctly interpret a source's phonetic description, there may be orthographic irregularities that make a straight transcription ludicrous. Second, an orthography is generally closer to a phonemic representation, which is arguably what people have in their heads.

Languages and dialects

People can get very excited about what's a language vs. what's a dialect. There is nothing inherent in the language variety to tell us what it is. Linguists in general use "language" to refer to a mutually intelligible group of dialects (but note that intelligibility can be partial).

Ordinary people generally call something a "language" if it has a prestigious standard form; but that's a fact about people's attitudes, not about language. (Nonetheless, if there is a standard form, it will be on the list!)

I generally rely on Voegelin & Voegelin, or on the original source for the numbers, in deciding whether to list something as a dialect (italicized). Some of my sources list multiple dialects; I usually try to pick the most widely spoken ones, and list others only if they're interestingly divergent.

Corollary: please don't complain to me about what's a dialect or a language-- you're arguing about nothing. (But feel free to send me additional dialects, or point out where I've messed up the names.)

Especially in the Amerind sections, I sometimes list older sources which may be of historical interest.

What's not here?

How many languages aren't here? Well, there's almost 5000 living languages listed in Ruhlen's volume; I have numbers for about 83% of them, so there's at least a thousand more. (If the math doesn't seem to work out, note that I have plenty of dialects and conlangs not included in Ruhlen's list.) There are about 200 languages with more than a million speakers, all of which are in the list.

Am I going to do higher numbers? Or zero? Probably not, unless I do it for a subset of languages only. Many of the sources don't even have numbers above ten.

How was this done?

People sometimes ask me how I accumulated all these numbers, or how to do this sort of research.

The answer is simple: libraries. I have access to a few good university libraries, and when I can I visit others. You look in grammars, dictionaries, and books or journal articles surveying entire families.

And, if possible, find others who've been bitten by the same bug!

The old files

If you can't read the Unicode files, I've kept the oldest versions of the numbers pages, which use no Unicode at all. (They do use the Latin-1 characters the web has always supported.)

The following conventions apply only to the old files.

The picture shows the representations used for a number of IPA characters. I haven't been able to retain all phonetic distinctions, and some have been lost-- for instance, the distinction between a circumflex (â) and a hachek (ǎ).

For African tonal languages, a macron (¯) indicates a high level tone, not length, and is represented as _. | is another tone, usually low level. For non-African languages, a macron indicates length and is represented as :.

? indicates the glottal stop (but if my sources spell it as an apostrophe or q, I follow them)

bold indicates a character which was dotted in the original source-- usually an emphatic or retroflex consonant

italic indicates open e and o and lax i and u, or a character that was italicized in the original source