The numbers list: Notes and warnings
You can limit the list by using the other controls, allowing you to concentrate on just the data you want.
The page makes very full use of Unicode. It looks great on my Mac. If you get a lot of square boxes instead of characters, you probably have bad Unicode support. Try installing the Linux Libertine or Gentium fonts. If that's not possible, I've kept around the old non-Unicode pages; see below.
Languages with over a million speakers are named in boldface. The data (from David Crystal) is a couple of decades old but it still indicates the major languages.
If the name starts with +, the language is extinct. (That means no native speakers; of course people often learn ancient languages for scholarship or other reasons.) Languages are disappearing at a truly alarming rate, and my sources on this are getting old, so probably many languages are listed as alive when they're really not.
Names in italics are dialects or other variants. Don't take this as very important; see below.
As should be obvious, if you see a notation like 5 po 1, that means to substitute the names for 5 and 1 into the expression.
Less obviously: if you see … , that means that the number is formed just like the number on its left, only using  rather . E.g. if you see that 6 is  so ɣitne  and 7 is … , that's equivalent to  so ɣitne . It saves a lot of space.
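The simple substitution convention (digits standing in for number names) can be sketched in Python. This is only an illustration, assuming an expression written as digits around a linking word, such as `5 po 1`. The function name `expand` and the number words `lim` and `tan` are invented placeholders, not data from the list.

```python
import re

def expand(expression, names):
    """Replace each digit sequence in `expression` with the name
    already listed for that number in `names`."""
    def lookup(match):
        # Convert the matched digits to an int and look up its name.
        return names[int(match.group())]
    return re.sub(r"\d+", lookup, expression)

# Hypothetical number words, invented purely for this example.
names = {1: "tan", 5: "lim"}

print(expand("5 po 1", names))  # lim po tan
```

A number with no listed name raises a `KeyError` here, which seemed preferable to silently leaving a digit in the output.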
A number preceded by * is a reconstructed form.
I dearly appreciate everyone who's sent me numbers; but I want to particularly salute those whose kindness and hard work have been extraordinary: Jarel Deaton of Ohio, who is single-handedly responsible for more than a quarter of the numbers seen here; Eugene S.L. Chan of Hong Kong, who sent me his entire Austronesian database; and Carl Masthay of St. Louis and Pavel Petrov of Kaliningrad, who sent me their enormous, worldwide collection of numbers.
Special thanks to Claudia Griffith and the staff of the SIL Library in Duncanville, Texas, whose wonderful hospitality made a week of research in the summer of 2004 both pleasant and productive.
There are often complications (e.g. declension of numbers, or different series of numbers for different purposes), and I haven't had room for them here.
If you want to trace relationships, numbers may be misleading, as they are easily borrowed. Conversely, related languages may have numbers that aren't cognate; they may have innovated the names in different ways.
The standard orthography or standard dialect may have changed since my source on a language was published.
Hundreds of millions of English speakers agree that the numbers are one, two, three, etc. But not all languages are standardized in this way. For unwritten languages, different linguists' word lists may be strikingly different. Their ears may not be attuned to the language; or there may be dialectal variation, or even sound change. Here are a couple of examples, one from Asia, one from Africa:
Bru: muəj ba:r paj po:n sə:ng təpat təpu:l təkual tikeas məncit
Bru: muoi bar pái poun sau'ng tapoât tapul takual takêh muoi chít
Gurma: yèn.dó lyé tà nâ mù lwọ̈bà lèle: nî pà:nì pyêgà
Gurma: n lè nlé nta nna nmu nluoba n lele nni n-ya ka piga
I use standard orthographies, where there is one, rather than phonetic transcriptions. This makes comparison a bit more difficult; but I prefer it, for two reasons. First, it reduces errors; even if I can correctly interpret a source's phonetic description, there may be orthographic irregularities that make a straight transcription ludicrous. Secondly, an orthography is generally closer to a phonemic representation, which is arguably what people have in their heads.
People can get very excited about what's a language vs. what's a dialect. There is nothing inherent in the language variety to tell us what it is. Linguists in general use "language" to refer to a mutually intelligible group of dialects (but note that intelligibility can be partial).
Ordinary people generally call something a "language" if it has a prestigious standard form; but that's a fact about people's attitudes, not about language. (Nonetheless, if there is a standard form, it will be on the list!)
I generally rely on Voegelin & Voegelin, or on the original source for the numbers, in deciding whether to list something as a dialect (italicized). Some of my sources list multiple dialects; I usually try to pick the most widely spoken ones, and list others only if they're interestingly divergent.
Corollary: please don't complain to me about what's a dialect or a language-- you're arguing about nothing. (But feel free to send me additional dialects, or point out where I've messed up the names.)
Especially in the Amerind sections, I sometimes list older sources which may be of historical interest.
How many languages aren't here? Well, there's almost 5000 living languages listed in Ruhlen's volume; I have numbers for about 83% of them, so there's at least a thousand more. (If the math doesn't seem to work out, note that I have plenty of dialects and conlangs not included in Ruhlen's list.) There are about 200 languages with more than a million speakers, all of which are in the list.
Am I going to do higher numbers? Or zero? Probably not, unless I do it for a subset of languages only. Many of the sources don't even have numbers above ten.
The answer is simple: libraries. I have access to a few good university libraries, and when I can I visit others. You look in grammars, dictionaries, and books or journal articles surveying entire families.
And, if possible, find others who've been bitten by the same bug!
The following conventions apply only to the old files.
The picture shows the representations used for a number of IPA characters. I haven't been able to retain all phonetic distinctions, and some have been lost-- for instance, the distinction between a circumflex (â) and a hachek (ǎ).
For African tonal languages, a macron - indicates a high level tone, not length, and is represented as _. | is another tone, usually low level. For non-African languages, a macron indicates length and is represented as :.
? indicates the glottal stop (but if my sources spell it as an apostrophe or q, I follow them)
bold indicates a character which was dotted in the original source-- usually an emphatic or retroflex consonant
italic indicates open e and o and lax i and u, or a character that was italicized in the original source