Deriving Proto-World with tools you probably have at home

Discussions of 'Proto-World' have gotten quite a bit of press lately-- not as much as Di's divorce, but about as much as any topic in historical linguistics ever gets.

Is there anything to it? Very probably not-- which is a pity, because getting back to Proto-World sounds like a lot of fun, and now it seems like the only alternative is to wait for aliens to come by who had a tape recorder running one or two hundred thousand years ago.

Hans Henrich Hock gave a talk at CLS 29 on Ruhlen and Greenberg's "world etymology" maliq'a 'swallow, throat', pointing out quite a few serious methodological problems. It may be worth repeating some of his points. To start with, here's R&G's supporting citations:
Language             Family          Form       Gloss
Proto-Afro-Asiatic   Afro-Asiatic    *mlg       'suck, breast, udder'
Arabic               Afro-Asiatic    m-l-j      'suck the breast'
Old Egyptian         Afro-Asiatic    mndy       'woman's breast, udder'
Proto-Indo-European  Indo-European   *melg-     'to milk'
English              Indo-European   milk       'to milk, milk'
Latin                Indo-European   mulg-e:re  'to milk'
Proto-Finno-Ugric    Finno-Ugric     *mälke     'breast'
Saami                Finno-Ugric     mielga     'breast'
Hungarian            Finno-Ugric     mell       'breast'
Tamil                Dravidian       melku      'to chew'
Malayalam            Dravidian       melluka    'to chew'
Kurux                Dravidian       melkha:    'throat'
Central Yupik        Eskimo-Aleut    melug-     'to suck'
Proto-Amerind                        *maliq'a   'to swallow, throat'
Halkomelem           Almosan         m@lqw      'throat'
Kwakwala             Almosan         m'lXw-'id  'chew food for the baby'
Kutenai              Almosan         u'mqolh    'to swallow'
Chinook              Penutian        mlqw-tan   'cheek'
Takelma              Penutian        mülk'      'to swallow'
Tfaltik              Penutian        milq       'to swallow'
Mixe                 Penutian        amu'ul     'to suck'
Mohave               Hokan           malyaqe'   'throat'
Walapei              Hokan           malqi'     'throat, neck'
Akwa'ala             Hokan           milqi      'neck'
Cuna                 Chibchan        murki-     'to swallow'
Quechua              Andean          malq'a     'throat'
Aymara               Andean          malyq'a    'throat'
Iranshe              Macro-Tucanoan  moke'i     'neck'
Guamo                Equatorial      mirko      'to drink'
Surinam              Macro-Carib     e'mo:kï    'to swallow'
Faai                 Macro-Carib     mekeli     'nape of the neck'
Kaliana              Macro-Carib     imukulali  'throat'

Now, there's no denying that seeing such a list is suggestive, and that it seems like there must be something in it. I'll maintain, however, that this is simply self-delusion-- a consequence of the human ability to make connections even in the face of near-random data.

Take a closer look at the list; the rules for this game are evidently quite lax. The vowels are completely ignored. The middle consonant varies from l to ly to lh to n to r to zero. The end consonant ranges from g to j to d to k to q to q' to kh to k' to X to zero. Switching around medial consonants seems to be allowed; extra consonants and syllables can appear where needed.

Observe the semantic variation as well: body parts ranging from neck to nape to throat to breast to cheek; actions including swallowing, milking, drinking, chewing, and sucking. Some defenders of Ruhlen & Greenberg make much of the probability of finding such lists among given numbers of families; but notice that one can pretty much pick and choose which languages from a family to include. If Greek doesn't do it for you, try Latin; if Hebrew doesn't work, use Arabic.
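To get a feel for how much room these rules leave, here is a rough Python sketch that simply multiplies out the variants named above. The consonant and meaning lists are read off the two preceding paragraphs. The function chance_of_some_hit and the figures fed to it (a 1% chance per candidate word, 100 candidate words) are made-up assumptions for illustration, not estimates of anything.

```python
from itertools import product

# Variants read off the two paragraphs above; "" stands for zero.
MIDDLE = ["l", "ly", "lh", "n", "r", ""]
FINAL = ["g", "j", "d", "k", "q", "q'", "kh", "k'", "X", ""]
MEANINGS = ["neck", "nape", "throat", "breast", "cheek",
            "swallow", "milk", "drink", "chew", "suck"]

# Every consonant skeleton the template accepts (vowels are ignored).
skeletons = ["m" + mid + fin for mid, fin in product(MIDDLE, FINAL)]
# Every (skeleton, meaning) pair that would count as a "hit".
targets = list(product(skeletons, MEANINGS))

def chance_of_some_hit(p_single, n_candidates):
    """Chance that at least one of n independent candidates is a hit."""
    return 1 - (1 - p_single) ** n_candidates

print(len(skeletons))   # 60
print(len(targets))     # 600
# Hypothetical figures: a 1% chance per word, 100 words to choose from.
print(round(chance_of_some_hit(0.01, 100), 3))   # 0.634
```

This undercounts the real freedom, since it leaves out swapped medial consonants and extra syllables. The last line is the pick-and-choose effect in miniature: a long shot per word becomes a good bet once you may shop among many words and languages.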

The truth is that lists like this are not hard to produce-- au contraire. Just to demonstrate this, I've taken a number of words at random in Chinese and looked for 'cognates' in Quechua (and wherever else I could think of one), matching as best I could the level of phonetic and semantic variation evidenced in R&G's list above. If I had more dictionaries at hand I'd find you more.

Chinese ren 'person'
Quechua runa 'person'

Chinese ch'ung 'insect'
Quechua chinchi 'type of insect'
English chigger

Chinese shui 'water'
Quechua sut'u 'wet'
French suée 'sweat'
Greek hudor 'water'
Dutch schuit 'boat'
Turkish su 'water'

Chinese shuohua 'talk'
Quechua suka 'whistle'
French charler 'chat'

Chinese lao 'old'
Quechua laqla 'old'
Tok Pisin lapun 'old'

Chinese nai 'breast'
Quechua ñuñu 'breast'
French néné 'breast'
Bulgarian nenka 'breast'

Chinese sheng 'rise'
Quechua seqay 'rise'

Chinese cheh 'this'
Quechua chay 'that'
French ce 'this/that'

Chinese chihfan 'eat'
Quechua chipay 'close mouth'
French chef 'cook'

Chinese chung 'middle'
Quechua chawpi 'center'
Italian centro 'center' (c = ch)

Chinese ti 'earth'
Quechua tiksimuyu 'earth'
Spanish tierra 'earth'

Chinese ch'ing 'please'
Quechua hinay 'do thus'

Chinese wang 'king'
Quechua waminqa 'chief'

Chinese you 'again'
Quechua yapa 'addition'
Spanish ya 'already'

Chinese kung 'work'
Quechua kunay 'carry'
English gung-ho 'eager to work'

Chinese ch'uan 'river'
Quechua chumay 'dip in water'
Spanish chupar 'drink, suck'
Dutch schoon 'clean'

Chinese lai 'come'
Quechua riy 'go'
French aller 'go'

Chinese ai 'love'
Quechua ayni 'mutual help'
French aimer 'love'

Chinese san 'mountain'
Quechua senqa 'mountain peak'
French chaîne 'mountain range'

Chinese nü 'woman'
Quechua ñusta 'princess'
Dutch nuf 'aloof girl'
Greek (gy)ne 'woman'
Latin (femi)na 'woman'
French nana 'woman'
German -in fem. suffix

Chinese ma 'mother'
Quechua mama 'mother'
French maman 'mother'

Chinese nan 'difficult'
Quechua nanaq 'painful'

Chinese kei 'give'
Quechua qoy 'give'
Scots gie 'give'

Now, anyone can see that almost all of these correspondences are completely bogus. We know where French suée comes from, and it's not from Chinese. R&G really gain the benefit of obscurity here: how many of us can determine whether they are (unconsciously) playing the same kind of tricks with Tfaltik and Guamo as I am playing with Chinese and Quechua here? (Amerindian specialists, in fact, are quite skeptical about R&G's claims.)

(By the way, if anyone thinks I'm using odd words or glosses, so are R&G. I can't even find their Quechua word malq'a in eight Quechua dictionaries; the usual Quechua word for 'throat' is q'oto. Mallq'a is the Aymara word for 'throat', but I don't know where the 'to swallow' gloss comes from; Aymara for 'to swallow' is thataña.)

All this is intended to show how easy it is to find such spurious correspondences. But that's not the end of it; Ruhlen & Greenberg have the opposite problem as well: there's not only too much variation in their list, there's too little. Languages that really are related have diverged much more in 6000 years than some of R&G's words seem to have diverged in at least 10,000.

Hock uses Hindi and English as an example. The following words, for instance, are real cognates:

cakka: wheel
pa:nch five
si:~g horn
chah six
pissu: flea

Surely R&G would be embarrassed to pick words that far apart as cognates for Proto-World; with that level of phonetic resemblance, everything is related to everything. On the other hand, they might seize upon such a pair as Hindi lu:t. 'rob' and English 'loot'... but these are not cognates; English borrowed the word from Hindi. The actual English cognate of Hindi lu:t. is 'leaf'... which illustrates as well some of the semantic divergence that can occur in 6000 years.

Applied to the Indo-European family (whose languages we know from careful comparative work to be related), R&G's mass comparison would yield large numbers of both false positives (lu:t. and loot, day and dies, have and habere) and false negatives (cakka: and wheel, lu:t. and leaf, date and dacha, milk and lettuce). Applied to unrelated languages, the method will generate long lists of bogus resemblances due to chance (as in my Quechua/Chinese comparison above).

I'm tempted to say that the true cognate of maliq'a in English is 'malarkey'. Only the comparative method can reveal whether any of the relationships postulated by R&G are real. But the comparative method takes time and patience, so it will probably be at a disadvantage in the marketplace of ideas for a long time, compared to a method which offers quick answers of the type we want to hear.

Or maybe Chinese does derive from Quechua?

When I first posted this stuff to the Net, one gentleman wondered aloud (wondered anet?) if I might have proved that Chinese and Quechua are related. Some days it's not worth getting out of bed.

Similar words with similar meanings do not prove that languages are related. They might point to a relationship-- but they might also be due to borrowing ('gung ho' really is from Chinese); they might be due to universal processes like babytalk or onomatopoeia; and above all they may just be chance.

This seems to be hard for some people to accept. Just look at ren and runa, or gaijin and goyim, they seem to think-- how could that possibly be due to chance?

These people should be treated with respect. They are the people who made Las Vegas what it is today.

What are the chances of finding maliq'a-style pseudo-cognates? Well, empirically, based on my experiences finding the above Quechua/Chinese list, the answer is "One half." That is, with a little ingenuity, and given languages with reasonably compatible phonologies, you can find a 'cognate' between two unrelated languages about once out of every two words you try.

People sometimes offer statistical algorithms showing that this cannot possibly be; but a good rule of thumb is that when reality doesn't match your algorithm, you throw out the algorithm, not reality. Finding meaningless resemblances between languages is easy. If your probability estimates say it's not, you're doing something wrong-- most likely describing absurdly close matches when calculating your probabilities, but using absurdly loose ones when searching for cognates.
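The gap between close and loose matching is easy to see in a toy simulation. The Python sketch below invents two unrelated "languages" out of random consonant skeletons and random meanings, then counts how often a word in one finds a 'cognate' in the other under a strict rule and under a loose one. Everything in it is an assumption made up for the demonstration: the 12 consonants, the four sound classes, the 200 meanings grouped ten to a "semantic field", and the dictionary sizes. The rates it prints are artifacts of those choices and say nothing about real languages; only the direction of the difference matters.

```python
import random

CONSONANTS = "pbmtdnkgqszl"      # 12 toy consonants (assumption)
# Loose matching lumps them into four classes: pbm, tdn, kgq, szl.
CLASS = {c: i // 3 for i, c in enumerate(CONSONANTS)}
N_MEANINGS = 200                 # toy meanings, ten per "semantic field"

def random_skeleton(rng):
    """Two-consonant skeleton; vowels are ignored entirely."""
    return rng.choice(CONSONANTS) + rng.choice(CONSONANTS)

def strict_match(a, b):
    """Same consonants and exactly the same meaning."""
    (skel_a, mean_a), (skel_b, mean_b) = a, b
    return skel_a == skel_b and mean_a == mean_b

def loose_match(a, b):
    """Consonants of the same class, meanings in the same field."""
    (skel_a, mean_a), (skel_b, mean_b) = a, b
    same_field = mean_a // 10 == mean_b // 10
    same_classes = all(CLASS[x] == CLASS[y] for x, y in zip(skel_a, skel_b))
    return same_field and same_classes

def hit_rate(words_a, words_b, match):
    """Share of words in A with at least one 'cognate' somewhere in B."""
    hits = sum(any(match(a, b) for b in words_b) for a in words_a)
    return hits / len(words_a)

rng = random.Random(0)           # fixed seed so reruns agree
lang_a = [(random_skeleton(rng), m) for m in range(N_MEANINGS)]
lang_b = [(random_skeleton(rng), rng.randrange(N_MEANINGS))
          for _ in range(2000)]

strict_rate = hit_rate(lang_a, lang_b, strict_match)
loose_rate = hit_rate(lang_a, lang_b, loose_match)
print(f"strict: {strict_rate:.2f}  loose: {loose_rate:.2f}")
```

Every strict match is also a loose match, so the loose rate can never come out lower. Estimating probabilities with strict_match while hunting for cognates with loose_match is exactly the mismatch described above.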

Don't the probabilities become meaningful once you look at hundreds of words, or at many language families? Well, no. A bad methodology doesn't become more respectable just by repeating it. My Quechua/Chinese bogus cognates do not merit additional respect when I add to them a few more bogus cognates from Greek, Spanish, or French.

Note that R&G's list does contain quite a few real cognates-- within families, which bulks up the list and adds to the impression of suggestive similarity without actually adding any more information. There's three Indo-European languages in the list, three Afro-Asiatic ones, three Finno-Ugric, two Dravidian, three Almosan, three Macro-Carib, four Penutian, three Hokan, and two Andean (well, Quechua and Aymara may not be related, but the two words cited certainly are). All in all there's 19 completely non-functional entries in the list, or more than half of the list.

(For that matter, the situation gets worse rather than better for R&G if recently proposed superfamilies are accepted. If Greenberg is right about Amerind, for instance, the maliq'a list is reduced to six cognates; if Nostratic is accepted, it's reduced to three.)
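The arithmetic behind that parenthesis can be checked with a few lines of Python. The group labels below are the family names used in the maliq'a list, with Proto-Amerind (which carries no family label there) treated as its own group. The membership of the two superfamilies is an assumption: Amerind is taken to cover all the American stocks in the list except Eskimo-Aleut, and Nostratic is taken to cover Afro-Asiatic, Indo-European, Finno-Ugric, and Dravidian, which is the reading that yields the figure of three. The helper independent_points is a hypothetical name for this sketch.

```python
# Family labels as they appear in the maliq'a list.
GROUPS = ["Afro-Asiatic", "Indo-European", "Finno-Ugric", "Dravidian",
          "Eskimo-Aleut", "Proto-Amerind", "Almosan", "Penutian", "Hokan",
          "Chibchan", "Andean", "Macro-Tucanoan", "Equatorial", "Macro-Carib"]

# Assumed superfamily membership (see the note above).
AMERIND = {"Proto-Amerind", "Almosan", "Penutian", "Hokan", "Chibchan",
           "Andean", "Macro-Tucanoan", "Equatorial", "Macro-Carib"}
NOSTRATIC = {"Afro-Asiatic", "Indo-European", "Finno-Ugric", "Dravidian"}

def independent_points(groups, superfamilies):
    """Count groups after collapsing each superfamily into a single one."""
    merged = set()
    for group in groups:
        label = next((name for name, members in superfamilies.items()
                      if group in members), group)
        merged.add(label)
    return len(merged)

print(independent_points(GROUPS, {}))                       # 14
print(independent_points(GROUPS, {"Amerind": AMERIND}))     # 6
print(independent_points(GROUPS, {"Amerind": AMERIND,
                                  "Nostratic": NOSTRATIC})) # 3
```

Under these assumptions the counts agree with the six and three given above: the more of the proposed groupings one accepts, the fewer independent witnesses the list has left.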

Just to ram the point into the ground, here's another list of pseudo-cognates, this time between English and Chinese (and this time using pinyin, in case anyone thought I was playing some kind of trick by using Wade-Giles above).

baba 'daddy' papa
bai 'white' fair (in color)
ban 'remove' ban
bao 'luxuriant foliage' bough
bei 'low, vulgar, mean' base
bei 'passive marker' by
beihou 'behind' behind
bengdai 'bandage' bandage
bi 'pen' bic, biro
bu 'book' book
chang 'sing' chant
chao 'stir-fry' chow
chi 'eat' chew
dadu 'bet' debt
dage ren '12 people' dozen
dai 'put on' tie 'fasten'
dan 'dawn' dawn
dao 'to' to
dei 'must' duty, due
dun 'ton' ton
er 'ear' ear
fazi 'way, means' fashion
fei 'fly' fly
feibo 'shabby, trifling' feeble
feishi 'troublesome, fussy' fussy
gang 'work collectively' gang 'group'
gei 'give' give
gouhe 'gully' gully
gu (W-G ku) 'cow' cow
guizi 'devil' ghost
guo 'pass through' go
hao 'hero' hero
hong 'hum of crowd' hum
huran 'suddenly' hurrying
ji 'mock' jeer
jiemei 'sisters' geminate
jueding 'decide' judge
junfa 'warlord' junta
kan 'read' ken
ken 'willing' can
keneng 'possible' can
kouyu 'spoken language' koine 'common language'
kuai 'fast' quick
kusi 'very similar' quasi
lazhu 'hold fast' lasso
lei 'flower bud' lei 'flower necklace'
lian 'connect' line
lianxi 'contact' link
libie 'leave' leave
long 'dragon' lion
long 'grand' long
loulie 'base, mean' lowly
luedi 'conquer' loot
ma 'mother' Ma
ma 'horse' mare
manbu 'stroll' mambo
meili 'beauty' mellifluous
meiju 'enumerate' measure
mian 'face' mien
miao 'mewing' mew, miaow
moter 'model' model
mubing 'raise troops' mobilize
mutong 'shepherd' mutton
nanti 'difficult, baffling' knotty
naiyou 'cream' mayo
pan 'plate' pan
paxiu 'shy' bashful
pei 'match, pair' pair
pei 'compensate' pay
po 'pour' pour
sha 'shark' shark
shafa 'sofa' sofa
shan 'mountain' (mountain) chain
shangai 'correct' change
shange 'folk song' song
shei 'who' she '3p fem. pron.'
shenti 'health' sanity
shezhi 'arrange, put' schedule
shechi 'shooting' shoot
shenshi 'gentleman' gentry
shi 'eat' chew
shi (pron. shr) 'true, real' sure
shi 'Mrs, Madam' she
shi 'see, examine' see
shifu 'master, expert' chief
shiming 'mission' scheme
shu 'school' school
shuo 'say, tell' show
si 'silk' silk
song 'give, send' send
songge 'ballad' song
soucha 'search' search
sunzi 'grandson' son
tamen 'they, them' them
tai 'too' too
ti 'tear' tear
tie 'stick on' tie
tou 'throw' throw
toupi 'deep' deep
wei 'weft' weft
weida 'great' wide
wen 'lukewarm' warm
yun 'iron' iron
xi 'drama, play' show
xiang 'sound' sound
xin 'suffering' sin
xinshi 'new style' ginchy
zeguo 'wetlands' soggy
zhuan 'turn' turn