How likely are chance resemblances between languages?

[ Home ] [ Top of paper ] [ Remainder of paper ]

Taking phoneme frequencies into account

This page, as an example of analysis, will consider random matches between Quechua and Chinese, taking into account the fact that phonemes don't occur with equal probability.

First, we need to decide what constitutes a phonetic match between the two languages. One way of doing this is to decide for each Quechua phoneme what Chinese phonemes we'll accept as matches. (Think of it this way: is Qu. runa a match for Ch. rén? Is chinchi a match for chong? Is chay a match for zhè?)

The criterion here is obviously phonetic similarity. We could certainly improve on this by requiring a particular phonological distance, e.g. a difference of no more than two phonetic features, such as voicing or place of articulation. The important point, as we will see, is to be clear about what we do and do not count as a match; or, if we are evaluating someone else's work, to use the same phonetic criteria they do. We might decide as follows:

Qu. Ch.
p p, b
t t, d
ch ch, zh, j, q, c, z
k k, g
s s, sh, c, z, x, zh
h h
q h, k
m m, n
n m, n, ng
ñ m, n, ng, y
l l, r
ll l, r, y
r l, r
w w, u
y y, i
a a, e, o
i i, e, y
u u, o, w
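For readers who want to experiment, the table above can be written down as a lookup structure. This is only a sketch: the names `MATCHES` and `phoneme_matches` are made up for illustration, and the sets simply transcribe the list above.

```python
# Match table transcribed from the list above:
# Quechua phoneme -> Chinese phonemes accepted as a match.
MATCHES = {
    "p": {"p", "b"},
    "t": {"t", "d"},
    "ch": {"ch", "zh", "j", "q", "c", "z"},
    "k": {"k", "g"},
    "s": {"s", "sh", "c", "z", "x", "zh"},
    "h": {"h"},
    "q": {"h", "k"},
    "m": {"m", "n"},
    "n": {"m", "n", "ng"},
    "ñ": {"m", "n", "ng", "y"},
    "l": {"l", "r"},
    "ll": {"l", "r", "y"},
    "r": {"l", "r"},
    "w": {"w", "u"},
    "y": {"y", "i"},
    "a": {"a", "e", "o"},
    "i": {"i", "e", "y"},
    "u": {"u", "o", "w"},
}


def phoneme_matches(quechua, chinese):
    """Return True if the Chinese phoneme is an accepted match for the Quechua one."""
    return chinese in MATCHES.get(quechua, set())


print(phoneme_matches("ch", "zh"))  # True
print(phoneme_matches("u", "e"))    # False
```

Writing the criteria down this explicitly is one way of staying clear about what does and does not count as a match.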

We will next need to know the frequency with which each phoneme occurs in each language, in each position of the word. This can be calculated using a simple program operating on sample texts. For Quechua we find (each figure is a percentage of the phonemes found in that position):

initial medial final
a 5.291005 25.906736 40.211640
b 2.645503 0 0
d 0 0.310881 0
g 0.529101 0.103627 0
h 5.820106 0 0
i 2.645503 8.808290 5.291005
k 14.814815 5.595855 3.174603
l 0.529101 0.414508 0
m 7.407407 4.145078 3.703704
n 1.587302 6.528497 25.396825
p 7.936508 6.010363 0
q 4.232804 3.108808 8.465608
r 4.232804 5.077720 0
s 6.349206 4.145078 2.645503
t 7.407407 6.424870 0
u 3.703704 11.398964 2.645503
w 11.111111 1.450777 0.529101
y 3.174603 4.145078 7.936508
ch 6.878307 3.108808 0
ñ 1.058201 1.243523 0
rr 0.529101 0 0
ll 2.116402 1.865285 0

And for Chinese we get:

initial medial final
a 1.400000 21.494371 7.739308
b 7.000000 1.432958 0
c 0.600000 0.102354 0
d 12.800000 1.228250 0
e 0.200000 8.904811 15.885947
f 2.000000 0.614125 0
g 3.200000 1.842375 0
h 3.400000 2.149437 0
i 0 17.195496 29.327902
j 4.600000 1.944729 0
k 2.200000 0.204708 0
l 6.000000 2.149437 0
m 2.600000 1.330604 0
n 3.800000 6.038895 11.608961
o 0.400000 7.881269 9.368635
p 1.000000 0.102354 0
q 2.000000 1.842375 0
r 0.800000 0.307062 1.629328
s 0.800000 1.023541 0
t 3.800000 1.228250 0
u 0 8.495394 12.016293
w 7.800000 0.716479 0
x 4.200000 0.614125 0
y 9.600000 0.511771 0
z 4.200000 1.023541 0
ch 2.200000 0.716479 0
ng 0 5.834186 12.016293
sh 7.800000 1.330604 0
zh 5.600000 1.740020 0

(The reader who knows Chinese may wonder how we can have medial consonants at all. The answer is that I am using Chinese lexemes, not single characters (zì), so that, for instance, Zhongguó 'China' is one word, not two.)
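The counting program itself is not shown here, but something along the following lines would produce tables of this shape. It is a sketch under several assumptions: the digraph list, the function names `segment` and `positional_frequencies`, the decision to skip one-phoneme words, and the expectation that the input is a list of romanized words without tone marks are all choices made for illustration.

```python
from collections import Counter

# Multi-letter phonemes, checked before single letters (an assumption about
# how the romanized text is segmented).
DIGRAPHS = ("ch", "ll", "rr", "ng", "sh", "zh")


def segment(word, digraphs=DIGRAPHS):
    """Split a romanized word into phonemes, treating listed digraphs as units."""
    phonemes, i = [], 0
    while i < len(word):
        if word[i:i + 2] in digraphs:
            phonemes.append(word[i:i + 2])
            i += 2
        else:
            phonemes.append(word[i])
            i += 1
    return phonemes


def positional_frequencies(words):
    """Return percentage frequencies of each phoneme by position in the word."""
    counts = {"initial": Counter(), "medial": Counter(), "final": Counter()}
    for word in words:
        phonemes = segment(word.lower())
        if len(phonemes) < 2:
            continue  # assumption: a one-phoneme word is left out of the counts
        counts["initial"][phonemes[0]] += 1
        counts["final"][phonemes[-1]] += 1
        counts["medial"].update(phonemes[1:-1])
    result = {}
    for position, counter in counts.items():
        total = sum(counter.values())
        result[position] = (
            {ph: 100.0 * n / total for ph, n in counter.items()} if total else {}
        )
    return result


freqs = positional_frequencies(["runa", "chay", "chinchi"])
print(segment("chinchi"))  # ['ch', 'i', 'n', 'ch', 'i']
```

A greedy digraph rule like this will mis-segment a word in which n and g belong to different syllables, so a real run would need a more careful treatment of word-internal boundaries.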

Now we're in a position to calculate the probability of a match. Let's start by assuming that there must be a match (within the phonetic categories established above) in all three positions: initial, medial, and final.

To calculate the probability p_i of a match in the initial, we go down the list of Quechua initials, multiplying the probability of each by the probability of finding one of its matching sounds in that same position in Chinese, and add up the results. For instance, the probability of a match on initial p is the probability of initial p in Quechua (.0794) times the probability of initial p or b in Chinese (.01 + .07 = .08), or .00635.
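The procedure just described can be sketched in a few lines. The function name `position_match_probability` is hypothetical, and the example uses only a two-row excerpt of the frequency tables (percentages divided by 100), not the full data.

```python
def position_match_probability(quechua_freq, chinese_freq, matches):
    """Sum over Quechua phonemes of P(phoneme) * P(any accepted Chinese phoneme)."""
    total = 0.0
    for phoneme, p_quechua in quechua_freq.items():
        p_chinese = sum(chinese_freq.get(c, 0.0) for c in matches.get(phoneme, ()))
        total += p_quechua * p_chinese
    return total


# Initial-position figures for p and t only, copied from the tables above.
quechua_initial = {"p": 0.07937, "t": 0.07407}
chinese_initial = {"p": 0.010, "b": 0.070, "t": 0.038, "d": 0.128}
matches = {"p": ["p", "b"], "t": ["t", "d"]}

# Contribution of Quechua initial p alone, as in the worked example.
p_only = position_match_probability({"p": 0.07937}, chinese_initial, matches)
print(round(p_only, 5))  # 0.00635
```

Running the same function over every row of the initial, medial, and final columns would give the three per-position probabilities.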

I show the full calculations below, because some of them are quite eloquent and show the value of a frequency-based approach. If you're looking for a match for a Quechua word in s-, for instance, you have a 23% chance of matching any of the sounds we've judged as similar in Chinese. You're likely to match medial -a- 38% of the time, final -a 33% of the time, and final -n 24% of the time.

(The first item on each line is the Quechua sound; it's followed by the Chinese sounds we said would be a match. The first number is the probability of the Quechua phoneme; the second is the sum of the probabilities of the matching Chinese sounds; the third is the product of the first two.)

Initials

a aeo .05291 * .020 = .00106
h h .05820 * .034 = .00198
i iey .02646 * .098 = .00259
k kg .14815 * .054 = .00800
l lr .00529 * .068 = .00036
m mn .07407 * .064 = .00474
n mn ng .01587 * .160 = .00254
p pb .07937 * .080 = .00635
q hk .04233 * .056 = .00237
r lr .04233 * .068 = .00288
s s sh c z x zh .06349 * .232 = .01473
t td .07407 * .166 = .01230
u uow .03704 * .082 = .00304
w wu .11111 * .078 = .00867
y yi .03175 * .096 = .00305
ch ch zh jqcz .06883 * .192 = .01322
ñ mn ng y .01058 * .160 = .00169
ll lry .02121 * .164 = .00348

Probability for an initial match = .09305 = 9.3%

Medials

a aeo .25907 * .3828 = .09917
i iey .08808 * .2661 = .02344
k kg .05596 * .0205 = .00114
l lr .00415 * .0246 = .00010
m mn .04145 * .0737 = .00305
n mn ng .06528 * .1320 = .00862
p pb .06010 * .0153 = .00092
q hk .03109 * .0235 = .00073
r lr .05078 * .0246 = .00125
s s sh c z x zh .04145 * .0582 = .00241
t td .06425 * .0246 = .00158
u uow .11399 * .1710 = .01949
w wu .01451 * .0921 = .00134
y yi .04145 * .1771 = .00734
ch ch zh jqcz .03109 * .0736 = .00229
ñ mn ng y .01244 * .1371 = .00170
ll lry .01865 * .0297 = .00055

Probability for a medial match = .17514 = 17.5 %

Finals

a aeo .40212 * .3299 = .13266
i iey .05291 * .4522 = .02393
k kg .03175 * 0 = 0
m mn .03704 * .116 = .00430
n mn ng .25397 * .236 = .05994
q hk .08466 * 0 = 0
s s sh c z x zh .02646 * 0 = 0
u uow .02646 * .2139 = .00566
w wu .00529 * .1202 = .00064
y yi .07937 * .2933 = .02328

Probability for a final match = .25039 = 25.0 %

So, the probability of finding a random match on a single word (with no semantic leeway) is .0931 * .1751 * .2504 = 0.0041, or 1 in 244.
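As a quick check of the arithmetic, the three per-position probabilities can be multiplied directly. Multiplying them treats the three positions as independent, which is the assumption built into the calculation above; the function name is made up for this sketch.

```python
import math


def word_match_probability(position_probabilities):
    """Multiply per-position match probabilities, treating positions as independent."""
    return math.prod(position_probabilities)


p_word = word_match_probability([0.0931, 0.1751, 0.2504])
print(round(p_word, 4))   # 0.0041
print(round(1 / p_word))  # 245
```

Inverting the unrounded product gives about 1 in 245; the figure of 1 in 244 comes from inverting the rounded value 0.0041.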

Was all that worth it?

It's worthwhile comparing this to the original seat-of-the-pants estimate (based on 14 equiprobable consonants and 5 equiprobable vowels, and allowing 3 phonetic matches per sound) of 27 in 980, or about 0.028, which is roughly 6.7 times the probability calculated above.
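The 27 in 980 figure can be reproduced with exact fractions. The sketch below assumes a consonant-vowel-consonant word shape, which is an inference from the numbers (3 x 3 x 3 over 14 x 5 x 14) rather than something stated on this page.

```python
from fractions import Fraction

# Chance of hitting one of 3 acceptable sounds among equiprobable phonemes.
consonant = Fraction(3, 14)  # 14 equiprobable consonants
vowel = Fraction(3, 5)       # 5 equiprobable vowels

# Assumed word shape: consonant, vowel, consonant.
rough_estimate = consonant * vowel * consonant
print(rough_estimate)                   # 27/980
print(round(float(rough_estimate), 4))  # 0.0276
```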

Two lessons may be drawn. First, phoneme frequency matters. Both Quechua and Chinese have very many medial a sounds, and final nasals, and initial affricates. That makes random matches involving those sounds much more likely.

Second, seemingly minor points of procedure have a huge impact on our results. We are used to situations where rough calculations do not lead us far astray. But in this area differing assumptions or methodologies lead to very different results. Very careful attention to both is warranted.

Additional types of match

We can also answer the question posed above: with the phonetic criteria as given, none of runa/rén, chinchi/chong, or chay/zhè is a match. Yet a comparer would probably set great store by each of them.

Obviously the initial-medial-final calculation is still a simplification. Quechua, for instance, can have both initial and final consonant clusters; both languages have some two-phoneme roots; and of course a vague "medial" category is not a good way of handling multisyllabic words.

We might decide to allow a Quechua medial to match either a Chinese medial or final, to catch resemblances like runa/rén and chinchi/chong. To do this we need to compute the chance that a Quechua medial matches a Chinese final, as follows. (We can skip Quechua medials for which none of the corresponding Chinese sounds can end a word.)

Medial-to-final

a aeo .25907 * .3300 = .08549
i iey .08808 * .4521 = .03982
l lr .00415 * .0163 = .00007
m mn .04145 * .1161 = .00481
n mn ng .06528 * .2363 = .01543
r lr .05078 * .0163 = .00083
u uow .11399 * .2138 = .02437
w wu .01451 * .1202 = .00174
y yi .04145 * .2933 = .01216
ñ mn ng y .01244 * .2363 = .00294
ll lry .01865 * .0163 = .00030

Probability for a medial-to-final match = .18796 = 18.8 %

This can be added to the previous medial-to-medial estimate, on the grounds that when a medial doesn't match another medial, we're giving it another chance to match a final. However, the additional chance should be discounted by the probability (30% in my sample Chinese text) that the initial and final are the same (that is, that the word is just two phonemes long). So the medial-to-medial-or-final probability is .1751 + (.1880 * .70) = .3067.
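The discounting step can be written out as a one-line formula. The function and parameter names here are hypothetical; the numbers are the ones given in the paragraph above.

```python
def medial_or_final(p_medial_to_medial, p_medial_to_final, p_two_phoneme_word):
    """Add the medial-to-final chance, discounted by the share of two-phoneme words."""
    return p_medial_to_medial + p_medial_to_final * (1.0 - p_two_phoneme_word)


p_medial_combined = medial_or_final(0.1751, 0.1880, 0.30)
print(round(p_medial_combined, 4))  # 0.3067
```

Changing the 30% figure, which comes from one sample text, shifts the combined probability noticeably, so it is worth recomputing for any other sample.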

The probability of finding a random match on a single word (no semantic leeway) can now be given as .0931 * .3067 * .2504 = 0.0071.

This estimate could be revised still further to take account of such things as metathesis (switched consonants), or Quechua's initial consonant clusters. Note that both examples allow additional matches, and thus will increase p even more.

Matching just two phonemes

We still haven't really taken account of chay/zhè (nor of runa/rén, since we decided above that u and e don't match). The probabilities calculated so far require three phonemes to match. It might be interesting to know the probability that just two phonemes match.

Since this probability is obviously going to be much higher, I don't recommend trying to combine both types of match into a single p, which would understate the difficulty of finding 3-phoneme matches and overstate that of 2-phoneme matches.

We can estimate the probability of a 2-phoneme match by multiplying the probability of a match on initials by that of a Quechua medial matching a Chinese medial or final: .0931 * .3067 = .0286, or about 1 in 35.

This could be refined by adding the probability that a Quechua final matches a Chinese medial or final, this time discounted by the probability that the Quechua medial is also the final.

An alternative approach

If you want to avoid phonetic calculations entirely, there's an alternative approach: we pick a word a in language A, then pick the word b in language B which most closely resembles it phonetically. To handle phonetic looseness, we instead pick the n words in B which most closely resemble it.

The advantage is that we don't have to mess with phonetic details or how to match the phonologies of different languages. We can proceed quickly to an estimate of how many matches we can expect to find in general between two languages.
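A rough sketch of this approach follows. It uses plain Levenshtein edit distance on spellings as a stand-in for phonetic resemblance, which is an assumption of this example and not a measure proposed here; the function names and the five-word toy lexicon are likewise invented for illustration.

```python
def edit_distance(a, b):
    """Levenshtein distance: fewest insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # delete char_a
                               current[j - 1] + 1,       # insert char_b
                               previous[j - 1] + cost))  # substitute or keep
        previous = current
    return previous[-1]


def closest_words(word, lexicon, n=1):
    """Return the n words in lexicon with the smallest edit distance to word."""
    return sorted(lexicon, key=lambda candidate: edit_distance(word, candidate))[:n]


# Toy lexicon for language B (toneless romanizations, purely illustrative).
lexicon_b = ["ren", "chong", "zhe", "shan", "gou"]
print(closest_words("runa", lexicon_b, n=1))  # ['ren']
```

A serious version would compare phoneme sequences with a feature-based distance, at which point the phonetic details this approach was meant to avoid come back in.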

The disadvantage is that this approach doesn't lend itself to evaluating other people's claims. You can picture (say) Greenberg & Ruhlen examining the n words in Tfaltik that most closely resemble maliq'a. But what is their n? To give a reasonable estimate we have to dive back into phonetic details and probabilities.
