Distributional arguments

A few days ago, I re-read the following argument by McCarthy & Prince (1993:181):

(1)	The (velar glide-final) Axininca Campa root /iraɰ/ behaves as if it were /raɰ/; that is, a single syllable as opposed to two (for the purposes of the phonology of the velar glide).
(2)	Suppose that the /i/ in /iraɰ/ (and in all /ir/-initial roots) is epenthetic, and that the monosyllabic behavior of /iraɰ/ is calculated before epenthesis applies (or however epenthetic segments are ignored).
(3)	As it turns out, /r/-initial roots are unknown in Axininca Campa, save for a single borrowing (rapisi ‘pencil’, from Spanish lápiz). This is expected if underlyingly /r/-initial roots undergo /i/-epenthesis, becoming /ir/-initial roots.
(4)	Furthermore, /ir/-initial roots are far more common than other /Vr/-initial roots. This is expected if /ir/-initial roots have two underlying sources, as opposed to only one for other /Vr/-initial roots.

(3) may already be convincing enough for some folks to believe (2) as an explanation for (1). (Note: the empirical claim in (3) is based on “an examination of [the] root lexicon of [David Payne’s (1981) The Phonology and Morphology of Axininca Campa], containing approximately 850 entries”.) I’m not going to address that here; what I’m interested in is (4), which appears to rely on the following (unstated) assumption:

(5)	Underlyingly, all segmental strings (of equal length) have equal distributions (= probabilities of occurrence).
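
To make (5) concrete, here is a minimal sketch of what it buys you in the Axininca Campa case. The epenthesis rule is the one supposed in (2); the function name, the choice of root shapes, and the count of 10 roots per underlying shape are all invented for illustration and are not taken from McCarthy & Prince.

```python
from collections import Counter

def surface_counts(underlying_counts):
    """Map underlying root-initial shapes to surface shapes.

    Assumed rule, following (2): underlying /r/-initial roots surface as
    [ir]-initial after /i/-epenthesis; all other shapes are unchanged.
    """
    surface = Counter()
    for shape, n in underlying_counts.items():
        surface["ir" if shape == "r" else shape] += n
    return surface

# Hypothetical lexicon under assumption (5): every underlying shape is
# equally frequent. The figure of 10 roots per shape is made up.
uniform = {"r": 10, "ir": 10, "ar": 10, "or": 10}
result = surface_counts(uniform)
print(result)
```

With these invented numbers, surface [ir] ends up with 20 roots against 10 each for [ar] and [or]. The two-to-one skew falls out only because the underlying counts were set equal to begin with, which is exactly what (5) assumes.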

I find this assumption to be less than convincing, though perhaps I wouldn’t have blogged about it if I hadn’t heard a talk yesterday in which a very similar (also unstated) assumption was invoked. With Geoff Pullum’s OICTIQ principle firmly in mind, I thought I’d investigate further.

Yesterday’s talk concerned facts about Mundurukú (a.k.a. Mundurucú, last discussed on Language Log here, just for your blogospheric reference). Very brief background on yesterday’s speaker: Gessiane Picanço is a UBC graduate student from Brazil who is doing some very interesting work on the phonetics, phonology, and diachrony (through comparative reconstruction) of the Mundurukú subfamily of Tupi languages. (Tupi also includes the Guaraní subfamily, with which probably more of us are familiar.) Picanço also delivered a great talk on Mundurukú phonation types at WECOL last month, in case you missed it.

Picanço’s talk yesterday (part of the research seminar at UBC Linguistics where both faculty and students present current research) was about the possible diachronic sources of some phonotactic restrictions in Mundurukú. Picanço counted the relative distributions of consonants and vowels in 1,252 CV(C) syllables (one instance per alternant per morpheme from Picanço’s fieldnotes); among other interesting facts discussed in the talk, Picanço found the facts in (6) and (7) below.

(6)	There are 37 instances of /ʃi/ sequences in Mundurukú. (That’s just a hair under 3% of the counted CV(C) syllables.)
(7)	/si/ sequences, on the other hand, are nonexistent. (One exceptional /si/ sequence exists in a borrowing, pasí ‘go for a walk’, from Portuguese passear.)

Picanço considers and rejects the synchronic analysis in (8), plausible-seeming though it may be. (See note 1 below.)

(8)	Suppose that the distribution in (6) and (7) is due to a palatalization rule, s → ʃ / __ i.

Instead, Picanço argues that the absence of /si/ in Mundurukú may be an “emergent phonotactic”, arising from the vagaries of regular sound change. Picanço offers comparative evidence between Mundurukú and Kuruaya (another Mundurukú language), showing that Mundurukú /ʃi/ has at least two diachronic sources, **/ci/ and **/ki/ (9) – but, crucially, not **/si/ (10) (where the double asterisks in the preceding denote reconstructed Proto-Mundurukú sequences).

(9)	Mundurukú	Kuruaya	gloss	(10)	Mundurukú	Kuruaya	gloss
	ʃĩn	kĩn	‘pancake’		wásə͂	osĩ	‘bird’
	ʃijáp	kidap	‘shelter’		soé-dəp	ísie	‘fish, sp.’
	taʃíp	takip	‘it’s hot’		məsə́k-ta	másik	‘manioc’
	o-ʃeé	o-kíe	‘my skin’		kosə́-da	kósi-la	‘babaçu (plant, sp.)’
	o-iʃít	we-icit	‘my younger sister’		ipádá	sípala	‘macaw, sp.’
	o-tayʃi	o-taici	‘my wife’		ipóró	sípɔrɔ	‘wild cat’
	potíp-ʃíʃí	pótip-ci	‘fish, sp.’		o-í-pik	o-si-pik	‘it burned’

Regardless of whether you’re partial to the synchronic analysis or the diachronic one, there is also the following to consider:

(11)	/ʃi/ sequences are far more common than both other /Ci/ sequences and other /ʃV/ sequences. This is expected if /ʃi/ sequences have either two underlying or historical sources, as opposed to only one for both /Ci/ sequences and /ʃV/ sequences.

Sound familiar? If not, recall (4) above. Like (4), (11) also appears to be based on the same assumption cited earlier in (5), repeated here as (12) and modified to include the diachronic analytical possibility:

(12)	Underlyingly/historically, all segmental strings (of equal length) have equal distributions (= probabilities of occurrence).

Given what (I think!) we know about the myriad factors that contribute to the inequality of surface string distributions in a given language, the assumption in (12) just seems like a non-starter to me. (This is most obvious to me in the historical case, where what we’re comparing are two sets of surface distributions; one is reconstructed, but that’s beside the point.) If I’m right, then I think that any argument based on this assumption — such as the arguments in (4) and (11) — is invalid. (Of course, both Picanço and McCarthy & Prince offer other arguments for their respective claims, which must be assessed on their own. As already noted, I have nothing in particular to say about those other arguments here, but see note 2 below.)

But I could also be missing the point; after all, the assumption in (12) is not stated in either of the works cited above (as I have already noted). The arguments in (4) and (11) are not even pursued very far by the respective authors; numbers are cited and pointed to in the relevant discussion, but that’s about it. Note that in addition to (12), I’m also roughly inferring the “two underlying/historical sources vs. one” thing in (4) and (11); neither Picanço nor McCarthy & Prince invoke any such numbers in this context. I recognize that in both cases things are more complicated than two vs. one, and that the distributional numbers are not expected to correlate exactly with the relative numbers of underlying/historical sources — but my point is, do we even expect them to correlate somewhat, or for any correlation we may find to be a positive indication of one-way or mutual influence?

Further facts to consider:

(i)	There are 35 instances of /sə/ sequences in Mundurukú. (About 2.8% of the counted syllables.)
(ii)	/ʃə/ sequences are, again, nonexistent. (With no exceptions this time.)
(iii)	To complete the picture, /ʃV/ and /sV/ sequences are otherwise (more-or-less) contrastive: /ʃe/ = 17 (1.36%), /se/ = 11 (0.88%); /ʃa/ = 8 (0.64%), /sa/ = 12 (0.96%); /ʃo/ = 2 (0.16%), /so/ = 13 (1.04%).
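
For anyone who wants to check the arithmetic, the percentages in (6) and (i)–(iii) can be recomputed from the raw counts and the total of 1,252 CV(C) syllables. This is just a sketch: the variable and function names are mine, and the zero for /si/ reflects the native vocabulary only (the borrowing pasí is left out, which is my assumption about how it was counted).

```python
TOTAL_SYLLABLES = 1252  # CV(C) syllables counted, as reported in the post

# Counts as reported in (6), (7), and (i)-(iii).
# Assumption: /si/ is entered as 0, setting aside the borrowing "pasí".
counts = {
    "ʃi": 37, "si": 0,
    "ʃə": 0, "sə": 35,
    "ʃe": 17, "se": 11,
    "ʃa": 8, "sa": 12,
    "ʃo": 2, "so": 13,
}

def percent(n, total=TOTAL_SYLLABLES):
    """Share of the counted syllables, as a percentage rounded to 2 places."""
    return round(100 * n / total, 2)

shares = {seq: percent(n) for seq, n in counts.items()}
for seq, share in shares.items():
    print(f"/{seq}/: {counts[seq]:>2} ({share}%)")
```

The recomputed shares agree with the figures cited: /ʃi/ comes out at 2.96% (the “hair under 3%”), and /sə/ at about 2.8%.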

One is thus tempted to say that synchronic phonological processes exist that neutralize the contrast between /ʃ/ and /s/ in favor of /ʃ/ before /i/ and in favor of /s/ before /ə/. In support of at least half of this line of attack, Picanço has found a single form: i-poʃí ‘it’s heavy’, when suffixed with the partially reduplicative suffix -Cə́ ‘not so’, becomes i-poʃí-sə́ ‘it’s not so heavy’, not *i-poʃí-ʃə́. One could either assume that the root here is /poʃí/ or /posí/, the latter undergoing /s/ → ʃ / __ i, and that the ‘not so’ suffix surfaces with /s/ either due to /ʃ/ → s / __ ə or because the root has /s/ to begin with.

2.	Note that I’ve been putting aside the undoubtedly important distinction between small word-list counts (such as those of both Picanço and McCarthy & Prince) and larger, natural corpus counts (which may or may not even exist for these two languages).

5 thoughts on “Distributional arguments”

Bob Kennedy December 6, 2004 at 4:40 pm

The more I think about it, the more it seems like (5) is not the only possible (unstated) assumption behind (4).

Here’s an alternative assumption: suppose, generally, C-initial stems are more frequent than V-initial stems. Suppose the ratio is actually such that for any C and V, CV- is more frequent (underlyingly) than V-. That is, /ba…/ is more frequent than /a…/, /da…/ is more frequent than /a…/, etc., through all the consonants and vowels.

In this system, /ra…/ would be more frequent than /a…/, and /ri…/ more frequent than /i…/. But these (relatively higher-frequency) CV strings come out as [ira…] and [iri…]. Given that /rV…/ by itself is more frequent underlyingly than /V…/, the higher surface inventory of [ir…] relative to other [Vr…] strings seems to follow.

The assumption in (5) is that all segment string pairings have equal underlying probabilities. So the surface frequency of [ir…] is a sum of the (roughly equal) frequency of /r…/ and /ir…/, which is greater than the (as equal) frequency of /ar…/, /or…/, etc.

The assumption I just made is different: the surface frequency of [ir…] is a sum of the low frequency of /ir…/ and the high frequency of /r…/, which is greater than the low frequency of /ar…/, /or…/, etc.

The upshot is that the lexicon can have a skewed distribution rather than an even one, and the claim in (4) (or one like it) can still follow. At least, this seems to be the case for the Axininca Campa data.

John McCarthy January 9, 2005 at 8:32 am

The question raised by Bob Kennedy is already addressed in McCarthy & Prince (1993: 171):
“… roots with initial ir are surprisingly common: of 60 i-initial roots, 11 begin with ir … (As a control, consider the fact that, of 90 a-initial roots only 2 begin with ar, and of 68 o-initial roots, none at all begin with or.)” Axininca has no other vowels.
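
The proportions behind the quoted control comparison are easy to work out. The snippet below is only a sketch using the three pairs of figures in the quotation; the names are hypothetical and nothing here goes beyond simple division.

```python
# (vowel, roots beginning with that vowel, of which begin with vowel + r),
# taken from the passage quoted above.
quoted = [("i", 60, 11), ("a", 90, 2), ("o", 68, 0)]

def vr_share(total, with_r):
    """Percentage of V-initial roots that are Vr-initial, to one decimal."""
    return round(100 * with_r / total, 1)

shares = {v: vr_share(total, with_r) for v, total, with_r in quoted}
for v, share in shares.items():
    print(f"{v}r-initial among {v}-initial roots: {share}%")
```

That is roughly 18.3% for ir against 2.2% for ar and 0% for or. The comparison is within each vowel's own set of roots, so it does not by itself depend on i-, a-, and o-initial roots being equally numerous.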
Pingback: phonoloblog » More distributional arguments
Pingback: phonoloblog » Still more distributional arguments
Pingback: phonoloblog»Blog Archive » Distributional arguments noch einmal

phonoloblog

all things phonology | quote.ucsd.edu/phonoloblog
