Distributional arguments noch einmal

This is what I get for reading a table of contents announcement on LINGUIST List — specifically, for Journal of Linguistics 43.3 (here’s the link to the actual issue, in case you have access).

I got specifically interested in the Notes and Discussion section, where there are two articles: Dick Hudson’s “Inherent variability and Minimalism: Comments on Adger’s ‘Combinatorial variability’” and David Adger’s “Variability and modularity: A response to Hudson”. (Adger’s “Combinatorial Variability” is in JLing 42.3.)

Wait (I hear you say) — this is phonoloblog, not morphosyntactoblog (or whatever). Why am I interested in what Hudson has to say about Adger and vice-versa? Well, some of Hudson’s comments echo something I’ve brought up here a few times before, and the exchange between Hudson and Adger bears directly on some current work in phonology; specifically, some of the work that addresses variation.

The Hudson and Adger exchange is short (12 pp. and 6 pp., respectively), and what I’m interested in discussing here does not necessarily require reading Adger’s original article (28 pp.). Let me quickly summarize what is most relevant to this post.

Adger (2006) is a “plausibility argument for a new way of thinking about intra-personal morphosyntactic variation” (p. 503). (Note: Adger’s overall approach doesn’t sound so “new” to me, but I suppose it depends on how narrowly you construe the content of that “about” phrase.) In short, Adger argues that an observed 2:1 ratio of occurrence between two morphosyntactic forms (you/we was and you/we were) in Buckie English (the vernacular of a small, isolated Scottish community, as described in Jennifer Smith’s 2000 U. of York dissertation) is the result of a lexicon in which there are two morphosyntactic items that both happen to spell out as was but only one that happens to spell out as were. The trick is this assumption:

“[I]f there is a random choice of which [lexical item x, y, or z] is entered into the system, then we should find x, y and z in equal proportions. However, if some of the PF outputs of the lexical items are the same, we predict a disproportionality in the final output […] [For example, suppose there are] two ways that the grammar can output an x, but only one way to make a z. We therefore predict a statistical variance in the output, such that we will find x more often than z.” (p. 510).

Herein lies the connection with my previous posts, in which I questioned the legitimacy of this kind of assumption in a few different contexts in the phonological literature.

  1. Distributional arguments (12/2/2004)
  2. More distributional arguments (1/17/2005)
  3. Still more distributional arguments (10/14/2005)
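
To make the arithmetic of Adger's assumption concrete, here is a minimal Python sketch. It assumes a pool of three lexical items that compete in one context, chosen with equal probability, two of which are spelled out alike. The item names (`item_x`, `item_y`, `item_z`) and the function name are hypothetical labels for illustration, not Adger's notation.

```python
from collections import Counter
from fractions import Fraction


def surface_distribution(pool):
    """Return surface form -> probability, assuming each lexical item
    in the pool is equally likely to be chosen."""
    counts = Counter(form for _item, form in pool)
    total = len(pool)
    return {form: Fraction(n, total) for form, n in counts.items()}


# Hypothetical pool for one context (say, "you ___"):
# three distinct lexical items, two of which spell out as "was".
pool = [("item_x", "was"), ("item_y", "was"), ("item_z", "were")]

dist = surface_distribution(pool)
for form, prob in dist.items():
    print(form, prob)  # was 2/3, then were 1/3
```

The 2:1 ratio falls out of nothing more than counting homophonous items, which is why everything hinges on whether the equal-probability premise is justified.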

Hudson (2007: sec. 2.3, pp. 689-691) also questions this assumption, and offers good reasons for seriously doubting its validity. It’s worth reading the section in full, but I quote here the most salient passages.

The explanation is ingenious, but no evidence is offered for the underlying assumption that every lexical item has an equal chance of being used, which predicts that whenever two items share the same meaning, they should each have about 50% of the total usage. Common experience suggests that this is not so; for example, pairs of synonyms like try and attempt (as in try/attempt to open the door) offer speakers a lexical choice, but stylistic differences strongly favour try in ordinary casual conversation. Research evidence supports this conclusion. […] This conclusion is typical of findings in quantitative sociolinguistics, where the data normally show context-sensitive bias in favour of one of two synonymous alternatives […]. Adger’s theory therefore rests on the unsupported assertion that in general lexical choices are random: ‘I have assumed that there is a random choice of lexical items (that is, that there is an equal probability that any of the three lexical items is chosen)’ (p. 511). […]
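
Hudson's objection can be put numerically. The following is a minimal sketch, assuming hypothetical selection weights on the same kind of three-item pool; the weights are invented for illustration and come from neither Hudson nor Adger.

```python
from fractions import Fraction


def weighted_surface_distribution(pool):
    """pool: list of (item, surface_form, integer_weight).
    Return surface form -> probability when items are chosen in
    proportion to their weights rather than uniformly."""
    total = sum(weight for _item, _form, weight in pool)
    dist = {}
    for _item, form, weight in pool:
        dist[form] = dist.get(form, Fraction(0)) + Fraction(weight, total)
    return dist


# Equal weights reproduce the 2:1 prediction.
uniform_pool = [("item_x", "was", 1), ("item_y", "was", 1), ("item_z", "were", 1)]

# Hypothetical bias: the single "were" item is twice as likely to be
# picked as either "was" item (e.g. because of style or frequency).
biased_pool = [("item_x", "was", 1), ("item_y", "was", 1), ("item_z", "were", 2)]

uniform_dist = weighted_surface_distribution(uniform_pool)
biased_dist = weighted_surface_distribution(biased_pool)
print(uniform_dist["was"], uniform_dist["were"])  # 2/3 1/3
print(biased_dist["was"], biased_dist["were"])    # 1/2 1/2
```

With even a modest bias on one item, the homophony-driven 2:1 ratio disappears, so the prediction is only as good as the assumption that the weights are equal.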

Adger’s (2007: 699) defense of the relevant assumption is limited to a couple of paragraphs, slightly modified here to better fit my truncation of the Hudson passage above, and emphasis added.

The grammar G predicts n variants for a particular meaning with a uniform probability distribution. This uniform distribution of variants does not predict a uniform distribution of phonological forms, since the phonological forms are not themselves always evenly distributed over the variants. Moreover, in any particular speech event, the speaker’s choice of variant will be given by [the performance choice function] U, which is sensitive to speaker-internal properties such as intention, processing and memory and to (ultimately internalised) properties of the utterance context, such as who the interlocutor is and what conversation has gone before. Across sociolinguistically categorised groupings we may (or may not) see emergent patterns of higher or lower frequencies for particular forms. Collating the data together is one way to empirically bring out the effect of the uneven distribution of phonological forms over variants predicted by the theory. […] I assume that every lexical item has an equal chance of being used; that is, I assume a uniform probability distribution in the set of variants. [Hudson] claims that common experience shows this is not so […]. But this argument is backwards. It is an empirical finding that the distribution of [e.g. try and attempt] is non-uniform, and a departure from the null hypothesis. As is well known in Bayesian probability theory, it is crucial to assume a prior probability distribution and I simply assumed a uniform distribution, the null hypothesis; it would have been a curious move to assume anything else. That this assumption led to the correct predictions is itself an interesting finding.

In the underlined hedge in this passage, Adger acknowledges, but at the same time essentially dismisses, the influence of non-linguistic factors on lexical choice. A similar hedge makes a brief appearance in Adger (2006: 511):

Choice of a lexical item by a speaker in any particular utterance is potentially influenced by social and/or psychological factors, so that a particular lexical item may have a higher probability of being chosen in a particular utterance (for example, if that lexical item has been recently accessed, it may be easier to access again; or if a lexical item is simply more frequent overall, it may be easier to access). […] Assuming we can, in fact, control for input probabilities, what we have seen here is that the combinatorics of the syntactic system itself, working on the featural specifications of lexical items, predicts not only variability, but also particular frequencies of surface variants.

Popping back to Adger (2007: 699), we see this justification for the hedge:

[In Adger 2006] I explicitly discuss the fact that various factors will impact directly on the use of a particular variant in any speech event. My question was whether one could see a general pattern emerging when these factors were controlled for, and my suggestion was that such a pattern could be attributed to the structure of the pool of variants, and hence ultimately to the grammar. I argued that this was precisely what happened when we look at the patterns as a whole (see p. 527, where this is discussed).

But it’s also worth quoting what Adger (2006) says on p. 527 (in a single paragraph right before the “Conclusions and Implications” section):

The extra assumption I am making is that every community member will have the same grammar and that it is legitimate to collapse the data from a number of individuals into a single analysis. I think that this assumption is reasonably motivated by the fact that the general patterns seen across individuals hold, for the most part, within a single individual’s data (for example, all individuals have a categorical/variable split exactly as described here). However, it is true that there just isn’t enough data to be sure that the detailed FREQUENCY effects discussed here actually hold for every individual. This is a shortcoming of the analysis which I am aware of, and it is the reason that I offer this analysis as a plausibility argument rather than as a detailed empirical study.

My view (and Hudson’s, as I read it) is that Adger’s “plausibility argument” is founded on at least the following four assumptions, all but #3 of which are problematic (or at least questionable):

  1. Assumption 1. Individual grammars are significantly responsible for the average distribution of forms across a (sample of) a speech community.
  2. Assumption 2. Forms are evenly distributed in our mental lexicons in some significant sense.
  3. Assumption 3. There is lexical homophony (specifically, the was of I was and the was of you was are different lexical items that happen to both be spelled out as was, while the were of we were and of you were are one and the same.)
  4. Assumption 4. The “average distribution of forms across a (speech sample of) an entire community” of Assumption 1 significantly reflects Assumptions 2 and 3.

[Note, 4pm: Assumption 1 has been re-written since I posted this earlier today.]

Looking at it this way, these four assumptions pretty straightforwardly map onto assumptions made in e.g. work on variation by Arto Anttila. Those assumptions, as I understand them, are as follows.

  1. Assumption 1. Individual grammars are significantly responsible for the average distribution of forms across a (sample of) a speech community.
  2. Assumption 2. The members of the set of totally-ordered constraint rankings consistent with a given partial order are evenly distributed in our mental grammars in some significant sense.
  3. Assumption 3. Some different total orders of constraints result in identical surface forms.
  4. Assumption 4. The “average distribution of forms across a (speech sample of) an entire community” of Assumption 1 significantly reflects Assumptions 2 and 3.
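
The ranking-based version of Assumptions 2 and 3 can be sketched the same way. The Python below is a toy illustration under stated assumptions: the constraint names `A`, `B`, `C`, the partial order, and the rule in `toy_winner` that decides the output are all hypothetical, and are not taken from Anttila's analyses.

```python
from collections import Counter
from itertools import permutations


def total_orders(constraints, partial_order):
    """Yield every total ranking of the constraints that respects each
    (higher, lower) pair in partial_order."""
    for ranking in permutations(constraints):
        position = {c: i for i, c in enumerate(ranking)}
        if all(position[hi] < position[lo] for hi, lo in partial_order):
            yield ranking


def toy_winner(ranking):
    """Hypothetical evaluation: the top-ranked constraint alone decides
    the output. C on top yields form_y; anything else yields form_x."""
    return "form_y" if ranking[0] == "C" else "form_x"


constraints = ["A", "B", "C"]
partial_order = [("A", "B")]  # only A >> B is fixed; C is unranked

rankings = list(total_orders(constraints, partial_order))
counts = Counter(toy_winner(r) for r in rankings)
print(rankings)  # the three total orders with A above B
print(counts)    # form_x wins under two of them, form_y under one
```

If each consistent total order is taken to be equally likely (Assumption 2) and several of them converge on the same output (Assumption 3), the predicted 2:1 split is again just a count.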

Note that Assumptions 1 and 4 are identical to Adger’s assumptions, whereas Assumptions 2 and 3 are modified only slightly to accommodate the different basic tools of analysis, rankings vs. lexical forms. I’ve already noted that I think Adger’s Assumption 3 is not particularly questionable, and I feel the same way about Anttila’s corresponding assumption. But Assumptions 1 and 2 (and by implication, Assumption 4) are no safer, in my view, in Anttila’s theory than they are in Adger’s. Does anyone care to defend these assumptions further, or to explain to me that I’ve got the assumptions all wrong or something? I’d love some discussion of all this.

7 thoughts on “Distributional arguments noch einmal”

  1. Lucien

    I think Assumption 1 is something more like “are significantly reflective of”. If your data sample is drawn from approximately the same distribution as the speakers’ linguistic input, then the variation from individual grammars should be comparable to variation among speakers. This is probably a better assumption for receptive grammar than for productive grammar.

  2. Marc

    I’ve always found assumption 1 in Anttila’s (1997) approach to variation tricky for the same reasons Hudson objects to Adger. For me, it comes down to the fact that variation is rarely, if ever, “free”. Sociolinguistics has shown us over and over that some aggregate rate of variation for some language glosses over more precise measures that we could obtain for smaller speech communities – conceivably all the way down to the rate of variation for some individual in some particular social context. Plus, we know that frequency has a strong impact on variation (Bybee et seq.). These factors are what make Anttila’s quantitative results for Finnish even more amazing. I don’t know, however, if his approach has been replicated consistently for other examples. Ash and Asudeh (2002) also get into some of these issues – very incisively IMHO – for both syntax and phonology.

    So, if myriad “extra”-linguistic factors condition variation as much as syntactic/phonological factors and we endeavor to include variation in our grammars, the question becomes, should we be including all of these factors as constraints in our grammars?

    The timing of your post is excellent as there was a good talk at NELS just this past Friday addressing just that. The answer given by Andries Coetzee is, yes, at least as far as frequency is concerned (with the door being presumably open to other factors). He suggested that frequency determines what probability distribution should be assigned to a particular constraint. This distribution allows the constraint to range over some other constraint, producing different rates of variation for different lexical items. This seems like a great step in the right direction of accounting for rates of variation using grammar, rather than shunting it off to “performance”.

  3. Eric Bakovic

    Thanks for the comments, Lucien and Marc. For those who missed it, here’s Coetzee’s abstract. And I’m sure Marc meant Frank Keller & Ash Asudeh (2002), which appeared in LI 33.2, pp. 225-244 (abstract, .pdf).

    I’m still wondering if there’s really a defense for Assumption 1. I understand what Lucien is saying — a large sample may correlate with what individuals receive as input — and I’m strongly inclined to agree with Lucien that “[t]his is probably a better assumption for receptive grammar than for productive grammar.” (Though what that ends up cashing out as, exactly, is unclear to me.)

    It’s been a while since I read Keller & Asudeh’s paper, and I look forward to re-reading it in light of all this. But just skimming through it quickly, I came across this relevant passage on p. 240:

    In our view the right way of conceptualizing the difference between frequency and gradient grammaticality follows from basic assumptions about competence and performance advocated by Chomsky (1965, 1981, 1995) and many others (for a review see Schütze 1996). The frequency of occurrence of a structure has to do with how the speaker processes this structure and is therefore a performance phenomenon. The degree of grammaticality of a structure, on the other hand, has to do with the speaker’s knowledge of language and is therefore part of linguistic competence.

  4. David Adger

    Just a brief comment on Eric’s decomposition of my argument. I think assumption 1 might be nicely defended by the work on acquisition of variable input in the dialect concerned by Jen Smith (journals.cambridge.org/production/action/cjoGetFulltext?fulltextid=658668). This work shows that kids track the frequencies of input rather directly in their output. I mention this work as a problem for Hudson’s story in my 2007 note.

    One thing Eric misses out is that the ‘pool of variants’ from which the choice is made of a particular variant on any occasion of use is determined by a general algorithm, rather than by stipulation. This algorithm predicts the homonymy of elements of the pool of variants and my only claim is that this will have a significant impact on the final probability that a particular surface form (not lexical item) will appear.

    I’m intrigued to know of other approaches to morphosyntactic variation which are the same as mine. I realise that Anttila’s model and mine share a number of architectural assumptions, but there are differences. Note that I say that it’s a new way of thinking about MORPHOSYNTACTIC variation, and I refer to Anttila’s work. Further, he doesn’t apply it to morphosyntax, and I think that if you did apply Anttila’s approach to morphosyntax, you’d have something significantly different to mine. He essentially proposes a lattice of grammars, rather than a single grammar, as I do. I think the issues in morphosyntax and in phonology are interestingly different, and approaches to this problem in morphosyntax have almost always either denied modularity (like Hudson) or defended multiple grammars (like Kroch). This is because the notion of variation in syntax is more problematic than in phonology, given that many cases end up correlating with semantic differences, and it is not clear how to calculate whether these should ‘count’. See the discussion in the socio literature between Lavendera, Labov, Romaine, Cheshire etc in the 80s.

  5. Eric Bakovic

    Thanks, David, for contributing to this discussion. I’m particularly eager to look closely at the Smith, Durham & Fortune article you cite to see more evidence for Assumption 1, though I’m curious why your 2007 note does not cite this article in support of this assumption, but rather in the context of an argument about how the categorical ungrammaticality of *they was must “be stated as a grammatical fact” that you doubt a “usage-based account [can] capture” (p. 698).

    Interestingly, this connects with your comment about your algorithm (stated in (27) on p. 518 of Adger (2006), for those following along — you’re right to note that I didn’t bother to mention the algorithm, but I didn’t say or imply that you had stipulated your lexical entries). I followed your algorithm (with some difficulty, I might add) using the following hypothetical input, an input just like the Buckie facts in (34) of your 2006 article, except that “they was/were” is variable.

    I was              we was/were
    you was/were       you was/were
    he/she/it was      they was/were

    The result is the following lexicon, which by your account predicts a 2-to-1 ratio between was and were for we and you, but 50/50 odds of was and were for they.

    [usg:+] was        [uprt:-] was
    [usg:-] were       [uau:+] was
    [uprt:+] was       [uau:-] were

    Does your account capture the “grammatical fact” of *they was? No, except in what is essentially a usage-based manner: learners of Buckie never hear “they was”, which guides them to the conclusion (via the algorithm that creates the lexical items) that it is ungrammatical.

    Finally, a minor correction to a point in your third paragraph: Anttila proposes a lattice of rankings. You (or Anttila) can assume that different rankings are different grammars if you want, but what “significantly different” claims does this assumption make compared to the assumption that the entire lattice of rankings is a “single grammar”? I’m honestly curious about what difference this makes.

  6. David Adger

    I didn’t cite that article in support of your assumption 1, because I didn’t decompose my own argument in the way you did, perhaps wrongly. I was rather trying to address the points that Hudson makes.

    You’re right that my algorithm behaves just in the way that you say. Now, it turns out that there’s another variable agreement process in Buckie that interacts with was/were. In present tense of all verbs, third person plural full DPs give rise to variable agreement (so you get The mothers are/is roaring; The boats sink/sinks; etc – NB, these are just made up examples since I don’t have the corpus to hand, but the pattern is right). However, the same fact holds: the third person plural pronoun categorically appears with the plural agreement (*they is roaring; *They sinks) – so there’s a grammatical fact that goes beyond what the kids hear. My algorithm does, in fact, capture these grammaticality patterns correctly. Now, there’s no serious poverty of stimulus question here, so I think it’s probably true that a usage based account can capture these patterns, if it is augmented by a stipulation about the morphological decomposition of verbs. Interestingly, though, these patterns also hold when the pronoun is non-adjacent to the verb (e.g. in VP coordination structures or in cases where adjacency is disrupted by an adverb), which isn’t captured on a usage based account without yet more stipulations. All of this is straightforwardly handled by stating this stuff as a grammatical fact. I gave a paper about this in York recently, and should have a worked up article fairly soon.

    Finally, this is probably my ignorance of current phonological theory, but I thought that a ranking of constraints was what constituted a grammar.

    There’s another interesting issue in morphosyntax about what constitutes a grammar, if a grammar is universal principles plus a lexicon, and that’s the issue of whether there are doublets in the lexicon (see Kroch’s 1994 contribution on this). Some theories elevate this to an architectural principle which does explanatory work for them (e.g. the subset principle as used by Halle and others in DM). If you do this, then I think that this really does give you different grammars in a meaningful sense.

  7. Matt Goldrick

    This is really, really late to the discussion–which means it probably won’t get read–but I wanted to register my discomfort with the assumption of equiprobability of underlying forms. The general problem that is avoided by proponents of stochastic grammars is one of ‘density estimation’–figuring out the probability distribution of (unobserved) underlying forms (this issue is mentioned in passing in this paper by Chris Manning). In fact this issue is considerably more complex than most density estimation problems attacked in other domains, since in this case there is not only an unknown underlying distribution but also an unknown stochastic process that relates underlying to surface forms.

    The one time I actually tried to confront this problem was with Lisa Davidson. The project was on syntax acquisition, quite outside our domain of expertise. In this case, we used the distribution of adult productions to estimate the distribution of underlying input forms (making the strong assumption that adult utterances were essentially all faithful expressions of the input). We then calculated ranking probabilities after factoring out the effect of differences in the probability of these different inputs. See p. 10-14 of this paper for details.

    I certainly don’t think this is the most sophisticated example of this type of work–I’d be interested in reading any other work that has attempted to address this issue.
