Faculty of Language: Where the estimable Mark Johnson corrects Norbert (sort of)

Thursday, April 23, 2015

Where the estimable Mark Johnson corrects Norbert (sort of)

For various reasons, Mark J could not post this as a comment on this post. As he knows much more than I do about these matters, I thought it a public service to lift these remarks from the comments section to make them more visible. I think that this is worth reading in conjunction with Charles' recent post (here). At any rate, this is interesting stuff and I don't disagree much (or have not found reasons to disagree much) with what Mark J says below. I will, of course, allow myself some comments later. Thx Mark.

***

This was originally written as a comment for the "Bayes and Gigerenzer" post, but a combination of a length restriction on comments and my university's not enabling blog posts from our accounts meant I had to email Norbert directly.

As Norbert has remarked, Bayesian approaches are often conflated with strong empiricist approaches, and I think this post does that too. But even within a Bayesian approach, there are powerful reasons not to be a "tabula rasa" empiricist. The "bias-variance dilemma" is a mathematical statement of something I've seen Norbert say in this blog: learning only works when the hypothesis space is constrained. In mathematical terms, you can characterise a learning problem in terms of its bias -- the range of hypotheses being considered -- and the variance or uncertainty with which you can identify the correct hypothesis. There's a mathematical theorem that says that in general as the bias goes down (i.e., the class of hypotheses increases) the variance increases.
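For readers who want the formal statement, here is the textbook version of the decomposition, assuming squared-error loss and data generated as y = f(x) + noise with noise variance \sigma^2. These assumptions and the symbols (f, \hat f_D, \sigma^2) are not in the post, which states the result only informally. The expectation is over training sets D.

```latex
\mathbb{E}_D\!\left[\bigl(y - \hat f_D(x)\bigr)^2\right]
  = \underbrace{\bigl(\mathbb{E}_D[\hat f_D(x)] - f(x)\bigr)^2}_{\text{bias}^2}
  + \underbrace{\mathbb{E}_D\!\left[\bigl(\hat f_D(x) - \mathbb{E}_D[\hat f_D(x)]\bigr)^2\right]}_{\text{variance}}
  + \underbrace{\sigma^2}_{\text{noise}}
```

In this form, the bias term measures the systematic error that comes from restricting the hypothesis class, and the variance term measures how much the learned hypothesis changes from one data set to another. Enlarging the class typically shrinks the first term and inflates the second, which is the trade-off described above.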

Given this, I think a very reasonable approach is to formulate a model that includes as much relevant information from universal grammar as we can put into it, and perform inference that is as close to optimal as we can achieve from data that is as close as possible to what the child receives. I think this ought to be every generative linguist's baseline model of acquisition! Even with an incomplete model and incomplete data, we can obtain results of the kind "innate knowledge X plus data Y can yield knowledge Z".

But for some strange (I suspect largely historical) reason, this is not how Chomskyan linguists think of computational models of language acquisition. Instead, they prefer ad hoc procedural models. Everyone agrees there has to be some kind of algorithm which children use to learn language. I know there are lots of psychologists who are sure they have a good idea of the kinds of things kids can and can't do, but I suspect nobody really has the faintest idea of what algorithms are "cognitively plausible". We have little idea of how neural circuitry computes, especially over the kinds of hierarchical representations we know are involved in language. Algorithms which can be coded up as short computer programs (which is what most people have in mind when they say simple) might turn out to be neurally complex, while we know that the massive number of computational elements in the brain enables it to solve computationally very complex problems. In vision -- a domain we can sort-of study because we can stick electrodes into animals' brains -- it turns out that the image processing algorithms implemented in brains are actually very sophisticated and close to Bayes-optimal, backed up with an incredible amount of processing power. Why not start with the default assumption that the same is true of language?

It's true that in word segmentation, a simple ad hoc procedure -- a simple greedy learning algorithm that ignores all inter-word dependencies -- actually does a "reasonable job" of segmentation, and that either improving the algorithm's search procedure or making it track more complex dependencies actually decreases the overall word segmentation accuracy. Here I think Ben Borschinger's comment has it right - sometimes a suboptimal algorithm can correct for the errors of a deficient model if the errors of each go in opposite directions. We've known since at least Goldwater's work that an inaccurate "unigram" model that ignores inter-word interactions will prefer to find multi-word collocations and hence under-segment. On the other hand, a naive greedy search procedure tends to over-segment, i.e., find word boundaries where there are none. Because the unigram model under-segments, while a naive greedy algorithm over-segments, the combination actually does better than approaches where you improve only the search procedure or only the model (by incorporating inter-word dependencies), since then you have "uncancelled errors".
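To make "ignores inter-word dependencies" concrete, here is a minimal sketch of how a unigram model scores segmentations. It is not Goldwater's model or the greedy learner discussed above: it assumes the word probabilities are already known and only searches for the best segmentation. The function name, the two toy lexicons, and all the probabilities are hypothetical.

```python
import math

def best_segmentation(text, word_probs):
    """Most probable segmentation of `text` under a unigram model.

    A segmentation's probability is the product of its word probabilities,
    so no word depends on its neighbours. Returns None if no segmentation
    uses only words from `word_probs`.
    """
    n = len(text)
    # best[i] = (log-probability of the best segmentation of text[:i],
    #            start index of the last word in that segmentation)
    best = [(-math.inf, None)] * (n + 1)
    best[0] = (0.0, None)
    for end in range(1, n + 1):
        for start in range(end):
            word = text[start:end]
            if word in word_probs and best[start][0] > -math.inf:
                score = best[start][0] + math.log(word_probs[word])
                if score > best[end][0]:
                    best[end] = (score, start)
    if best[n][0] == -math.inf:
        return None
    # Walk the back-pointers from the end of the string to recover the words.
    words = []
    end = n
    while end > 0:
        start = best[end][1]
        words.append(text[start:end])
        end = start
    return list(reversed(words))

# Hypothetical lexicons. In the first, the collocation "thedog" is frequent
# enough (0.1) to beat the product of its parts (0.3 * 0.2 = 0.06).
LEXICON_WITH_COLLOCATION = {"the": 0.3, "dog": 0.2, "thedog": 0.1, "a": 0.4}
# In the second, the collocation is rare (0.01 < 0.35 * 0.25).
LEXICON_RARE_COLLOCATION = {"the": 0.35, "dog": 0.25, "thedog": 0.01, "a": 0.39}

undersegmented = best_segmentation("thedog", LEXICON_WITH_COLLOCATION)
segmented = best_segmentation("thedog", LEXICON_RARE_COLLOCATION)
```

With the first lexicon the model keeps "thedog" as one unit. A unigram model has no way to say that "dog" is likely after "the", so a frequent word pair is better explained as a single lexical item, which is one way to see where the under-segmentation bias comes from.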

Of course it's logically possible that children use ad hoc learning procedures that rely on this kind of "happy coincidence", but I think it's unlikely for several reasons.

First, these procedures are ad hoc -- there is no theory, no principled reason why they should work. Their main claim to fame is that they are simple, but there are lots of other "simple" procedures that don't actually solve the problem at hand (here, learn words). We know that they work (sort of) because we've tried them and checked that they do. But a child has no way of knowing that this simple procedure works while this other one doesn't, so the procedure would need to be innately associated with the word learning task. This raises Darwin's problem for the ad hoc algorithm (as well as other related problems: if the learning procedure is really innately specified, then we ought to see dissociation disorders in acquisition, where the child's knowledge of language is fine, but their word learning algorithm is damaged somehow).

Second, ad hoc procedures like these only partially solve the problem, and there's usually no clear way to extend them to solve the problem fully, so some other learning mechanism will be required anyway. For example, the unigram+greedy approach can find around 3/4 of tokens and 1/2 of the types, and there's no obvious way to extend it so it learns all the tokens and all the types. But children do eventually learn all the tokens and all the types, and we'll need another procedure for doing this. Note that the Bayesian approach that relies on more complex models does have an account here, even though it currently involves "wishful thinking": as the models become more accurate by including more linguistic phenomena and the search procedures become more accurate, the word segmentation accuracy continues to improve. We don't know how to build models that include even a fraction of the linguistic knowledge of a 3 year old, but the hope is that eventually these models would achieve perfect word segmentation, and indeed, be capable of learning all of a language. In other words, there isn't a plausible path by which the ad hoc approach would generalise to learning all of a language, while there is a plausible path for the Bayesian approach that relies on more and more accurate linguistic models.

Finally -- and I find it strange to be saying this to a linguist who is otherwise providing very cogent arguments for linguistic structure -- there really are linguistic structures and linguistic dependencies, and it seems weird to assume that children use a learning procedure that just plain ignores them. Maybe there is a stage where children think language consists of isolated words (this is basically what a unigram model assumes), and the child only hypothesises larger linguistic structures after some "maturation" period. But our work shows that you don't need to assume this; instead, a single model that does incorporate these dependencies combined with a more effective search procedure actually learns words from scratch more accurately than the ad hoc procedures.

Norbert sometimes seems very sure he knows what aspects of language have to be innate. I'm much less sure myself of just what has to be innate and what can be learned, but I suspect a lot has to be innate (I think modern linguistic theory is as good a bet as any). I think an exciting thing about Bayesian models is that they give us a tool for investigating the relationship between innate knowledge and learnability. For example, if we can show that a model with innate knowledge X+X' can learn Z from data Y, but a model with only innate knowledge X fails to learn Z, then probably innate knowledge X' plays a role in learning Z. I said probably because someone could claim that the child's data isn't just Y but also includes Y' and from model X and data Y+Y' it's possible to infer Z. Or someone might show that a completely different set of innate knowledge X'' suffices to learn Z from Y. So of course a Bayesian approach won't definitely answer all the questions about language acquisition, but it should provide another set of useful constraints on the process.
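The X / X' / Y / Z reasoning can be sketched with a toy Bayesian learner. Everything here is an illustrative assumption rather than a model from the post: the "languages" are tiny sets of strings, each language generates its strings uniformly, and "innate knowledge" is encoded as a prior that either allows or rules out a rival hypothesis.

```python
def posterior(prior, likelihood, data):
    """Normalised posterior P(h | data), proportional to prior * likelihood."""
    unnormalised = {}
    for hypothesis, prior_prob in prior.items():
        weight = prior_prob
        for datum in data:
            weight *= likelihood(hypothesis, datum)
        unnormalised[hypothesis] = weight
    total = sum(unnormalised.values())
    if total == 0:
        raise ValueError("data has zero probability under every hypothesis")
    return {h: w / total for h, w in unnormalised.items()}

# Hypothetical toy languages: "target" plays the role of Z.
LANGUAGES = {
    "target": {"a", "b", "c"},
    "rival": {"a", "b", "d"},
}

def uniform_likelihood(hypothesis, datum):
    """Each language generates its own strings uniformly, and nothing else."""
    language = LANGUAGES[hypothesis]
    return 1.0 / len(language) if datum in language else 0.0

# Innate knowledge X: both languages are possible a priori.
prior_x = {"target": 0.5, "rival": 0.5}
# Innate knowledge X + X': the rival language is ruled out in advance.
prior_x_plus = {"target": 1.0, "rival": 0.0}

data_y = ["a", "b", "a"]        # data Y: compatible with both languages
data_y_extra = data_y + ["c"]   # data Y + Y': "c" occurs only in the target

post_x = posterior(prior_x, uniform_likelihood, data_y)
post_x_plus = posterior(prior_x_plus, uniform_likelihood, data_y)
post_x_extra = posterior(prior_x, uniform_likelihood, data_y_extra)
```

With prior X and data Y the learner stays split evenly between the two languages, while X + X' settles on the target. The last call illustrates the caveat: a learner with only X also settles on the target once the data include Y'.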

11 comments:

Norbert, April 23, 2015 at 2:22 PM
This comment is in two parts due to length. It's sort of fun commenting on someone else's post.

Part 1:

@ Mark, some comments:

“…Bayesian approaches are often conflated with strong empiricist approaches, and I think this post does that too.”

If so, this was entirely unintentional. All that I was interested in was finding some real live exemplars of “less is more” reasoning. From what I was able to tell, it has generally been taken for granted that the only issue wrt finding algorithms is dealing with resource issues. So there was a tacit assumption that Carnap’s Principle is a regulative ideal in that the more information you can take into account the better. And by “better” we mean does a better job of doing what needs doing, e.g. word learning, word segmentation etc. I took Gigerenzer to be questioning this. And this had nothing to do with Empiricism (though you are quite right to think I see E’s baleful menace lurking everywhere). As evidence of my intent I note that I mentioned that these considerations applied to my favorite view of things linguistic: the ideal speaker-hearer model. At any rate, here I plead innocent.

“Given this, I think a very reasonable approach is to formulate a model that includes as much relevant information from universal grammar as we can put into it, and perform inference that is as close to optimal as we can achieve from data that is as close as possible to what the child receives.”

“Why not start with the default assumption that the same is true of language?”

I could not agree more as a research strategy. However, it seems worth keeping in mind that the “optimal” strategy might not be the “rational” one if by this we mean adhering as closely as possible to Carnap’s Principle. I take this to be a consequence of Charles’s discussion where he provided some parameters for when ignoring information might be optimal even if not rational via CP.
Norbert, April 23, 2015 at 2:23 PM
Part 2:

I completely agree with your very reasonable reservations about the relations between algorithmic complexity and brain circuitry. But I am not sure about which Chomskyans you are thinking of when you think them opposed to rational analyses of acquisition. Like I’ve said before, this program looks very like the one outlined in Aspects with a few statistical bells and whistles. I know that Jeff Lidz is a big fan and I’ve discussed his work on FoL (here). The push back has not been ideological (e.g. Berwick, Yang) but empirical (this approach does better than the Bayes one in this context). What’s wrong with that? It is, I would think, an empirical issue in the end, right? Bayesians sometimes talk as if their game is the only one in town because it is rational. Well, maybe. But it’s nice to find cases where we can “test” the view and it is important to see that this argument is not conceptual but empirical. For this we need examples at work. Are these analyses right? I wouldn’t be in a position to know. Are they interesting? Well, yes, at least conceptually.

“But our work shows that you don't need to assume this; instead, a single model that does incorporate these dependencies combined with a more effective search procedure actually learns words from scratch more accurately than the ad hoc procedures.”

Again, my intent was not to endorse the models but to provide exemplars of where the common CP assumption might prove false and how to see this in a real world case. I find this idea interesting for my own parochial reasons. Your bet, which might be right, is that once the right nativist assumptions (ones that care about linguistic structure) are plugged in, something like Bayes will prove to be close to correct, as in vision. Maybe. Like I said, this would vindicate the Aspects view of things (and would not be something that I would gainsay). But it is worth noting alternatives that are not silly even if they might be wrong. Again, the only way to find out is by inspecting particular cases.

“I think an exciting thing about Bayesian models is that they give us a tool for investigating the relationship between innate knowledge and learnability.”

I agree and I’ve highlighted pieces that try to do this in FoL. So, let’s be clear: I have nothing against Bayes. In fact, I think that it’s worth investigating empirically. I do have a problem with the assumption that it must be right and with some of the overselling. But the basic idea seems fine. And with all such fine ideas it’s nice to know what would be considered a non-version of the idea.
Mark Johnson, April 24, 2015 at 6:34 AM
Thanks Norbert for posting my comments -- as an article nonetheless! I've been called many things, but this is the first time I've been called "estimable". (Earlier this week a colleague said I had "gravitas"; I'm not sure if she's referring to my age or my weight!).

Anyway, at the risk of boring our readers by having too great an agreement, I do of course agree with you that a Bayesian approach is very close to the Aspects view. As I hope you also agree, Bayesian theory also suggests general principles for setting parameters in a Principles and Parameters approach, and a way around "no negative evidence" issues in parameter setting, so I think it is quite compatible with the general generative point of view. (I'm not sure it has much to say about Darwin's problem and Minimalism, but I'm open to suggestions!)

In order to enliven this post let me end with a few deliberately tendentious things.

First, I think some people -- often psychologists -- speak as if experimental results such as reaction times are the most important facts about language. But of course they aren't. The central fact about language acquisition is that children actually acquire languages, just as the central fact about language processing is that humans actually produce and comprehend sentences. A theory of acquisition or processing is seriously deficient if there's no plausible way of extending it to account for these central facts, even if it happens to agree exquisitely with certain experimental results. (Let me temper this by saying that it may be reasonable to focus on a deficient theory if one thinks it is the most promising path forward, but you shouldn't forget that the theory is nevertheless deficient!).

I also think that a weakness of the Bayesian approach is that it hypothesises an "ideal learner", but doesn't explain how such an ideal learner might actually function. I think that's actually a reasonable approach, given that we know next to nothing about how complex representations such as trees are actually processed in the brain. It parallels the classical approach in generative grammar to language processing, which is factored into competence and performance. A Bayesian "ideal learner" is a kind of competence theory, and eventually I hope we'll be able to understand the "learning performance limitations" of real human learners. I'd like to see an approach to acquisition that parallels the classical approach to performance in generative grammar, in which performance constraints interact with but don't supplant the ideal competence theory.

Best,

Mark
Norbert, April 25, 2015 at 5:28 PM
Another comment from Mark J via N(orbert)-mail:

Of course all we need is a theory that accounts for the languages children actually learn from the kinds of data they actually receive. But so far nobody has anything close to this. There's at least a pathway along which Bayesian approaches can be extended to achieve this goal, although I expect there will be many challenges and surprises along the way. Of course it's interesting that Charles' model agrees with his experimental data, but I think it's important to remember that these experiments aren't the same as language acquisition, and I don't see a way of extending Charles' model to provide a general model of language acquisition. And I see explaining language acquisition, rather than any single psychological experiment, as our goal. (I don't want to put words in Charles' mouth, but I think he could plausibly argue that it's premature to worry about general theories of language acquisition; his experiments might be our equivalent of Galileo's inclined planes, and all will become clear when our Newton arrives).

I think a version of Darwin's problem is lurking behind much of computational language acquisition and computational psycholinguistics. How does the child know which algorithm it should use to acquire all the different kinds of knowledge it acquires? (This is still very challenging even if one claims that only the lexicon needs to be acquired since lexical entries can be quite abstract, e.g., empty functional categories that determine word order and extraction domains). What about the parsing algorithm and the production algorithm? It's possible that we have innate procedures for each of these (e.g., computer programs in the genome), but I think we should try to see if they might follow from more general principles. Approaches like the Bayesian one provide a partial answer here: if prior information and data are combined following certain principles, then the posterior is guaranteed to converge to the correct result. So if the child can find an algorithm that follows these principles with respect to a certain body of knowledge and type of data, the child has a language acquisition procedure, or parser, or whatever.

Now Bayesian principles don't specify how these principles should be instantiated in an algorithm. There are recipes for algorithms that follow these Bayesian principles (the Particle Filter algorithm schema is one of these, and from 30,000 feet Charles' algorithm seems similar to Lisa Pearl's 1-particle particle filter algorithm), and if the child can follow one of these recipes they would have an algorithm guaranteed to eventually converge to the correct hypothesis. But there are at least a couple of challenges here. First, finding an algorithm that follows one of these recipes is often non-trivial for problems with complex structure: e.g., you can probably get a publication in a computational linguistics journal if you can devise a particle filter that learns morpho-phonology. Second, because we know essentially nothing about how hierarchical structures like trees are represented in neural circuitry, we have little idea of what kinds of computations are simple or natural for the brain to perform. So I suspect it's premature to try to identify the algorithms used in acquisition. Instead, I think we should try to identify the knowledge and information used in acquisition; e.g., obtain results along the lines of "innate principle X and primary linguistic data Y suffice to infer Z". But of course I really don't know what approach (if any) will help us understand language acquisition.
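For readers unfamiliar with the schema, here is a minimal sketch of a generic particle filter (sequential importance resampling) over a small discrete hypothesis space. It is not Charles' algorithm or Lisa Pearl's 1-particle filter, and the coin-bias hypotheses, the data, and every name below are hypothetical. It also has no rejuvenation step, so a hypothesis that drops out of the particle set can never return, and with few particles this simple version carries no convergence guarantee.

```python
import random

def particle_filter(data, hypotheses, likelihood, n_particles=200, seed=0):
    """Process `data` one datum at a time, keeping a sample of hypotheses.

    Each particle is one hypothesis. After every datum the particles are
    resampled in proportion to how well they predicted that datum.
    """
    rng = random.Random(seed)
    # Initial particles: draws from a uniform prior (an assumption).
    particles = [rng.choice(hypotheses) for _ in range(n_particles)]
    for datum in data:
        weights = [likelihood(h, datum) for h in particles]
        if sum(weights) == 0:
            raise ValueError("every particle assigns zero probability to the datum")
        # Resampling step: well-predicting hypotheses get duplicated,
        # poorly-predicting ones tend to die out.
        particles = rng.choices(particles, weights=weights, k=n_particles)
    return particles

def coin_likelihood(p_heads, flip):
    """Probability of one flip (1 = heads, 0 = tails) given a coin bias."""
    return p_heads if flip == 1 else 1.0 - p_heads

# Hypothetical toy problem: which bias does the coin have?
COIN_BIASES = [0.1, 0.3, 0.5, 0.7, 0.9]
FLIPS = [1, 1, 0, 1, 1, 1, 0, 1, 1, 1]

final_particles = particle_filter(FLIPS, COIN_BIASES, coin_likelihood)
# The share of particles on each hypothesis approximates its posterior.
posterior_estimate = {
    h: final_particles.count(h) / len(final_particles) for h in COIN_BIASES
}
```

The learner never stores the full posterior or the past data, only the current particles, which is why schemas like this are attractive as candidate "performance" versions of an ideal Bayesian learner.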