Thursday, April 20, 2017

Cedric Boeckx replies to some remarks by Hornstein on Berwick and Chomsky's "Why Only Us"

Norbert Hornstein commented on my "quite negative" review of Berwick and Chomsky's book _Why Only Us_, published in Inference. Here is the link to his comments, posted on the Faculty of Language blog:

I want to begin by thanking him for reading the review and sharing his thoughts. The gist of his remarks is, if I read him right, that he is "not sure" what I find wrong with _Why Only Us_. He concludes by saying that "Why [Boeckx] doesn’t like (or doesn’t appear to like) this kind of story escapes me." I found the paragraphs he devotes to my review extremely useful, as they articulate, with greater clarity than what I could find in _Why Only Us_, some of the intuitions that animate the kind of evolutionary narrative advocated in the book I reviewed: The "All you need is Merge" approach. Hornstein's points give me the opportunity to stress some of the reasons why I think this kind of approach is misguided. This is what I want to express here.

Insofar as I can see, Hornstein makes the following claims (which are, indeed, at the heart of _Why Only Us_) [the claims appear in roughly the order found in Hornstein's original post]:

1. Hornstein seems to endorse Tattersall's oft-repeated claim, used in Why Only Us, that there is a link between linguistic abilities and the sort of symbolic activities that have been claimed to be specific to humans. This is important because the fossil evidence for these activities is used to date the emergence of the modern linguistic mind; specifically, to argue for a very recent emergence of the modern language faculty. Here is the relevant passage:
"in contrast to CB [Boeckx], IT [Tattersall] thinks it pretty clear that this “symbolic activity” is of “rather recent origin” and that, “as far as can be told, it was only our lineage that achieved symbolic intelligence with all of its (unintended) consequences” (1). If we read “symbolic” here to mean “linguistic” (which I think is a fair reading), it appears that IT is asking for exactly the kind of inquiry that CB thinks misconceived."
Perhaps it is too strong to say that Hornstein endorses it, but clearly he does not buy my skepticism towards this kind of argument, expressed in my review (and backed up with references to works questioning Tattersall that unfortunately Hornstein does not discuss, deferring to Tattersall as the unique expert).

2. Hornstein grants me "several worthwhile points"; specifically, my claims that "there is more to language evolution than the emergence of the Basic Property (i.e. Merge and discretely infinite hierarchically structured objects) and that there may be more time available for selection to work its magic than is presupposed". Hornstein writes that "many would be happy to agree that though BP is a distinctive property of human language it may not be the only distinctive linguistic property." He continues "CB is right to observe that if there are others (sometimes grouped together as FLW vs FLN) then these need to be biologically fixed and that, to date, MP has had little to say about these. One might go further; to date it is not clear that we have identified many properties of FLW at all. Are there any?" Later on, he writes "CB is quite right that it behooves us to start identifying distinctive linguistic properties beyond the Basic Property and asking how they might have become fixed. And CB is also right that this is a domain in which comparative cognition/biology would be very useful", but stresses that "It is less clear that any of this applies to explaining the evolution of the Basic Property itself."

3. Hornstein seems to think that my problem is that I "think that B[erwick]&C[homsky] are too obsessed with" recursion (or Merge). He goes on: "But this seems to me an odd criticism. Why? Because B&C’s way into the ling-evo issues is exactly the right way to study the evolution of any trait: First identify the trait of interest. Second, explain how it could have emerged.  B&C identify the trait (viz. hierarchical recursion) and explain that it arose via the one time (non-gradual) emergence of a recursive operation like Merge. The problem with lots of evo of lang work is that it fails to take the first step of identifying the trait at issue. ... If one concedes that a basic feature of FL is the Basic Property, then obsessing about how it could have emerged is exactly the right way to proceed"

4. He thinks that my "discussion is off the mark" (specifically, my insistence on descent with modification and bottom-up approaches in the review) because Merge "is not going to be all that amenable to any thing but a “top-down, all-or-nothing” account". "What I mean", Hornstein says, "is that recursion is not something that takes place in steps"; "there is no such thing as “half recursion” and so there will be no very interesting “descent with modification” account of this property. Something special happened in humans. Among other things this led to hierarchical recursion. And this thing, whatever it was, likely came in one fell swoop. This might not be all there is to say about language, but this is one big thing about it and I don’t see why CB is resistant to this point."

5. Hornstein stresses that he "doubt[s] that hierarchical recursion is the whole story (and have even suggested that something other than Merge is the secret sauce that got things going), I do think that it is a big part of it and that downplaying its distinctiveness is not useful." He goes on: we "can agree that evolution involves descent with modification. The question is how big a role to attribute to descent and how much to modification (as well as how much modification is permitted). The MP idea can be seen as saying that much of FL is there before Merge got added. Merge is the “modification” all else the “descent.” "No mystery about the outline of such an analysis, though the details can be very hard to develop"... "it is hard for me to see what would go wrong if one assumed that Merge (like the third color neuron involved in trichromatic vision (thx Bill for this)) is a novel circuit and that FL does what it does by combining the powers of this new operation with those cognitive/computational powers inherited from our ancestors. That would be descent with modification"

6. He sums up: "The view Chomsky (and Berwick and Dawkins and Tattersall) favor is that there is something qualitatively different between language capable brains and ones that are not. This does not mean that they don’t also greatly overlap. It just means that they are not capacity congruent. But if there is a qualitative difference (e.g. a novel kind of circuit) then the emphasis will be on the modifications, not the descent in accounting for the distinctiveness. B&C is happy enough with the idea that FL properties are largely shared with our ancestors. But there is something different, and that difference is a big deal. And we have a pretty good idea about (some of) the fine structure of that difference and that is what Minimalist linguistics should aim to explain"

All of these are interesting points, although I think they miss the target, for reasons worth making explicit (again), if only because that way we can know what is likely to be productive and what is not. After all, I could be wrong, and Hornstein (and Berwick/Chomsky in Why Only Us) could be wrong. I'll tackle Hornstein's points in a somewhat different order from the one he used, but I don't think that doing so introduces any misrepresentation.

Let's begin with points of (apparent) agreement: Hornstein is willing to concede that we need a bit more than Merge, although if I read him right, he is not as clear about it as I would like. Why do I say so? On the one hand, he writes that "many would be happy to agree that though BP is a distinctive property of human language it may not be the only distinctive linguistic property. CB is right to observe that if there are others (sometimes grouped together as FLW vs FLN) then these need to be biologically fixed and that, to date, MP has had little to say about these. One might go further; to date it is not clear that we have identified many properties of FLW at all. Are there any?" On the other hand, he writes "CB is also right that this is a domain in which comparative cognition/biology would be very useful (and has already been started [FN: There has been quite a lot of interesting comparative work done, most prominently by Berwick, on relating human phonology with bird song])".

I won't comment much on the references provided by Hornstein in that footnote, but I must say that I think it reveals too much of a bias towards work done by linguists. In my opinion, the great comparative work that exists has not been done by linguists (in the narrow sense of the term). Hornstein's is not a lovely bias to display in the context of interdisciplinarity (indeed, it's not good to show this bias on a blog that likes to stress so much that people in other disciplines ignore the work of linguists. Don't do unto others ...). In the case of birdsong, this kind of work goes several decades back, and detailed studies like Jarvis 2004(!), or Samuels 2011 (cited in my review) hardly justify the "has already been started" claim about comparative cognition. But let's get to the main point: we can't just ask "are there any? (shared features)" and at the same time cite work that shows that there are a lot of them. But there is something worse, in light of my review: Hornstein seems to have no problem with the usefulness of comparative cognition ("a domain in which comparative cognition/biology would be very useful") so long as it applies to everything except Merge ("there will be no very interesting “descent with modification” account of this property"; "It is less clear that any of this applies to explaining the evolution of the Basic Property itself."; "this property is not going to be all that amenable to any thing but a “top-down, all-or-nothing” account"). This is one of the issues I intended to bring up in the review, and it is what I called "exceptional nativism". I'll return to this below, but for now, let me stress that even if Hornstein grants me that there is more than Merge, for him Merge is still special, in a way that is different from "other distinctive linguistic properties".

It's not the case that I object to _Why Only Us_ because I think Berwick and Chomsky are "too obsessed with Merge". I object to it because I think they obsess about it in the wrong way: they (and Hornstein) are obsessed with making it not only special, but distinct-in-a-way-that-other-distinct-things-are-not: they take it out of the Darwinian domain of descent with modification.

Hornstein discusses Descent With Modification, but his prose reveals that he and I understand it differently. Indeed, he appears to understand it in a way that I warned against in my review. Here is the key passage: "the MP idea can be seen as saying that much of FL is there before Merge got added. Merge is the “modification” all else the “descent.” " I think this is wrong. It takes the phrase descent with modification pretty much like most linguists understood the FLN/FLB distinction of Hauser, Chomsky and Fitch 2002: there is FLN and there is FLB. There is descent and there is modification. But I don't think this is the core of the Darwinian logic: "Descent with modification" ought to be understood as "modified descent" (tinkering), not as "descent" put side by side with/distinct from modification. Because modification is modification of something shared; it's inextricably linked to descent. Descent with modification is not "this is shared" and "this is different", such that when you put these two things together you get "descent and modification", because the different is to be rooted in the shared. We can't just say Merge is the 'modification' bit unless we say what it is a modification of. (Needless to say, if, as Hornstein writes, we replace Merge by some other "secret sauce that got things going", my point still applies. The problem is with thinking in terms of some secret sauce, unless we break it down into not-so-secret ingredients, ingredients that can be studied, in isolation, in other species. That's the message in my review.)

The Darwinian logic is basically that the apple can only fall so far from the tree. The apple is not the tree. But it is to be traced back to the tree. As I put it in the review: there's got to be a way from there (them) to here (us). And Merge should not be any exception to this. I'll put it this way: if the way we define Merge makes it look like an exception to this kind of thinking, then we are looking (obsessing?) at Merge in the wrong way. If we find Merge interesting (and I do!), then we've got to find a way (or better, several ways, that we can then test) to make it amenable to "descent with modification"/"bottom-up" (as opposed to all-or-nothing/top-down) approaches. Of course, we could say, well, tough luck: we can't choose what nature gave us. It gave us Merge/recursion, and you can't understand this gradually. It's there or not. It's discontinuous but in a way different from the kind of discontinuity that biologists understand (significantly modified descent). But if we do so, then, tough luck indeed: we are consigning Darwin's problem to mystery. A fact, but a mysterious one ("Innate knowledge is a mystery, though a fact", as McGinn once put it.) It's fine by me, but then I can't understand how people can write about "All you need is Merge" accounts shedding light on Darwin's problem. They must mean Darwin's mystery. To use Hume's phrase, such accounts restore Merge to "that obscurity, in which it ever did and ever will remain".

Don't get me wrong. It's a perfectly coherent position. Indeed, in a slightly different context, referenced in my review, David Poeppel talks about the "incommensurability problem" of linking mind and brain. Maybe we can't link the two, just like we can't understand the evolution of Merge. Note here that I say the evolution of Merge. At times, Hornstein, like Berwick and Chomsky, gives me the impression that he thinks Merge is the solution. But it's the problem. It's that which needs to be explained. And I think that one way to proceed is to understand its neural basis and trace back the evolution of that (that is, engage with Poeppel's granularity mismatch problem, not endorse his incommensurability problem), because perhaps Merge described at the computational level (in Marr's sense) is mysterious from a descent-with-modification perspective, but not so at the algorithmic and implementational levels. And I think it's the task of linguists too to get down to those levels (jointly with others), as opposed to lecturing biologists about how Merge is the solution, and how it's their problem if they don't get it. (Incidentally, it's a bit ironic that Hornstein praises Lobina's discussion of recursion in his blog post, but does not mention the fact that Lobina took issue with Hornstein's take on recursion in some of his publications.)

Hornstein writes that "The problem with lots of evo of lang work is that it fails to take the first step of identifying the trait at issue". He does not give references, so I cannot judge what he means by "lots". I like to think I read a lot, and my assessment doesn't match Hornstein's at all. I think a lot of work in evo of lang promotes a Darwinian feeling for the phenotype. This is quite different from, say, a D'Arcy Thompsonian feeling for the phenotype. I see in lots of evo of lang work a desire to make talk about evo of language amenable to conventional evolutionary discourse. Why Only Us ("this property is not going to be all that amenable to any thing but a “top-down, all-or-nothing” account") is the exact opposite.

Perhaps a bit of an empirical discussion would help (I always find it helpful, but I don't know if Hornstein would agree here). Let's take the example of vocal learning, a key component of our language-faculty-mosaic. Not as cool as Merge for some, but still pretty neat. In fact, the way many talks and papers about vocal learning begin is quite like the way linguists like to talk about Merge. Let me spell this out. It's often pointed out that vocal learning (the ability to control one's vocal apparatus to reproduce sounds one can hear, typically from conspecifics) is a fairly sparsely distributed trait in the animal kingdom. It may not be as unique as Merge ("Only us"), but it's close: it's "only" for a select few. (I stress that I am talking about the classic presentation of this trait; ideas of a vocal learning continuum, which I like, would only make the point I am about to make stronger. See work by Petkov and Jarvis on this.) Like Merge, vocal learning looks like an all-or-nothing affair: you have it, or you don't. But unlike Merge, people have been able to gain insight into its neural structure, and break it down into component pieces. Among these, there is a critical cortico-laryngeal connection that appears to qualify as the "new circuit" that underlies vocal learning (see Fitch's book on evolution of language for references). And people have been able to get down to the molecular details for birds (Erich Jarvis, Constance Scharff, Stephanie White, and many others), bats (Sonja Vernes), and link it to language/speech (Simon Fisher and lots of people working on FOXP2). Erich Jarvis in particular has been able to show that most likely this new circuit has a motor origin, and "proto" aspects of it may be found in non-vocal learning birds (suboscines). All of this is quite rich in terms of insight. And this richness (testability, use of comparative method, etc.) makes the Merge solution to Darwin's problem pale in comparison.

It's true, as Hornstein points out, that linguists know a lot about Merge, but they only know it from one perspective (the "Cartesian" perspective, the one that leads to "Why Only Us"), and this may not be the most useful perspective from a Darwinian point of view. That was the main message of my review of Why Only Us.

Final point, on timing, and Hornstein's appeal to Tattersall's review of the fossil evidence for the emergence of symbolic behavior. No one can know everything (where would one put it?, as they say), but in my experience it's always good to rely on more than one expert (I cite some in the review). My main point about this in the review was not so much to question the dating of said emergence, but rather to ask for an account of how symbolic behavior is linked to linguistic behavior. I can see why it's plausible to think these are linked. But if the argument concerning the timing of the emergence of the language faculty rests on very few things, and it's one of them, we want more than a plausibility argument. (Norbert's blog loves to say "show me the money" when non-generativists make claims about language acquisition based on what strikes them as plausible. I guess it's only fair to say "show me the money", or else I'll start selling Hornstein bridges.) It's plausible to think "how could they have done it without language". So, symbol, ergo language. But then again, I can't remember how I lived without my iPhone. Poverty of Imagination arguments can be very weak. I know of very few attempts to mechanistically link symbolic and linguistic behavior. One attempt, by Juan Uriagereka, was about "knots" and the Chomsky hierarchy. David Lobina showed how this failed in a paper entitled "much ado about knotting". Victor Longa tried to link Blombos-style marks to crossing dependencies in language, but it's fair to say the argument for crossing dependencies there is nowhere near as neat as what Chomsky did in Syntactic Structures. Apart from that, I am not aware of any explanatory attempt. I once asked Tattersall after a talk he gave, and he told me something that amounted to "ask Chomsky". When I ask linguists, they tell me to ask the experts like Tattersall. And so I begin to be suspicious...

But there is something more I could say about the timing issue that came to my mind when reading Hornstein's comments: if Merge is such an all-or-nothing thing, not subject to the logic of descent, then why should we care if it's recent or not? It could have emerged, in one fell swoop, millions of years ago, and remained "unattached" to "speech devices" until recently. And so why do we want to insist on a very recent origin? The timing issue is only very important if issues about gradual evolution matter. But if we stress that gradual evolution is out of the question, then the timing issue is a red herring. So, why does Hornstein insist on it?

Let me close by thanking Norbert again for his comments. They led to this long post. But they often made me feel like Alice when she tells the White Queen that one can't believe impossible things. Hornstein, like the White Queen, disagrees. But it's not Wonderland, it's Darwinland.


  1. Thx for the reply. Some comments:
    1. I did not endorse Tattersall. I did note that he is expert in these areas and that his views coincide with those of B&C. Your review suggested that B&C's views were contentious. It was worth noting that there are experts who agree with them concerning their timing AND on thinking the why-only-us question is biologically relevant and important. I take it we agree that IT counts for these purposes. "Inference" seemed to think so.

    2. It is not that Merge is special, but that hierarchical recursion is special. Do we have analogues of the kinds of structures we find in Gs anywhere else? Poeppel doesn't think so, Dehaene doesn't. IT doesn't. And so far I have not seen anyone present evidence that there is anything like G hierarchical recursion anywhere else. Now, this does not mean that it could not be lurking in other non-human cognitive domains. So show us. If it does not (and I am assuming it doesn't) then it is unique. And in need of explanation.

    How to explain THIS kind of unique capacity? Well, not in small steps if by this one means that one gets hierarchy, then a bit more, then a bit more and then presto unbounded. This does not work for the reasons that Dawkins noted (and many before him). You don't get to unbounded in small steps. So how do you get it? Well, all at once seems plausible if the inductive operation is simple. Merge is the proposed simple operation that comes in all at once and if it does it explains the unique property observed.

    So, we have an identified trait (unbounded hierarchical recursion) and an identified mechanism (Merge). And a conclusion: if it is Merge then it came in all at once. As you might know, I have proposed another mechanism. But it has the same features: it did not come in piecemeal. At any rate, I do not see anything wrong with this line of argument and nothing you mentioned leads me to change my mind here. Maybe if you identified other cognitive domains found in our ancestors that have this kind of hierarchical recursion.

    BTW, what made the Berwick Bolhuis et al stuff interesting is that it showed that a plausible analogue (Bird Song) has a different computational structure than the one found in syntax. And the way that it is different makes it hard to see how without substantial "modification" we could get from it to what we have, even if one thought that syntax built on this.

    3. I have no idea what you are driving at in the sermonette on descent with modification. I have no problem arguing that what we have builds on what our ancestors had. But I am quite confident that what we have is qualitatively different from what they had in at least one respect (unbounded hierarchy). If I am interested in THIS property (I am) then I am ready to accept the view that the addition of recursive circuits of some kind was the introduction of a real novelty. Can one get this by small rewirings of the brain? Who knows! We know next to nothing about these issues in even simpler domains. If it can be, great. Right now, nada interesting has been found, or so my experts tell me.

  2. 4. I think that going to the neural basis of a trait to explain the trait is going about it backwards. To identify the neural basis of a trait we need to identify the trait. Until then we cannot find its neural basis. The trait is unbounded hierarchy. GG has described it pretty well. Now we can look for the neural basis (as Poeppel and Dehaene and others are doing). Interestingly this work has been stimulated by the Merge based view of structure building, so the linguistics has not been entirely idle.

    Moreover, going about things in this way is likely the only way to do it. We even have some history here. Mendel preceded the biochemists. We did not build up genetic theory from biochemical primitives but went looking for biochemical structures that could code genetic structures. I think this is the right way to go in biolinguistics as well: find the ling structures and look for implementations. This is what Poeppel and Dehaene are doing, and I think it is a very smart way to go. Maybe you don't, but I don't see why not.

    5. I don't mention Lobina criticized me on recursion and so I cannot commend his review? This is a joke right? Or is the problem that if you commend someone then you must also fess up that they think you have been wrong? LOL.

    6. I tend to agree that the timing issue might be a red herring. Whenever Merge arose (if it is Merge) it was simple. The best argument I can see for thinking it was recentish is that it has remained relatively unencumbered. It is the stability of recursion and the fact that we don't have different kinds in different Gs that suggests it is new and so has not been subject to evolutionary pressures to "improve." Now, one might answer that there is no room for improvement and that's why we find the same kinds of recursive hierarchies in all Gs. And if this is right then the recency argument is not a big deal.

    That said, it seems that many OTHER THAN B&C think it a recent addition and IF that is so then it strengthens the argument that whatever popped out had to be "simple." There remain other problems with even this conclusion (e.g. what's the relation between functional simplicity and physiological/genetic simplicity) but this simply assumes the "Phenotypic Gambit" (something I talked about in another post), which is pretty standard in evolution discussions.

    So, yes, the recency argument is not all that strong nor all that necessary. If it is true, it adds a modest amount of motivation for simple operations like Merge.

    1. "Now we can look for the neural basis (as Poeppel and Dehaene and others are doing)"

      I just want to push back on the equation of neural signals that correlate with phrase structure and the "neural basis" of merge or some other GG construct. This assumes parser=grammar! This is why Bemis & Pylkkanen et seq. discuss "syntactic structure-building" when comparing simple phrases to word lists, and why it drives me bonkers when Friederici et al. discuss the same manipulation as isolating "merge."

      I worry that it pushes us in the wrong direction (Embick & Poeppel's "mutual sterilization") to forget that the brain is doing parsing when it is comprehending sentences. Many effective (and minimalist friendly!!) parsing models reject a parser=grammar assumption (e.g. Stabler, going back at least to the 1991 piece on non-pedestrian parsing.)

      Upshot: Until we've resolved R(parser,grammar), the neural signals related to phrase structure will remain only indirectly related to constructs like merge.

      (Also, credit to the junior scholars doing the heavy lifting/lead authoring on these projects: Nai Ding and Matt Nelson!)

    2. The parser is not the G, it is the machine table/instructions on how to parse an incoming string. So, yes, the parser is not the grammar, but it might (partially) index parsing complexity. This is what we might expect if the relation between G and the G the parser implements is sorta transparent. The most recent paper by Dehaene and Co does the right thing by embedding a Merge based G in a left corner parser and seeing what this implies for brain behavior. As you know (and I will blog on very soon) it looks like something like a stack is implicated (as we knew it would be) and that the brain load looks well indexed by such a Parser implementing a Merge based G.

      Your main point, however, is right on: we cannot detect Gs directly as they don't DO anything to incoming strings. Gs don't parse. But Gs in Parsers do determine how parsing proceeds and there can be more or less transparent mappings between Gs and the Parsers that implement them. I think that a reasonable first assumption is that the mapping is pretty transparent. This makes investigating these kinds of objects easier. It even looks like it might be roughly right. As Alec Marantz once observed, this is just standard cog-neuro and there is no reason not to apply it to the study of language just as we do to the study of anything else.

    3. "But Gs in Parsers do determine how parsing proceeds"

      Yes, I'd agree that they "*partially* determine" how parsing proceeds, and more broadly with the notion of "sorta transparent." How transparent? We don't have to make a first assumption that the mapping is "pretty transparent": we have computational psycholx theories that spell out plausible mappings in gory detail! Nelson et al., and others, are taking advantage of these models (e.g. Hale's automata), but labeling the outcome "merge" is misleading. If you see brain correlates with bottom-up stack depth, then that's what you're seeing. Stack depth =/= merge, even if its dynamics are partially conditioned by grammatical structures.

      Quibbles: Nelson et al. don't implement a MG (it's not clear to me exactly what they implement... it has agreement, but their version can probably be cashed out as a CFG). Also, their data are ambiguous between left-corner and bottom-up parsing. I have a paper forthcoming in Cognitive Science that points towards left-corner specifically using methods not dissimilar to Nelson's (doi: 10.1111/cogs.12445). The wonderful world of academic publishing means ours has been online for about 6 months but has yet to have an issue.

      The role of a stack is an important one to pull out. It's also at the basis of my work with John Hale. But Lewis, Vasishth, McElree et al. are not wrong when they reject stack-based memory for humans (interference effects!). The current crop of neuro studies just assume stack-based parsers, they don't argue for them over alternatives.

    4. I agree with "partially". Not sure I see the point re Merge and stack depth. You put things on stacks till they can be resolved by some operation specified by the G. In the PNAS paper the metric for time on stack and the periodicity of the resolution into larger phrases seems well described by a Merge based system. So, there is a connection. Might it be more complex than Merge and still get the same results? Sure. They choose Merge for parsimony reasons, and this strikes me as natural and right. Moreover, like Merge, the brain measures abstract from the quality or size of things merged. At any rate, the fit seems good and is interesting work, though very definitely not the last word.

      I agree stacks are important and that it's nice to see them making a comeback. I also agree that there is no reason to think that their evidence for bottom up parsing is compelling given the kinds of Ss they targeted. I like left corner parsing for Berwick Weinberg reasons.

      So, quibbles accepted, but I still think that G simplification along MP lines is what is instigating this excellent new line of inquiry.

    5. Unless you spell out for me how a Minimalist grammar is transparent to a left-corner parser (or any other parser), I do not believe it.

  3. A few thoughts.

    1. How is citing Samuels 2011 a tonic for "too much of a bias towards work done by linguists"? And though you cite Samuels 2011 in your review, you don't cite Jarvis 2004. For those interested that one is:

    2. I don't think vocal learning is an all-or-nothing trait, see for example Hultsch & Todt (2008) "Comparative aspects of song learning". For example, in some species there is a strong seasonal hormonal component to song learning that is noticeably absent in other species. Incrementalism seems easier on this front than it does on the Merge issue.

    3. Do we think that Merge is substantially different in manual (signed) languages? If not, then the vocal learning aspect seems irrelevant to that question (though probably not to at least some of phonology).

  4. To the extent that these differences of opinion are differences of perspective, I think this is what is at issue:

    Suppose we agree on three points:

    i) Hominins pre-Merge had cognitive abilities rivaling close relatives, but in degree rather than kind.

    ii) The immediate precursor to Homo sapiens was equipped with whatever physiological and conceptual systems are in use today.

    iii) Given (ii), the appearance of Merge secured linguistic ability in a very short time.

    You can then frame these facts in two ways:

    1) You say that Merge is our only genuine qualitative uniqueness, so the 'all you need is Merge' narrative follows naturally - we've (apparently) controlled for other biological properties, so it's enough to derive linguistic ability from that operation. From this, some people then draw the non-necessary conclusion that Merge is to a large extent independent of other properties of the species and systems it finds itself in - it's an abstractly available computation that happened to be implemented in humans and happened to be implemented in the way it was.

    2) You say that Merge is indeed our only genuine qualitative uniqueness and, still, the 'all you need is Merge' narrative follows, but this time in the more restricted sense that all *Homo sapiens* needed was Merge, but Merge itself needed particular aspects of the human cognitive system and is not independent in the way claimed by (1).

    Whereas (1) finds that language is predicated on Merge and Merge is predicated on little to nothing much else, (2) finds that language is predicated on Merge and Merge is predicated on something already special in human cognitive design.

    If we think of the typical modular description of syntax as embedded in the performance systems with which it interfaces, we need to ask to what extent that embeddedness is essential for the existence of syntax in the first place. Why are the generative mechanisms capable of taking the precise input that they take at all, and not of taking anything else? Why are the structures it produces readable at all by other cognitive systems? Is it because it just happens to be those neural circuits that had Merge-sauce drizzled over them, or was it because there was something already in the nature of those circuits that led to the very possibility of Merge being implemented? In the latter case, it remains a truism that Merge came in an instant - there is no half-recursion - but it would have been merely the last indivisible step in a long sequence of descending, modifying steps.

    1. I like this summary. And frankly we don't know. We are quite sure that Merge generated objects interface with other cognitive systems and it is useful to assume that these systems had whatever properties they now have before Merge emerged. But this is likely false (or, I wouldn't be surprised if it is). As an idealization we can assume that AP and CI have whatever properties they have independently of whether Gs have a hierarchical recursive mode of combination. We can then ask how this arose. What I am pretty sure about, and you agree, is that whatever the precursors were they did not give you partial recursion etc. This came "all at once."

      To repeat what I said in another comment: we should not confuse Merge with FL. The latter is certainly complex and contains much besides Merge. IMO a reasonable project is to try and figure out if there are general cognitive/computational properties (not linguistic specific) and ask if adding Merge to these would give you an FL like ours. This, as I understand it, is what MP is all about. Merge on this view is the only linguistically specific operation we need to assume as part of FL to understand its combinatorics. This is a squishy question (as are all research questions) but one that we have a partial handle on. That's enough for me.

  5. I am reminded here of Plato's famous belief that intelligibility was to be found only in the world of geometry and mathematics, with the complex world of sensation being unapproachable. An effective study of astronomy, in his view, requires that 'we shall proceed, as we do in geometry, by means of problems, and leave the starry heavens alone'. His belief in the importance of higher-order abstractions to properly characterise complex behaviour echoes in Krakauer et al.'s recent analysis of the current state of neuroscience. The authors propose that behavioural experiments are vital to the field, providing a level of computational understanding that lower-level approaches cannot. Neuroscientific experiments can give us interesting data, which can indicate which level we need to be investigating: the same level, a higher level, or a lower level. We might find out that we need to go lower, into the 'connectome' from brain dynamics, for instance - or the opposite. So there's no a priori route to investigating the neural basis of Merge - we just need to go via whichever path permits causal-explanatory linking hypotheses between distinct levels of analysis.

    Concerning the other 'mechanism' Norbert has in mind (labeling), see here:
    and here:

    Picking up on Cedric's theme of irony, for someone who commonly claims that B&C are simply recycling old campfire stories about Merge and not really progressing the field in the proper way, it's indeed ironic how often he seems to recycle quotes and ideas clearly taken straight out of Chomsky, e.g. the Turing quote at the beginning of the Inference piece, and also the Hume quote here. Cedric could simply be using Chomsky's own material directly against him here in a not-so-smart-now? kind of way, but if so, that strikes me as slightly immature.

  6. I am a little sceptical about the argument that recursion* is an all or nothing property and therefore Merge must have arrived suddenly. Could someone take a stab at fleshing it out a bit?

    It seems like it only goes through if you take a very literal view of the psychological reality of generative grammars.

    * recursion in the sense of a recursive production in a grammar I guess rather than in the recursive enumerability sense that Chomsky uses.

    1. Indeed there is no reason to think that from the formal analysis of a property alone one can derive the evolutionary steps (or lack thereof) it took for it to arise. It takes a little more than that. This is especially worrisome if the conclusion that a property must have arrived suddenly is used to discourage biological research on language evolution. So there are two choices: i) accepting the sudden-emergence, "only us", mystery view and moving on (not sure where, but to each his own) or ii) not accepting it and proceeding with investigation along the lines of evolutionary biology, comparative cognition, etc. in order to know more about it.

    2. Alex, literal as opposed to figurative? I always try to be literal. It's safer. Remember, the price of metaphor is eternal vigilance.

      Pedro: I have no idea what you are talking about. The question is how to break recursion down into discrete steps. How do you do this? Show us an example of how you get "and keep going ad libitum" without simply saying in some way "and keep going ad libitum." If this is right then it is a good thing to discourage bio research into THIS, for it is fruitless to engage in it. If it is wrong then show how. Unbounded hierarchical recursion seems like the sort of thing that cannot be broken down into steps. Maybe this is wrong. If so, demonstrate.

      Btw, there are other questions of bio interest one can ask, just not about this. For example, how were other systems retrofitted to service this new capacity? But this presupposes the recursive property is in place, not that IT evolved.

    3. Here is an obviously bad argument for comparison.

      Suppose someone studies animal navigation and says: to navigate you need to compute the distance between 2 points, and even if you store the two points as rational numbers, the distance will be an irrational number.

      Now being an irrational number is an all or nothing property -- either a number is rational or it is not -- therefore animal navigation must have evolved through a single step -- the basic property of animal navigation is the ability to represent and use irrational numbers etc etc.

      My scepticism about the recursion based argument is basically of the same type as my scepticism about this argument.
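
      The arithmetic premise of the navigation example can be checked directly. The sketch below is only an illustration of that premise: the function names are hypothetical, and it assumes points are given as pairs of exact rationals. It uses the fact that a reduced positive fraction has a rational square root exactly when its numerator and denominator are both perfect squares.

```python
from fractions import Fraction
from math import isqrt


def is_perfect_square(n: int) -> bool:
    """True if the non-negative integer n is the square of an integer."""
    return n >= 0 and isqrt(n) ** 2 == n


def distance_is_rational(p, q) -> bool:
    """p and q are (x, y) pairs of Fractions (hypothetical representation)."""
    d2 = (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2
    # Fraction stores d2 in lowest terms, so sqrt(d2) is rational
    # exactly when numerator and denominator are both perfect squares.
    return is_perfect_square(d2.numerator) and is_perfect_square(d2.denominator)


origin = (Fraction(0), Fraction(0))
print(distance_is_rational(origin, (Fraction(1), Fraction(1))))  # distance sqrt(2)
print(distance_is_rational(origin, (Fraction(3), Fraction(4))))  # distance 5
```

      The check says nothing about how an organism represents distances; it only confirms that the rational/irrational boundary is sharp for the numbers themselves, which is the premise the comparison relies on.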

    4. Animal navigation may not have evolved in a single step but if you are right about the non-reducibility of rationals to anything else then the availability of rationals did not evolve either. Their use in navigation however might have, as navigation involves MORE than just rationals as variables. Ditto language. Merge itself is all or nothing, but FL may not be (I think it isn't). There is more to language than unbounded hierarchy. But there is AT LEAST that. So it is reasonable to ask how it arose and to observe that it (like the rationals) could not have evolved step by step as it is entirely unclear how it could have evolved step by step. As Dawkins likes to say, it is a callable subroutine or it isn't. We have an induction step or we don't. We can take our outputs as inputs or we cannot. And being able to do so once does not explain why we can do so ad libitum. So, yes, Merge popped up all at once. This does not imply that FL did or that FLW did and FLW is a big deal. So evolve away, but don't expect a step by step account for the emergence of Merge. There won't be one.

      Hope this helps.

    5. That's kind of the point that I disagree with.

      It's the implication from:

      A) there is a sharp distinction between mathematical objects of type X and of type Y

      to

      B) If we use mathematical objects of type Y rather than X in some computation, then that computation can't have evolved gradually.

      That implication seems to need some work.

    6. I guess we do disagree, not for the first (or last) time. If we have evidence that some format is implicated in the system and that format is qualitatively different from what was there before then there is little reason for thinking that the emergence of the novel format proceeded in small quantitative steps. You don't get the pred calc from the prop calc in small steps nor do you get unbounded hierarchy from small additions to bounded hierarchy. We disagree. Oh well.

    7. "So evolve away, but don't expect a step by step account for the emergence of Merge. There won't be one."

      Yeah, this seems like the big mistake. Merge doesn't arise without a lot of stuff preceding it--you need a lot of cognitive architecture in place before you're going to get it. The story of the evolution of that cognitive architecture IS part of the story of Merge, at least under the evolutionary way of looking at things.

      If all you mean is, "Look, for a long time we didn't have Merge and then one day someone did, because of some genetic mutation," then that's trivially true but misses the point of an evolutionary approach entirely.

      To give a more concrete example, let's talk about the ability to repeat a string of phonemes you hear. There's a whole series of components involved in this, up to and including the fine control of the articulatory apparatus. Say there's five components, all of which are necessary. These steps can, in principle, evolve in any order, and you can't do the task without all of them. Once that last one snaps into place, it's tempting to cry out, "Aha! That's the mutation that led to the ability to repeat a string of phonemes!" But in a very important way, that's not true. The last one could've been in many ways trivial (say, a slight tweaking of musculature), and by focusing on it you're missing the real picture of what happened.

      Translate this to Merge, the ability to "put two things together" (or whatever definition you want to use). It also requires other abilities, including (1) the ability to distinguish a "thing" from everything else, and (2) the ability to mentally hold on to that "thing" long enough to put it together with another "thing." For all we know, (2), or some expansion of working memory/phonological loop ability, was the last component to fall into place. It made Merge visible, but it was already basically there.

    8. The kind of argumentation that Alex Clark is bringing attention to is indeed what I was talking about. It's weird that big claims can be made about the evolution of a property with a purely logical argument, that makes no reference to biological substrates, and is based solely on formal analysis. FL _as a whole_ is very obviously not all or nothing, and for that we don't need big arguments or speculation. It's complex, it's rich, it's messy. That much is clear (and it's fruitful if not necessary to learn more about that mess). For parts of it that can't be "halved", like merge, we have to be clear about what level of analysis we are talking about, and if confinement to that level is enough to make claims about the big picture. I think it's not, for different reasons (in no particular order, and also using some of the claims made above):

      - the fact that it's unclear how merge could have evolved does not mean it didn't evolve. I don't think this needs much defense (argument from incredulity yadda yadda).

      - there is no relationship between the formal analysis of a property and the number of evolutionary steps that led to it.

      - work on what goes on in the brain (as opposed to the dated claim that we know nothing about the brain) when recursive operations take place will reveal -- actually I don't think I need the future tense here -- that there isn't a nub somewhere that says "merge", that just popped up one day. There's intricacy, there's parts, and there's history to them.

      - The evolutionary steps of properties don't (have to) look like bits of that property put next to each other in succession.

      - it's easy to claim something arises "suddenly" once all the parts are in place.

    9. It is not an argument from incredulity, it is an argument from "show me!" If it is possible that it evolved tell me a possible story indicating the analytical links. If you cannot do this then it seems to me that you are conceding that you have nothing to say. Sure it is "logically possible" that it evolved in the standard way. It's also logically possible that there are angels that guide natural selection or that the Great Blue Bear is ready to eat the North Pole. Show me. Provide a piecemeal story and then we will talk.

      The phenotypic gambit, common in evo and that I discussed in an earlier post, makes just the assumption that there is a relationship between the formal analysis and the course of evo. It does this because it recognizes that analytically breaking down a problem suggests how what we see could have arisen. So, you think that Merge occurred in steps: show me the POSSIBLE steps. Show me how they followed from simpler ones under natural selection. Show me how to get the inductive oomph from non-inductive parts. Show me. Once you do, we can talk again.

      'Intricacy' is a word that hides ignorance. Next you'll be using "emergent" and "complex". This is not really helpful.

      It's also easy to say that it emerged from smaller components when you fail to specify the components. Look, you sometimes cannot get there from here. You can't get the predicate calculus from the propositional calculus though the latter contains analogues of the former. So someone who tells me that the former evolved from the latter is likely talking rubbish. I am saying the same thing wrt the property of unbounded hierarchy. It's easy to show me wrong: just show how it COULD have happened.

    10. Stephen P:
      "Merge doesn't arise without a lot of stuff preceding it--you need a lot of cognitive architecture in place before you're going to get it."

      Why assume this? What I assume is you need a lot of cognitive architecture to get an FL, but that you don't need much to get Merge because Merge does not build on previous architecture. It is a primitive operation whose properties do not reflect prior structure. Think of arithmetic and addition. You need numbers but it does not "build" on anything prior. It does not evolve from anything else. If there is addition it comes from nothing, creation ex nihilo. Of course, to be useful, we need numbers and memory and many other things. But IT comes from nothing else and builds on nothing else, unlike say wings in insects building on structures made available by thermoregulation. There is no analogue of this wrt addition or Merge.

      What you and many others are running together, from my perspective, is Merge vs the expression of it in performance. Sure, to display linguistic creativity for which Merge is essential requires more than Merge. Just like to be able to reckon a bill requires more than addition. But Merge does not build on earlier structure (or if it does I want the story) and addition doesn't either. It doesn't come in dribs and drabs built from prior structure the way wings for flight might. If the operation is primitive, it is NOT built up from prior structure.

      Now this still leaves lots for evolution to do, especially wrt FLW. It just does not leave anything for it to do wrt Merge. And that is Chomsky's point. You don't like it. Ok, show how the operation builds on prior structure, not the use of the operation (everyone agrees with this) but the operation itself. I don't see how to do it, but then again as Pedro has pointed out, I have a limited imagination and on these matters I am always willing to be schooled by the more talented.

      At any rate, I think this discussion once again indicates what is so wrong with much evolang discussion: it often fails to identify the trait of interest. So let me be clear, if the trait of interest is the merge operation or whatever it is that licenses the GENERATION of unbounded hierarchy, it is quite possible that it builds on nothing prior. It is primitive. And if so, it does not evolve, it simply comes into being however non-evolved traits do (e.g. mutation).

    11. "Why assume this?"

      Because I literally can't figure out how else it can work. Merge puts two units together into one set. By definition, you need "unit," "set," and "puts together." There's no reason that all these needed to appear at once. (For that matter, it's perfectly possible that Merge could operate in some (linguistic) contexts but not others, and slowly expand over time.)

      "What you and many others are running together, from my perspective, is Merge vs the expression of it in performance."

      I tried to distinguish between those two in my earlier post, actually, because it seems to me that Chomsky is confusing the two (which is understandable; it's hard to study Merge or anything else unless it's expressed).

      "it is quite possible that it builds on nothing prior."

      I honestly don't understand what this could possibly mean. What did the mutated gene *do* prior to mutation? What cognitive operations were in place prior to mutation?
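
      To make concrete what "puts two units together into one set" amounts to formally, here is a minimal sketch. It assumes, purely for illustration, that lexical items are strings and that Merge is unordered binary set formation; the names `merge` and `depth` are hypothetical, and nothing here is a claim about neural or genetic implementation.

```python
def merge(a, b):
    """Combine two syntactic objects into one unordered set."""
    return frozenset({a, b})


def depth(obj) -> int:
    """Levels of embedding: 0 for a lexical item, 1 + deepest member otherwise."""
    if not isinstance(obj, frozenset):
        return 0
    return 1 + max(depth(member) for member in obj)


dp = merge("the", "book")   # {the, book}
vp = merge("read", dp)      # {read, {the, book}}
tp = merge("will", vp)      # {will, {read, {the, book}}}
print(depth(dp), depth(vp), depth(tp))
```

      Because the output of `merge` is a legal input to `merge`, nothing in the definition bounds the depth of embedding. The disagreement in this thread is over whether that formal property tells us anything about the number of evolutionary steps behind it, and the sketch is neutral on that.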

    12. Or try this: if Merge can arise "ex nihilo," please explain how to give it to a crayfish, or a moose, or a paramecium, or anything else other than a human. Is that possible? Is it equally possible regardless of the organism? If it builds on "nothing prior," then the existing structures don't matter, right?

    13. I am sure that if merge can arise "ex nihilo" or "e.g. mutation", one day we will indeed be able to give it to a random animal. All we need is a fertilized egg and to know what the magic gene is. Especially given that FL is "at least merge", depending on the animal we might even be able to strike a conversation after a couple of years of talking to it (at it?)

  7. Alex's first comment made an assumption that recursion was about recursive production rules in the grammar and that's what he was worried about `appearing all at once'. But I don't think that's the argument in Why Only Us, is it? I thought the argument was that what distinguishes us from other species is *at least* some mental capacity which allows the recursive enumeration of an infinite set of hierarchically organised structures. I think that that argument goes through quite straightforwardly: it's just a version of Hume's argument (there's no inductively valid step from 1, 2, 3 to infinity, so you need to stipulate something that gets you this) plus the notion that whatever you stipulate has to be implemented in human brains, but apparently isn't in other brains. The question is what exactly it is that is so stipulated/implemented - the only theory we have of infinite enumerability is computability theory, and that requires composition/substitution of functions so they can take their own output as their input, primitive recursiveness, and minimal search. As far as I remember in my dim distant past understanding of computability, only the first of these is required for getting the successor function up and going (Alex, correct me if I'm wrong), so something analogous to that is all that is needed to get the hierarchical structure version of the successor function going. That would be Merge, or whatever bit of how Merge is defined that is reducible to the composition of functions. No matter if you break Merge up into something simpler, there's always going to be some bit of it without which you won't get infinite enumerability. Call that primitive Merge. Then that is what the relevant modification is. All the rest of FLN/FLB could come beforehand. Or indeed could come after (e.g. mapping principles in FLN). But without that primitive Merge bit, the enumeration of infinite hierarchical structure won't fly. 
    I'm not saying that I'm completely compelled by this argument - there could be alternatives - but I think it makes logical and evolutionary sense.

    1. Norbert: I was saying precisely that purely logical arguments are not enough. It is logically possible that merge emerged suddenly. Biologically, not so sure. I don't see how an "argument from "show me"" is a good one though (Bill O'Reilly used it with Dawkins once. Didn't look so good). But anyway: as much as I would like to sit at my desk, with pen and paper, and show you, it's a bit harder than that. It takes research. That's why logical arguments are not enough. We need to investigate what goes on in the brain, in order to know what the thing we label "merge" rests on. See what's shared, across domains and species, what's not... The problem (it's a problem) is complex (Since I still don't know everything there is to know, I'll use the word. Don't see a problem with it. You didn't either when you used it in this very comment section).

      But if "showing" is the name of the game here, I would like to know what the saltationist story is. How exactly did Merge arise suddenly? So far I've just heard the claim; not the story (btw: "it simply comes into being (e.g. mutation)" is not a story).

    2. @Pedro: I think stances like yours, which I see all the time in the literature, about now invite the replacement of the 'No True Scotsman' fallacy with 'No True Biologist'. I don't believe that there is much content to the negations you've offered of Norbert's arguments except the purportedly infallible generalisation that biology is messy and complex and nuanced and filigreed, so No True Biologist would make claims as to the mental computations underlying some behaviour without doing some Proper Comparative Biology. Maybe you're right, but you're giving only a methodological narrative and your taste for it, not an argument for its necessity, and others clearly don't see the necessity of it.

    3. @Callum: I think there's no *necessity* for one or the other position. (If there were, I'm sure everyone would be on the same side by now.) [I mean, of course, that personally I think there is, but I know very well that doesn't convince anyone, because I'm some guy, and not "the guy"]. The fact of the matter is that one position is methodologically confined to one island of knowledge, and leads to stagnation, and the other is not, and leads to more investigation. Parsimony is supposed to take you forward, not keep you put. - Do I wish that more people would do biology when they make biological claims? Yes. - Do I care if some don't? In a way yes, but deep down not really. People disagree all the time. - Am I expecting great discoveries about language coming from the "sudden-emergence because you can't have half-recursion" view in the future? No. By definition I guess that's impossible; they think "merge" is the solution. - Do I think actually studying the cognitive biology of recursion (and many other things) would lead to important discoveries, as opposed to declaring merge a biological exception on non-biological grounds, and concluding from that that language evolution is a mystery? Yes.

    4. David, the same issues arise in the move from finite to infinite, but I am not sure that that is the right place to locate the difference.
      If the ability in question is recursive enumeration of an infinite set, then not having that ability means either
      not being recursive (i.e. not being computable) or not being able to handle infinite sets.
      But of course the biological system in question is bounded by finite performance limitations, so all of the sets we are interested in are in fact contained in very large finite sets.

      So while there is a sharp boundary between finite and infinite sets, I don't think there is a sharp bound between
      biological systems that process finite sets versus biological systems that process infinite sets.
      The competence/performance distinction -- which for me is one of the perfectly reasonable assumptions of generative grammar --
      means that there may not be a difference between a "recursive" competence grammar and some limited processing systems,
      versus a "nonrecursive" competence grammar and some less restricted processing systems.
      (I don't think this is the case for current natural languages, but it could be for some proto-languages)
      Of course if you believe that competence grammars are *literally* inscribed in the brain, then there might be a fact of the matter
      but if you don't -- if for example you just think that they express certain regularities in processing systems (stealing a phrase that Greg Kobele uses) -- then there may not be a sharp distinction at all.

      So one could amplify this with examples from the evolution of recursive neural networks and their abilities to process unbounded sets; where you can have a gradual change from an RNN that implements a recogniser for a finite set of strings to one that implements a recogniser for an infinite set of strings without there being any particular point at which there is a jump. Willem Zuidema wrote something about this at Evolang recently.
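
      The point about finite performance bounds can be illustrated with a toy recogniser for a^n b^n (n >= 1). This is only a sketch under stated assumptions: the function name and the `max_count` parameter are hypothetical, the device is a hand-written counter rather than a trained network, and the cap on the counter stands in for a performance limit.

```python
from itertools import product


def accepts_anbn(s: str, max_count=None) -> bool:
    """Recognise a^n b^n (n >= 1) with a counter.

    With max_count=None the counter is unbounded (an infinite language).
    With max_count=k the language is the finite set {a^n b^n : 1 <= n <= k}.
    """
    count = 0
    i = 0
    while i < len(s) and s[i] == "a":
        count += 1
        if max_count is not None and count > max_count:
            return False  # counter capacity exceeded
        i += 1
    while i < len(s) and s[i] == "b":
        count -= 1
        if count < 0:
            return False  # more b's than a's
        i += 1
    return len(s) > 0 and i == len(s) and count == 0


# Every string over {a, b} of length <= 6 gets the same verdict from both devices.
short_strings = ("".join(t) for n in range(7) for t in product("ab", repeat=n))
print(all(accepts_anbn(s) == accepts_anbn(s, max_count=3) for s in short_strings))
```

      On any input short enough, the bounded and unbounded recognisers are indistinguishable, so behaviour under a length limit cannot by itself tell a "recursive" competence with limited processing apart from a "nonrecursive" one with less restricted processing.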

    5. Thanks Alex. I think I agree with everything you said here. But I guess I do think that there's at least plausibly a difference between cognitive systems that can handle infinite sets and ones that can't and I think competence grammars are literally inscribed in the brain (or at least some sub portion of them is - the bit that handles infinity). We have linguistic biases that are nothing to do with processing, I'm pretty sure. They guide acquisition and are just part of our cognitive set up out of the box. But maybe that's why I find the Why Only Us story logically and evolutionarily possible though I'm not really qualified to comment on its likelihood. If you have the Zuidema ref that'd be really interesting.

    6. It's here but it's only a long abstract

    7. Alex: Do you know if this Zuidema paper is on-line? One problem in the current context with an appeal to RNN, is that the syntax (i.e., the parse) is specified independently, rather than recovered from the raw data. I take it that the Zuidema approach seeks to recover the structure without syntactic supervision.

    8. Hmm. I'm not so sanguine about the capacity of LSTM RNNs to generate human style syntactic dependencies. Especially unsupervised, where they really don't seem to do what humans do. Tal Linzen has an interesting discussion of this in a narrow domain (actually a narrow domain that Chomsky pointed out in Syntactic Structures was going to be a problem for stochastic models, no matter how sophisticated). It's at: I have a brief discussion of the relevance of this in a paper I'll stick on lingbuzz (if Norbert, who's editing the volume, doesn't object!).

    9. The argument in my Evolang talk was indeed that we can train various neural network architectures to approximate/implement recognizers for formal languages that we normally describe as having recursive, hierarchical structure. Before training, there is no sense in which they are recursive, but recursion gradually enters the system.

      Rodriguez (2001) showed this for simple recurrent networks, trained to recognize A^nB^n. We have some results showing this for Recursive NNs (which indeed take a parse as given) and Gated Recurrent Units, trained on computing the outcome of simple recursive, arithmetic expressions.

      It's true that these models might not 'generate human style syntactic dependencies' as David writes. But the point is that they show a gradual route to a system that, at Marr's computational level, we would describe as recursive, while at the neural level there are no structural changes, just weight updates. I think this provides a powerful image for how we can think about the neural implementation of natural language syntax. And it provides an important warning about the conclusions one can and cannot draw from a computational level analysis.

      The talk has not been written up, but a paper with some of these results for a CL audience is here:

    10. Many thanks for the link, Willem. I couldn't tell from the paper, but do all of the models discussed operate with a syntactic/symbolic supervisor? If so, is there a way to disentangle what is given by the conditions on the supervision from what is genuinely novel? For example, you want the system to embed in specific ways, not willy-nilly, but what stops it being willy-nilly? Also, it is well to note that the Fodor-Pylyshyn objection to networks was not that they poorly model neural implementation, but that they fail to model the constraints on the computational processes such implementation realises. In an evo setting, this is going to play out differently, but I'm still not sure of the moral, if the story makes no sense at the computational level.

    11. The models we studied were all trained using supervised learning, i.e. they receive a training set with input-output pairs such as:
      input: (1+((2+3)-(3+5)))
      output: -2

      The models differ in how the brackets are interpreted. In the *recurrent* networks (SRN and GRU), the brackets are treated, like numbers and operators as words, and all words are fed one by one to the network. In the *recursive* networks, the brackets are interpreted by a external 'symbolic' control structure to determine which words are combined and in which order.

      The moral, as far as I'm concerned, is that the statement 'recursion is an all or nothing phenomenon - you can't have a little bit of it' is based on a silent assumption that the cognitive machinery underlying natural language is fundamentally discrete and symbolic. If you take the view, as I do, that the underlying machinery is continuous, you draw different conclusions: even though at the computational level modern, adult language is pretty well characterized by a recursive, symbolic grammar, a gradual route to it in both development and evolution is very well conceivable.

      For some the discrete, symbolic nature of language might seem self-evident, or established beyond reasonable doubt. I think that is a mistake. In fact, I think the recursive structure in natural language was one of the strongest arguments in favor of assuming a fundamentally discrete, symbolic nature of cognition. Now that we understand how recursive-looking language can be generated by a fundamentally continuously-valued system, that argument kind of evaporates.

    12. The way I understand John's question, or in any case, the kind of question I always feel isn't addressed in this kind of work (interesting though it is, I really appreciate every paper that tries to de-blackboxify neural networks), is:
      What does this kind of finding - system can learn hierarchy if input exhibits it, but would learn whatever the input exhibits, really - have to say about the explanandum that the input _does in fact_ exhibit hierarchy?

    13. @Willem. Is it necessary to go to the trouble of training neural networks to make this point? If you are willing to accept performance that "deteriorates with increasing length" (as the abstract puts it), then of course there are all kinds of devices lacking recursive symbolic representations that can do a decent job of evaluating arithmetic expressions. A finite state transducer, for example.

    14. Thanks Willem. Benjamin nicely articulates my underlying point. Since the language out there anyway (as it were) is not recursively hierarchical, but our understanding of it is, then the putative explanation of this capacity in terms of some processes not recursively specified is not furthered by assuming the very hierarchical structure at issue in a supervisory role. Why? Because absent such a specific supervision, the system would have responded to the input in some entirely different way.

      Also, I do not assume, and here I follow Fodor-Pylyshyn, that anyone need assume that symbolic discreteness at the computational level entails something similar at the neuronal or implementational level. So, your moral is OK as it stands, at least without further ado. I still don't know what the gradual computational story would be. There still seems to be the leap to recursion, even if all underlying processes are continuous.

    15. @Alex I don't think a finite state transducer can do this since it needs to maintain an unbounded stack of partially completed expressions in the worst case.

    16. John and Benjamin: Willem may have other points that he wants to make, but on the narrow point of the gradual emergence of recursion, I think the recurrent NN model is sufficient to make the point. A huge difference which I think is not conceptually important here is the distinction between gradual emergence via learning versus gradual emergence via evolution. I think the same arguments suffice.
      Whether this explains the origins of recursion is another issue entirely.

    17. Alex: Let's agree that phylogeny and ontogeny are not necessarily best guides to each other. I didn't assume otherwise. My worry can be simply put. If we are imagining the relevant RNN models to be models, albeit very roughly, of a possible evolutionary process, or even just to tell us something about such a possible process, then it behoves us to wonder what aspect of the putative evolutionary process is supposed to correspond to the syntactically informed supervisor? The answer isn't God or interested aliens. A better answer would be, 'Oh, the supervisor is an artefact of the modelling, which has no correlate in the evolutionary process'. Fine, but, the supervision looks not to be an artefact that can be dispensed with or cashed out in other terms with the behaviour of the system remaining the same; on the contrary, prima facie, it carries the explanatory load. Y'know, there is no teleology in nature.

    18. The recurrent model doesn't have syntactic supervision. (The recursive one does ... terrible terminology I know)

    19. @Alex. Yes I know. That is why I mentioned performance decreasing with the length of the input.

    20. Right, but we are talking about the recursive ones, which Willem above explained as all involving supervision.

    21. I think the recurrent neural networks illustrate the gradual emergence of a system that exhibits recursive behaviour. I agree that if the input is trees as in the recursive NN models, then it doesn't really bear on the point at issue.

    22. To amplify, if we take the starting model (recurrent neural network before training) to be at time t = 0, and the final model (after training) to be at time t = 1, the training sequence (suitably interpolated) will give a continuously varying set of models between t = 0 and t = 1, where there is no discrete change and yet at t = 0 the system does not exhibit recursive behaviour, while at t = 1 the system does. And I think this potentially constitutes a counterexample to Norbert's claim that recursion is an all or nothing affair.

    23. @Alex, sorry, posting without reading carefully.

      From my point of view though, a finite state transducer or indeed a "humongous look up table" would not serve as an appropriate counterexample.

    24. Thanks Alex. Yes, I see the idea, but I'm still not sure of the significance. I take Norbert's point, or at least the point Chomsky and Dawkins make, to be a formal one: you either have a function that can call itself, or you don't, just as any function that enumerates a finite set greater than some other finite set is equally distant from enumerating an infinite set as the function that enumerates the lesser set. That looks like some kind of conceptual truth, but it doesn't, all by itself, tell us how any system can or does recurse, as it were, in a way that takes us beyond the specification of the relevant function. Still, the formal truth imposes a constraint on any other story we want to tell, viz., the system must respect some formal condition, such as, at some point, employing a variable defined over strings or being able to loop (something like that). If nothing like that is provided, one is left scratching one's head. Put naively, the point is, given the formal truth of the great leap, what point between 0 and 1 corresponds to the leap? What happened? If there is no answer, then the behaviour of the system at 1 might not be recursive after all, it might just be very robust over length, say. I hope that makes sense.

    25. I don't really know what recursive means here but assuming that a push-down automaton counts as recursive in the relevant sense, then one can view these recurrent NNs as being an approximation of a deterministic push-down automaton where the stack is encoded as a vector (of weights of hidden units).
      (This is an oversimplification in several different respects).

      A function that calls itself would then correspond under some circumstances to a geometrical property of certain spaces which could be approximated gradually in various ways. So sure a circle is an all or nothing property but things can be round to greater or lesser extents. There need be no great leap.

      But more generally why should a formal property of a theory of X correspond directly to some property of X?
      Even if the theory is right. What's the standard philosophical example here? Maybe centers of gravity.

    26. Thanks Alex - very helpful. So, is your position - or at least the position you are entertaining - something as follows. There are formally specifiable devices (the kind of devices one finds on the Chomsky hierarchy), but it is potentially a mistake to think that such devices are real beyond systems approximating the idealised behaviour of such devices. In particular, the sharpness of the distinction between one device and another (going from finite embedding, as might be, to unbounded) need not be mirrored in an underlying system that simply approximates one device or another without respecting the sharp cut offs formally specifiable at the level of the devices.

      If that is the position, then a load of questions arise, but one radical thought hereabouts is that we simply are not recursive devices, but only approximate them. That sounds unduly performative, though. Think about the circle case. There are no circles, in one sense, but we see and reason over circles, not irregular arbitrary shapes. Likewise, our understanding of language appears to be sharply unbounded, even though our performance will only ever approximate unboundedness. So, the metaphysical question (well, it might just be the evolutionary one) is why we evolved a system that approximates an ideal device such that the device, rather than the approximation, reflects our understanding, just like we see and understand sharp Euclidean figures rather than variable irregular shapes.

      I don't think a formal property need correspond to some 'real' property, but insofar as the formal property is not an artefact of the notation in some sense, it must have empirical content, and so there must be some property, potentially a quite complex one, which the formal property is tracking.

    27. I think the transition from a nonrecursive to a recursive system could have occurred prior to some more explicitly recursive devices evolving, so I think the validity of the argument that Merge must have evolved instantaneously is independent of one's position on the neural reality of recursive grammars.

    28. Let's put aside the evolution business for the moment. The RNN results are interesting but are difficult to interpret. For almost all we do in formal learning research, a class of languages is defined and then we study its learnability properties (e.g., FSA, CFG, MG, etc.) The RNN stuff is different. It is an algorithm (A), and presumably there is a class of languages, L(A), that is learnable by A. So far as I know, no one in machine learning works in this algorithm-first-language-later fashion.

      The results, if sound, show that L(RNN) contains some elements similar to human languages. I'm afraid that unless we understand formally what an RNN actually does, we are not in the position to say much about anything.

    29. @Alex - A finite-state approximation of a recursive system is much less interesting, and wouldn't exhibit the graceful degradation with increasing length that we observe in RNN's. You want a model that captures the fundamental relation that exists between strings of different lengths, and that can exploit these relations when, for instance, the language is extended. E.g., imagine we build a model of the context-free language A^nB^n over classes A and B. If we add another member to class A, we want our model to easily accommodate such a change. A PDA does, and so does an RNN with the right weights -- even if its state space is continuous -- but discrete finite-state automata do not.
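
      The A^nB^n point can be sketched in a few lines of Python. This is a hedged illustration, not a model from the paper: the function name `accepts` and the token sets are hypothetical, and a single counter stands in for the PDA's stack (which suffices for this particular language). The sketch accepts the empty string as the n = 0 case.

```python
def accepts(tokens, class_a, class_b):
    """Recognise A^n B^n: n tokens from class_a followed by n tokens from class_b."""
    depth = 0        # number of A-tokens still waiting for a matching B-token
    seen_b = False   # once a B-token appears, no further A-tokens are allowed
    for tok in tokens:
        if tok in class_a:
            if seen_b:
                return False
            depth += 1
        elif tok in class_b:
            seen_b = True
            depth -= 1
            if depth < 0:
                return False
        else:
            return False  # token belongs to neither class
    return depth == 0


print(accepts(["a", "a", "b", "b"], {"a"}, {"b"}))       # prints True
print(accepts(["a", "c", "b", "b"], {"a", "c"}, {"b"}))  # prints True
```

      Extending class A with a new member changes only the set passed in; the counting logic that relates strings of different lengths is untouched. An automaton with a separate subgraph per string length would need new transitions at every length.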

      @Benjamin & John - You are absolutely right that the models I discussed (here and at Evolang'16) do not address the origins of hierarchical structure. I was only concerned with the question whether or not 'a gradual route to recursion' is a logical possibility (but check out my Evolang'08 abstract :)). I'm the first to admit there are many fundamental questions to be asked about how neural machinery may deal with linguistic structure, and about why natural languages have the complex structure that they do. The frame 'language is recursive, recursion is an all or nothing phenomenon, hence recursive language must have emerged in a single sweep' is, however, unhelpful in answering those questions (just as unhelpful as 'language is context-sensitive, context-sensitive languages are unlearnable, language is not learned' -- Evolang'02), because the inferences are incorrect.
      The graceful degradation also answers John's other concern - that there would still be a 'leap' to recursion, even if the underlying processes are continuous. If we agree on a specific criterion for 'recursiveness' there is of course a particular point along a continuous trajectory where we might suddenly meet that criterion. But what would the criterion be? 5 levels of embedding? 4? 3? Any criterion is arbitrary. A neural system can approximate the Platonic ideal of a symbolic recursive system to an arbitrary degree. The Platonic ideal (the competence grammar, if you like) is useful, but not because it is the true underlying system corrupted by performance noise, but because it is the attractor that the continuous system is approaching. Knowing what the attractor is, and *why* it is the attractor, is immensely useful. In the same way, good symbolic grammars that capture important generalizations about language might be immensely useful. But they shouldn't be confused with the cognitive reality.

    30. @Charles - you're absolutely right that most formal learning work starts with classes of languages, and tries to find learning algorithms (or guarantees for learning algorithms) for that class, and that in this work (as in most current machine learning, I would say) the learning algorithms come first and questions about what class of languages they can learn are largely unanswered. But I'd argue that formal learning theory has it the wrong way around. The primate brain was there first. Language, when it emerged and started to be transmitted from generation to generation, had to pass through the bottleneck of being learnable by that primate brain, plus or minus a few human-specific and perhaps language-specific innovations. The class of learnable languages ("constraints on variation") is thus in many ways a derivative of the algorithm.

    31. There is a nice quote from this paper in Nature which has Partha Niyogi as a coauthor, which puts this very well:

      "Thus, for example, it may be possible that the
      language learning algorithm may be easy to describe mathematically
      while the class of possible natural language grammars may be
      difficult to describe."

      It has some interesting discussion on this topic, though it is quite technical in parts.

    32. @Willem. It’s actually very easy to get graceful degradation with length using a finite state transducer. Say that the FST has n bits available to store context, and think of the context as a stack of numbers. If the transducer has to store m numbers, then it can use n/m bits to store the value of each number. When an additional number is added to the context, the number of bits available to represent the existing numbers decreases. So the more numbers the transducer has to ‘remember’, the less accurately the value of each number is represented. This is of course exactly what we see with floating point arithmetic in computers. You can convert an array of 64-bit floats to an array of 32-bit floats and halve the storage requirements — but at the cost of precision.
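
      A minimal Python sketch of the bit-budget idea described above, under stated assumptions: the function names `store_with_budget` and `max_error` are hypothetical, values are taken to lie in [0, 1), and the budget is split evenly (n // m bits per value). It is meant only to show precision shrinking as more values share a fixed memory.

```python
def store_with_budget(values, total_bits):
    """Quantise each value in [0, 1) using an equal share of a fixed bit budget."""
    bits_each = total_bits // len(values)   # n // m bits per stored number
    levels = 2 ** bits_each                 # distinct values representable per number
    # Round each value down to the nearest representable level.
    return [int(v * levels) / levels for v in values]


def max_error(values, total_bits):
    """Largest gap between an original value and its stored approximation."""
    stored = store_with_budget(values, total_bits)
    return max(abs(v - s) for v, s in zip(values, stored))


context = [0.3, 0.7, 0.55, 0.9]
print(max_error(context[:2], 16))  # two numbers, 8 bits each: error below 1/256
print(max_error(context, 16))      # four numbers, 4 bits each: error below 1/16
```

      The worst-case error per value is bounded by 2 ** -(n // m), so the bound loosens smoothly as the number of remembered values m grows while n stays fixed.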

    33. @AlexD With 'finite state approximation' I meant an FS automaton that needs a separate subgraph for each length of string. You're talking about an automaton/transducer with a memory, very similar to a PDA (but with a bounded stack). And yes, such a model can both model the recursion and show graceful degradation. In fact, you could use it to make the exact same point I was making with RNN's: by gradually varying the memory capacity from 0 to infinity you show a gradual path to recursion.

    34. @Willem. I originally said "finite state transducer". The device I described in my previous comment is a transducer with finite memory. Depending on exactly how the relevant terms are defined, it either is a finite state transducer, or is trivially equivalent to one.

      You could indeed use such a device to make the same point as you were making with RNNs. That is why I questioned whether the RNN simulations are really all that relevant to the point that you, Alex C and others are making about recursion. It seems to me that this is a very straightforward point that can be made in a few sentences on the basis of a bit of armchair reflection.

    35. Agreed. But somehow people keep making the same argument about 'no gradual route to recursion', so I have tried to illustrate the counterargument in various ways. Plus: we're interested in RNN's expressivity for their own sake, and in ways of learning hierarchical structure from examples (here through backprop).

    36. This comment has been removed by the author.

    37. I feel the RNN example is a little more convincing than e.g. a PDA with a finite bound on the stack, for a couple of different reasons, but that depends on some of the details of the argument that it is meant to be a counterexample to, and those details are a little obscure. (For a start, what sense of "recursion" we are meant to be using; Willem and Alex D seem to be converging on a sense that has something to do with the boundary between regular and context-free formalisms, and I am not sure that is the relevant sense.)

    38. @Alex C. I'm not sure if I really understand the argument either, and I don't have any particular sense of recursion in mind, for whatever that's worth.

    39. @Willem. Thanks for the clarification and the further interesting thoughts. Just a little point, if I may. I agree, of course, that any specific level of embedding would be arbitrary as a criterion for 'recursion' (here just letting recursion be CF-ish). I take the Chomsky-Norbert line hereabouts precisely to be that you only have recursion if there is no finite level that would suffice for recursion; hence, there is no finite criterion. That is, the relevant rule/principle/function delivers an infinite output (any restriction would presuppose the unrestricted case). Thus, no mere approximation to the ideal would count, just as, as discussed above, no irregular shape counts as a circle. What is real here is nothing platonic, an infinite object, but just the principles that apply unrestrictedly. Something like this holds for counting. One can count only once one has a conception of the infinite, albeit implicitly. FWIW, I tend to steer clear of evo issues, but recursion in the relevant sense is an all or nothing property, so I still wonder how one can arrive at it gradually. If I read you aright, you claim we never arrive at it, but only approximate it, which seems conceptually awry, although I appreciate how it undermines the argument at hand.

    40. What is the evidence that *humans* arrive at it, rather than approximate it?

    41. Their understanding of any given instance of a relevant construction provides the evidence. If I understand 'The boy behind the girl', then I understand 'The girl behind the boy behind the girl'. That is something like a logical truth. There is no finite bound on this understanding, although there is on performance; any such bound would be arbitrary, and get the counterfactuals wrong (if I had enough time,...). It is a bit like asking how do you know there is no greatest natural number, maybe we only ever approximate the unbounded. I think empiricism is conceptually false for these kinds of reasons. It trades in the clear and distinct, as Descartes would have it, for a more or less approximation without explaining what is approximated, for what is approximated is presupposed in treating the approximation as an approximation to it.

    42. What makes this new generation of networks so interesting is that they appear to pass an equivalent kind of test for 'recursiveness' that you now formulate for humans. I.e., if the network we describe understands (2+3)-7, then it also understands (7-3)-(2+(3+2)). There is no finite bound on this understanding, although there is on performance: for longer expressions the expected error becomes larger.

      The models remain simplistic, and it is easy to find unrealistic properties. But we should use these models not as efforts to *imitate* human language processing, but as conceptual tools: as a kind of thought experiment that helps us critically assess whether the reasoning steps we take as linguists -- the ways we go from empirical observations to theoretical constructs -- are valid.

      I don't think 'empiricism' or 'nativism' are very useful labels anymore for this kind of discussion. When I talk to people in old school connectionism, they think I am too obsessed with hierarchical structure, and too open to language-specific biological specialisations. But I think both camps are missing something important: there is non-trivial, hierarchical structure in natural language, and there are now neural models that can really do justice to it, in the sense that they support the kind of generalizations that you discuss. Symbolic grammars remain a useful level of description, as they avoid the unwieldy mess of matrices, vectors, and nonlinearities in neural networks. But they are best thought of as an approximation/idealisation of the messy neural reality. For many linguistic questions working with that approximation is fine, in the same way that Newtonian physics is perfectly fine for physics at a day-to-day scale. But for some questions, including those about the neural basis and evolution of language, it is not.

    43. @Willem. That is interesting. I was taking the view to be one that didn't really endorse a comp/per distinction. How is understanding depicted when it separates from performance?

      Just on 'empiricism': I had in mind more the traditional picture of abstraction as a means of arriving at the constituents of cognition a la Locke et al., rather than a direct opposition to nativism, although the two strands obviously coalesce.

      I agree that reality is messy. I suppose the issue is how far explanation will take us into the mess or force us to always remain at a level of idealization and treat the mess as noise. Pleased to hear the PDP-ers are aghast :)

  8. This comment has been removed by the author.

  9. Reposted with my actual name.

    Here is an attempt at something that might be considered an incremental derivation of recursion (in the computational/algorithmic sense), although I'm not sure exactly what the evaluation metric for "incremental" vs "non-incremental" is:

    1) Minds are able to reference statements (DO x), which are immutable and unordered with respect to each other

    2) Minds can group immutable statements into series of ordered statements (proto-functions)

    3) Statements and groups of statements can take numerical arguments (i.e., you can change the magnitude of some element of X in your DO X statement)

    4) Statements and groups of statements can take other statements/groups of statements as arguments (recursion!)

    Now, technically only the move from 3 to 4 generated recursion, but it was built on a number of prior steps that set the groundwork for recursion. Does that count as incremental/provide the kind of sub-components that Norbert is looking for?
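
    The four steps can be sketched in Python as follows. This is only one possible reading of them, under loud assumptions: all names (`do_x`, `do_y`, `sequence`, `repeat_x`, `twice`) are hypothetical, statements are modelled as functions that append to a log, and "recursion" is taken in the thread's sense of an operation that can apply to its own output.

```python
# Step 1: atomic statements (DO x), fixed and unordered relative to each other.
def do_x(log):
    log.append("x")


def do_y(log):
    log.append("y")


# Step 2: group statements into an ordered series (a proto-function).
def sequence(*statements):
    def run(log):
        for statement in statements:
            statement(log)
    return run


# Step 3: a statement that takes a numerical argument.
def repeat_x(times):
    def run(log):
        log.extend(["x"] * times)
    return run


# Step 4: a statement that takes another statement as its argument.
def twice(statement):
    return sequence(statement, statement)


log = []
twice(twice(sequence(do_x, do_y)))(log)  # twice applied to its own output
print(log)  # prints ['x', 'y', 'x', 'y', 'x', 'y', 'x', 'y']
```

    In this rendering, steps 1 to 3 supply the things that step 4 operates on, but nothing in `twice` depends on what those things are, which is the point at issue in the reply that follows.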

    1. I think you prove Norbert's conjecture that people are running together the function of Merge with its particular use in human FL. What your steps in 1-3 do is prejudice the input that the recursive function eventually takes, but the step from 3 to 4 is only dependent on 1-3 to the extent that it requires those specific inputs to exist. Given the bare possibility of some mental system that consists of a function that takes its own output as input where that input cannot be described by your 1-3, then the recursion secured by 3-to-4 must be independent of the precursors you specify. So in what sense is it any more illuminating to say that 1-3 preceded Merge in this specific instance than it is to say that eukaryotic cells preceded Merge?

  10. I agree that the change from 1-2 could be thought of as irrelevant, but the change from 2-3 establishes the notion of functions (i.e., statements that take some input). 3-4 builds on this step by expanding the domain of possible inputs to include other functions, which creates recursion by now allowing a function to be its own input. If that isn't an incremental development of recursion, I'm not sure I understand what an incremental approach to any phenomenon would be. Incidentally, it might be helpful to lay out a classic case of incremental evolution for comparison to get a sense of the relevant granularity.

  11. It seems like most of the counterarguments to Chomsky’s position could be summed up using an old analogy: “Chomsky is confusing the map with the territory.” Maps are useful tools for understanding the territory, but not everything that is true of the map is also true of the territory, e.g., you won’t fall off the edge of the world when you walk outside your map.

    Mathematical formalisms are some of the best maps we have. But it’s still easy to think of deductive arguments where the map/territory confusion leads to false claims about the real world. For example, Zeno’s paradox “proves” that a faster object cannot overtake a slower object, which is obviously false. So, what the paradox is really showing us is that a convergent geometric series (map) is a bad way of describing constant motion (territory). In other words, Zeno found the edge of the map for the geometric series.

    The deductive arguments about bounded/unbounded grammars are uncontroversially describing properties of the formalism (i.e. map) we are using. That is, a discrete, non-enumerative system has no way of expressing something that is half-way between bounded and unbounded. But the question is, does Chomsky’s argument actually tell us something about the evolution of language (territory), or has he just reached the edge of our map?

    1. Hi Joe, I don't think I see the metaphor. We humans never have access to `the territory' when we build scientific theories. Only the maps. Our minds aren't set up to directly perceive reality, so we need to build theories that connect to reality in helpful ways for understanding it. Evolutionary theory provides the principles for drawing `maps' of what happened in the history of life on Earth, physical theories provide the wherewithal for drawing (quite bizarre) maps of physical reality, linguistic theories for what knowledge of language humans have, etc. We can't directly understand the world as it is, all we can do is build theories of it that provide us with understanding of that world. The `map' Chomsky provides may be inaccurate, but I don't think there's a better one (personal opinion here) for charting the structures of human language, and, as various people have been saying in this thread, it gives us at least some directions for what happened in the evolution of language, though so little of that territory is accessible to us that the old map marking `here be dragons' may well be true.

    2. Hi David,
      I would actually agree with almost everything you've said, although I don't think it undermines the point of the analogy.

      (Although I realise my wording was a bit lazy in places. I didn't mean to imply that evolutionary theory is ontologically truer than linguistic theory).

      If, as you say, we have access to the maps but never the territory, then it is very hard to know which things are truths about the territory and which things are just artifacts of the way our maps are put together. That makes it very risky to argue, as Chomsky does, that something must be there just because it's on one of the maps. As soon as we acknowledge that the map isn't the territory, then we're faced with the possibility that looking for the evolution of unboundedness might be like digging up sand to find the Egyptian border.

      Extending the analogy a bit, another approach might be to cross-reference as many maps as possible, which I take to be analogous to the point (argued here and elsewhere, I think) that Chomsky's deductive argument from formal theory isn't sufficient: We need the insight from those other disciplines as well.

    3. I think the argument is not that it's on the map so it must be there, it's more that we don't have an alternative account. Well, we do, I guess, something like construction grammar. But that gives us no real grip on the fact of generativity, I think, for reasons I've adumbrated elsewhere. So we don't really have an alternative map that gives us a better perspective on the territory. But I would agree, and Chomsky has been saying for decades, that all sorts of evidence are useful for understanding a phenomenon. No one, I think, is disputing that.