Thursday, February 05, 2015

Quantitative methods in linguistics: The danger ahead

Peter Hagoort has written a nice piece on his take on the future of linguistics.

He's very gentle on linguists in this piece. One of his suggestions is to do proper experimental research instead of relying on intuition. Indeed, the field of linguistics is already moving in that direction. I want to point out a potentially dangerous consequence of the move towards quantitative methods in linguistics.

My expectation is that with the arrival of more and more quantitative work in linguistics, we are going to see (actually, we are already seeing) a new kind of degradation in the quality of work done, different from the kind linguistics has already experienced thanks to the tyranny of intuition in theory-building.

Here are some things that I have personally seen linguists do (and psycholinguists do this too, even though they should know better!):

1. Run an experiment until you hit significance. ("Is the result non-significant? Just run more subjects; it's going in the right direction.")
2. Alternatively, if you are looking to prove the null hypothesis, stop early or just run a low power study, where the probability of finding an effect is nice and low.
3. Run dozens (in ERP, even more than dozens) of tests and declare significance at 0.05.
4. Vary the region of interest post-hoc to get significance.
5. Never check model assumptions.
6. Never replicate results.
7. Don't release data and code with your publication.
8. Remove data as needed to get below the 0.05 threshold.
9. Only look for evidence in favor of your theory; never publish against your own theoretical position.
10. Argue from a null result that the effect does not exist.
11. Reverse-engineer your predictions post-hoc after the results show something unexpected.
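
Point 1 on this list is easy to demonstrate with simulation. The following is an illustrative sketch in Python (the post itself involves no code; all names here are made up): even when the null hypothesis is true, repeatedly testing and adding subjects until p < 0.05 inflates the false-positive rate well above the nominal 5%.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

def optional_stopping(n_start=20, n_max=200, step=10, alpha=0.05):
    """Test after every batch of subjects; stop as soon as p < alpha."""
    x = rng.normal(0, 1, n_start)  # null is true: population mean is 0
    while len(x) < n_max:
        if stats.ttest_1samp(x, 0).pvalue < alpha:
            return True  # "significant" result, study stops here
        x = np.append(x, rng.normal(0, 1, step))
    return stats.ttest_1samp(x, 0).pvalue < alpha

false_pos = np.mean([optional_stopping() for _ in range(2000)])
print(f"False-positive rate with optional stopping: {false_pos:.2f}")
```

With this peeking schedule the rate comes out far above 0.05, even though each individual test is a perfectly ordinary t-test.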

I could go on. The central problem is that doing experiments requires a strong grounding in statistical theory. But linguists (and psycholinguists) are pretty cavalier about acquiring the relevant background: have button, will click. No linguist would think of running his sentences through some software to print out his formal analyses; you need expert knowledge to do linguistics. But the same linguist will happily gather rating data and run some scripts or press some buttons to get an illusion of quantitative rigor. I wonder why people think that statistical analysis is exempt from the deep background that is so necessary for doing linguistics. Many people tell me that they don't have the time to study statistics. But the statistics is the science. If you're not willing to put in the time, don't use statistics!

I suppose I should be giving specific examples here; but that would just insult a bunch of people and would distract us from the main point, which is that the move to doing quantitative work in linguistics has a good chance of backfiring and leading to a false sense of security that we've found something "real" about language.

I can offer one real example of a person I don't mind insulting: myself. I have made many, possibly all, of the mistakes I list above. I started out with formal syntax and semantics, and transitioned to doing experiments in 2000. Everything I knew about statistical analysis I learnt from a four-week course I did at Ohio State. I discovered R by googling for alternatives to SPSS and Excel, which had by then given me RSI. I had the opportunity to go over to the Statistics department to take courses there, but I missed that chance because I didn't understand how deep my ignorance was. The only reason I didn't make a complete fool of myself in my PhD was that I had the good sense to go to the Statistical Consulting section of OSU's Statistics department, where they introduced me to linear mixed models ("Why are you fitting repeated measures ANOVAs? Use nlme."). It was only after I did a one-year course in Sheffield's Statistics department that I finally started to see what I had missed (I reviewed this course here).

For linguistics, becoming a quantitative discipline is not going to give us the payoff that people expect unless we systematically work at making a formal statistical education a core part of the curriculum. Currently, we have advanced fast in adopting experimental methods, but have made little progress in developing a solid understanding of statistical inference.

Obviously, not everyone who uses experimental methods in linguistics falls into this category. But the problems are serious, in linguistics and psycholinguistics alike, and it's better to recognize this now rather than let thousands of badly done experiments and analyses lead us down another garden path.


Colin Phillips said...

I'm surprised that Peter again latched onto the quantitative methods meme in his piece, and I think that the danger lies elsewhere than where you place it.

The claim that "We would take you linguists seriously if only you used our methods" is just false. Our group sees this all the time. We apply quantitative approaches and schmancy experimental tools to phenomena that linguists find (mildly) interesting. This does not elicit the reaction "Oh, now we see that this linguistic phenomenon [insert your favorite here] is real and should be taken seriously." Of course not. The real reasons why the phenomena are ignored are that we do a poor job of explaining them; or folks simply believe that the key to language lies elsewhere (that's their prerogative, we all place our bets); or they see the relevance but don't feel comfortable getting involved themselves.

And I disagree that there's grave danger in misapplication of statistical methods to acceptability judgments. I get abstracts to review on this quite frequently nowadays, and that's rarely an issue. Why? Because the stats in most acceptability rating studies are rarely subtle. (That's different from online measures.) The concerns are more commonly: (i) emphasis on the numbers crowds out analysis of what-it-all-means; (ii) lack of a link between numbers and representational conclusions -- just because you can get a small-but-reliable acceptability contrast between A and B doesn't mean that you've shown that A is grammatical and B is ungrammatical.

Hagoort's other diagnoses are perhaps more relevant. Though his proposed remedies are not obviously related to the diagnoses.

Shravan Vasishth said...

I agree that using "psycholinguistic" methods is not, by itself, going to make people take linguists seriously.

But that's not my point. My point about the dangers ahead also did not get across.

Colin, you say that misapplication is rarely a problem in acceptability rating studies. Are there cases where it is a problem? Some problems I have seen in rating studies:

-studies in which a 1-3 rating scale is treated as a continuous measure
-low-power studies arguing in favor of null results
-running experiments until significance is reached
-not checking model assumptions, leading to invalid inference
-not replicating results to check for consistency of outcome
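
The low-power problem in particular is easy to see by simulation. Here is an illustrative Python sketch (not from the original discussion; the effect size and sample sizes are made up for the demo): when a real but modest effect exists, a small study will usually miss it, so its null result is uninformative.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

def power(n, effect=0.3, nsim=2000, alpha=0.05):
    """Proportion of simulated two-group studies that detect a true effect d=effect."""
    hits = 0
    for _ in range(nsim):
        a = rng.normal(0, 1, n)           # control group
        b = rng.normal(effect, 1, n)      # the effect really exists
        if stats.ttest_ind(a, b).pvalue < alpha:
            hits += 1
    return hits / nsim

low = power(n=15)    # a typical small study
high = power(n=180)  # what it would take to reach ~80% power at d=0.3
print(f"power with n=15: {low:.2f}; with n=180: {high:.2f}")
```

With 15 subjects per group the study detects the true effect only a small fraction of the time; a non-significant result from such a study cannot be read as evidence that the effect is absent.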

Even experienced psycholinguists make these errors. These practices compromise the integrity of statistical inference and constitute misuse of statistical methods. That's the danger I am referring to. As regards misuse, psycholinguistics and psychology are already there.

I'm only saying that to the extent that a greater use of experimental methods is needed in linguistics, linguistics should adopt better standards and learn from the mistakes of psychology and psycholinguistics.

Colin Phillips said...

I'm sure that you can find plenty of cases where the statistical analyses used in an acceptability rating study would not get an A+ in stats class. I just don't think that this is where the danger lies. This is specifically a claim about acceptability rating studies, which typically just aren't terribly subtle. The danger is that so much attention is paid to the stats -- and people get so worried about being brow-beaten by folks like you -- that it sucks attention away from more important issues that are harder to codify in a rule book. Like: why are we testing this contrast? What does the contrast tell us about representations? Were the materials well designed? Does our theory predict the effect size that we're seeing? Etc. etc.

Stats *look* intimidating, but they're the easy part. They draw attention away from the harder part of the inferential process.

Shravan Vasishth said...

But Colin, that's exactly the kind of dismissive attitude towards statistics that I'm warning against :).

The statistics *is* the science. It's the clean separation of statistics from linguistics that you make that I'm suggesting should end. Questions that you consider non-statistical, like "why are we testing this contrast? What does the contrast tell us about representations? Were the materials well designed? Does our theory predict the effect size that we're seeing?" are all integral to statistical training: the design of experiments, the setting up of the analyses, the interpretation in context.

"Stats *look* intimidating, but they're the easy part. They draw attention away from the harder part of the inferential process."

Well, I don't know what you mean by stats vs. inference. For me they are the same thing. I'm saying that apparently statistics is not the easy part, because what you refer to as brow-beating concerns fatal problems in the *inferential process* that you are ultimately interested in. My objections are not just for fun, or because I like statistics per se. Blowing off objections to the inferential process as "brow-beating" and (by implication) a distraction from the real work of science is just a way to avoid dealing with these issues. I am suggesting that we should acknowledge that we do this (as I wrote, I have made many of these mistakes) and stop making them.

The reason I "brow-beat" people is not because I want them to do the stats right. I want them to not draw invalid inferences because it wastes everyone's time. If someone were to point out a flaw in reasoning in a linguistic analysis, you wouldn't call it brow-beating, that would be called a reasoned peer review. If it's about what can be inferred from the statistics, that becomes "brow-beating", because it's perceived as lying outside the core of what we do in linguistics. I'm saying that we should just integrate statistical training into the core of the curriculum in linguistics.

A common example is arguing for the null hypothesis based on low power. The stats are easy, sure...if one is aware of the issues. Otherwise, they are hard, and apparently that's the case currently in linguistics and psycholinguistics and psychology.

Another example is sequential testing, running till we hit significance, and not correcting for that.

Another example is publishing only what fits the story.

The way to deal with all this in stats is easy, I agree with you. But it's not what we do in our field.

My point is only that it's not that hard to fix these kinds of easy problems.

Colin Phillips said...

(A couple of quick reminders at the outset, in case anybody is watching this: I’m not saying that statistics are a waste of time. (Do good stats, eat vegetables, and don’t forget to floss regularly. All part of a healthy lifestyle.) Nor am I saying that there’s a distinction between “real” linguistic analysis and statistical analysis. Nor am I saying that stats are never the crux of the problem. Any of those claims would be going far beyond what I wrote.)

The specific claim was this: in acceptability judgment studies, the stats is rarely where the action lies. This is just my impression based on doing lots of them myself (we’ve run 50+ such studies in the past few years), and based on reading lots of abstracts that report such studies. In most cases we see one of the following two scenarios: (i) the contrasts are so big that one barely needs to run statistical tests. (Sure, feel free to include them, but they’re not terribly informative.) Or (ii) the contrasts are so subtle that the stats don’t really address the question at hand. If the investigator is asking the question: “Is X well-formed and Y ill-formed?”, then the response “X elicits a barely noticeable but statistically reliable increase in ratings” is not really an answer to the question under discussion, however perfectly the quantitative analyses might have been carried out.

Yes, there are cases of acceptability studies where the stats matter more, and where they are commensurate with the hypotheses being tested. We’ve done some studies where that really matters, e.g., in our studies on island effects and individual differences (Sprouse, Wagers, & Phillips, 2012) we’re testing predictions that it’s just impossible to test via individual intuitions. The same is the case in Ted Gibson’s studies on co-variation among acceptability ratings across different constructions. But most of the studies that we do, and most studies that I read about (again, I’m talking just about acceptability ratings here) are far simpler.

So a suitable rejoinder to my remark is not, “but stats so do matter!” As a general statement I agree. A more relevant rejoinder would be: “No, acceptability judgment studies generally do fine in terms of interpreting the (non-)effects that they see; the most frequent problem in such studies lies in whether the (non-)effects are statistically justified.”

Stats vs. further inference: what’s the difference?

Example 1: acceptability rating study, comparing X and Y
Stats question: are ratings for X and Y reliably different?
Further inference: X is grammatically well-formed and Y is ill-formed.

Example 2: analyses of fixation patterns in a visual world study (display with 2 objects)
Stats question: do comprehenders look to object X reliably earlier when it is preceded by a disambiguating gender-marked determiner?
Further inference: comprehenders are great at using gender cues to predict upcoming nouns (outside a closed 2-object world).

In both instances, I worry more about how we generalize from the quantitative analyses than I do about the quantitative analyses themselves.

Statistics elicits a lot of anxiety. The rest of the big complex of inferences that we make attracts rather little anxiety. In the one case people are afraid of looking ignorant, or being told that they’re lying, or being accused of scientific malpractice. In the other case folks just shrug it off. That mismatch worries me.

Shravan Vasishth said...

Sounds like we are in agreement then. Both statistical reasoning and linguistic reasoning influence the conclusions you can draw.

My original post was intended to communicate to linguists that they should not make the same mistakes psycholinguists and psychologists have historically made. They should study statistical theory and practice with the same seriousness they bring to linguistic theory. Statistics is not an add-on or afterthought but an integral part of doing science. We should teach the methods better than we do so that future generations can make better use of the tools available.

Chris Brew said...

Very reasonable discussion. What should be added, for completeness, is that there are few (if any) scientific fields where people use statistics well. Psychology has a particular problem with the use of "cookbook" methods for well-established experimental paradigms. The predominance of ANOVA is due to the fact that cognitive psychologists are used to it and don't typically think hard when they see it used in a paper. It is a culturally approved way of satisfying the demand for statistical analysis in cross-modal priming papers and, by incorrect analogy, in all papers. This is usually harmless, occasionally disastrous.

Shravan Vasishth said...

I think epidemiology might be an exception as a field, but not sure. Maybe also experimental physics? But I can't think of any other field, so I guess you are right.

In a world where we have linear mixed-effects models, the only reason to keep using ANOVA is to somehow squeeze a p<0.05 result out of your analysis. Oh, and cases where LMMs cannot be fitted (which happens sometimes).
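
For readers unfamiliar with the ANOVA-vs-LMM contrast above, here is a minimal sketch of fitting a linear mixed model with by-subject random intercepts. The discussion mentions nlme in R; this Python version using statsmodels is purely illustrative, and the simulated data (30 subjects, a true condition effect of 0.5) are made up for the demo.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(4)

# Simulated ratings: 30 subjects x 2 conditions x 10 trials, with
# by-subject intercept variation that trial-level ANOVA would ignore.
n_subj, effect = 30, 0.5
subj_int = rng.normal(0, 1.0, n_subj)  # random intercept for each subject
rows = []
for s in range(n_subj):
    for cond in (0, 1):
        for _ in range(10):
            rating = 2.0 + subj_int[s] + effect * cond + rng.normal(0, 0.5)
            rows.append((s, cond, rating))
df = pd.DataFrame(rows, columns=["subj", "cond", "rating"])

# Mixed model: fixed effect of condition, random intercept by subject.
m = smf.mixedlm("rating ~ cond", df, groups=df["subj"]).fit()
print(m.params["cond"])  # estimate of the condition effect
```

The estimated fixed effect recovers the true condition effect while the model explicitly accounts for between-subject variability, instead of averaging it away as a by-subject ANOVA would.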

Titus von der Malsburg said...

My experience with data from judgement studies is that they can present complications that are even harder to solve than those found in analyses of reading/reaction times.

One issue is that the differences between the levels of the Likert scale are typically not (psychologically) equal, and sometimes not even symmetric (i.e., the difference between levels 1 and 2 can be bigger or smaller than that between levels 6 and 7). This is not an issue when just two conditions are compared, but it substantially complicates the evaluation of interactions.

Another issue is homogeneity of variances, which is often not given. I encountered one case where the effect of the manipulation was expressed not in a difference of the means but in a substantial difference in variances, while the means were almost the same. A simple test would not have caught this effect, and the conclusion would have been that there is nothing going on.

Potential solutions to problems like these are ordered logistic regression and beta regression, but it is not always clear when to use which, because each addresses only part of the problem. Bottom line: unless the experimental design is completely trivial, the stats are potentially tricky even if the effects are large.
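
The equal-means-different-variances case described above can be sketched in a few lines of Python (an illustrative simulation, not Titus's actual data; the means and spreads are invented): a t-test compares means and so will typically see nothing here, while a test on the variances picks up the manipulation.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)

# Two conditions with the same population mean but very different spread.
cond_a = rng.normal(4.0, 0.5, 60)  # ratings tightly clustered around 4
cond_b = rng.normal(4.0, 1.8, 60)  # same mean, much more variable

t_p = stats.ttest_ind(cond_a, cond_b, equal_var=False).pvalue
lev_p = stats.levene(cond_a, cond_b).pvalue
print(f"t-test p = {t_p:.3f}, Levene p = {lev_p:.4f}")
```

Levene's test reliably flags the variance difference; a researcher who only ran the t-test would likely conclude, wrongly, that the manipulation had no effect.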