Shravan Vasishth's Slog (Statistics blog)

Thursday, November 12, 2020

New paper: A computational evaluation of two models of retrieval processes in sentence processing in aphasia

Here is another important paper from my lab, led by my PhD student Paula Lissón, with a long list of co-authors.

This paper, which also depends heavily on the amazing capabilities of Stan, investigates the quantitative predictions of two competing models of retrieval processes, the cue-based retrieval model of Lewis and Vasishth, and the direct-access model of McElree. We have done such an investigation before, in a very exciting paper by Bruno Nicenboim, using self-paced reading data from a German number interference experiment.

What is interesting about this new paper by Paula is that the data come from individuals with aphasia and control participants. Such data is extremely difficult to collect, and as a result many papers report experimental results from a handful of people with aphasia, sometimes as few as 7 people. This paper has much more data, thanks to the hard work of David Caplan.

The big achievements of this paper are that it provides a principled approach to comparing the two competing models' predictions, and it derives testable predictions (which we are about to evaluate with new data from German individuals with aphasia---watch this space). As is always the case in psycholinguistics, even with this relatively large data-set, there just isn't enough data to draw unequivocal inferences. Our policy in my lab is to be upfront about the ambiguities inherent in the inferences. This kind of ambiguous conclusion tends to upset reviewers, because they expect (rather, demand) big-news results. But big news is, more often than not, just illusions of certainty, noise that looks like a signal (see some of my recent papers in the Journal of Memory and Language). We could easily have over-dramatized the paper and dressed it up to say way more than is warranted by the analyses. Our goal here was to tell the story with all its uncertainties laid bare. The more papers one can put out there that make more measured claims, with all the limitations laid out openly, the easier it will be for reviewers (and editors!) to learn to accept that one can learn something important from a modeling exercise without necessarily obtaining a decisive result.

Download the paper from here: https://psyarxiv.com/r7dn5

A computational evaluation of two models of retrieval processes in sentence processing in aphasia

Abstract:

Can sentence comprehension impairments in aphasia be explained by difficulties arising from dependency completion processes in parsing? Two distinct models of dependency completion difficulty are investigated, the Lewis and Vasishth (2005) activation-based model, and the direct-access model (McElree, 2000). These models’ predictive performance is compared using data from individuals with aphasia (IWAs) and control participants. The data are from a self-paced listening task involving subject and object relative clauses. The relative predictive performance of the models is evaluated using k-fold cross validation. For both IWAs and controls, the activation model furnishes a somewhat better quantitative fit to the data than the direct-access model. Model comparison using Bayes factors shows that, assuming an activation-based model, intermittent deficiencies may be the best explanation for the cause of impairments in IWAs. This is the first computational evaluation of different models of dependency completion using data from impaired and unimpaired individuals. This evaluation develops a systematic approach that can be used to quantitatively compare the predictions of competing models of language processing.

Wednesday, November 11, 2020

New paper: Modeling misretrieval and feature substitution in agreement attraction: A computational evaluation

This is an important new paper from our lab, led by Dario Paape, and with Serine Avetisyan, Sol Lago, and myself as co-authors.

One thing that this paper accomplishes is that it showcases the incredible expressive power of Stan, a probabilistic programming language developed by Andrew Gelman and colleagues at Columbia for Bayesian modeling. Stan allows us to implement relatively complex process models of sentence processing and test their performance against data. Paape et al show how we can quantitatively evaluate the predictions of different competing models. There are plenty of papers out there that test different theories of encoding interference. What's revolutionary about this approach is that one is forced to make a commitment about one's theories; no more vague hand gestures. The limitations of what one can learn from data and from the models is always going to be an issue---one never has enough data, even when people think they do. But in our paper we are completely upfront about the limitations; and all code and data are available at https://osf.io/ykjg7/ for the reader to look at, investigate, and build upon on their own.

Download the paper from here: https://psyarxiv.com/957e3/

Modeling misretrieval and feature substitution in agreement attraction: A computational evaluation

Abstract

We present a self-paced reading study investigating attraction effects on number agreement in Eastern Armenian. Both word-by-word reading times and open-ended responses to sentence-final comprehension questions were collected, allowing us to relate reading times and sentence interpretations on a trial-by-trial basis. Results indicate that readers sometimes misinterpret the number feature of the subject in agreement attraction configurations, which is in line with agreement attraction being due to memory encoding errors. Our data also show that readers sometimes misassign the thematic roles of the critical verb. While such a tendency is principally in line with agreement attraction being due to incorrect memory retrievals, the specific pattern observed in our data is not predicted by existing models. We implement four computational models of agreement attraction in a Bayesian framework, finding that our data are better accounted for by an encoding-based model of agreement attraction, rather than a retrieval-based model. A novel contribution of our computational modeling is the finding that the best predictive fit to our data comes from a model that allows number features from the verb to overwrite number features on noun phrases during encoding.

Tuesday, November 10, 2020

Is it possible to write an honest psycholinguistics paper?

I'm teaching a new course this semester: Case Studies in Statistical and Computational Modeling. The idea is to revisit published papers and the associated data and code from the paper, and p-hack the paper creatively to get whatever result you like. Yesterday I demonstrated that we could conclude whatever we liked from a recent paper that we had published; all conclusions (effect present, effect absent) were valid under different assumptions! The broader goal is to demonstrate how researcher degrees of freedom play out in real life.

Then someone asked me this question in the class:

Is it possible to write an honest psycholinguistics paper?

The short answer is: yes, but you have to accept that some editors will reject your paper. If you can live with that, it's possible to be completely honest.

Usually, the only way to get a paper into a major journal is to make totally overblown claims that are completely unsupported or only very weakly supported by the data. If your p-value is 0.06 but you want to claim it is significant, you have several options: mess around with the data till you push it below 0.05. Or claim "marginal significance". Or you can bury that result and keep redoing the experiment till it works. Or run the experiment till you get significance. There are plenty of tricks out there.

If you got super-duper low p-values, you are on a good path to a top publication; in fact, if you have any significant p-values (relevant to the question or not) you are on a good path to publication, because reviewers are impressed with p<0.05 somewhere, anywhere, in a table. That's why you will see huge tables in psychology articles, with tons and tons of p-values; the sheer force of low p-values spread out over a gigantic table can convince the reviewer to accept the paper, even though only a single cell among dozens or hundreds in that table is actually testing the hypothesis. You can rely on the fact that nobody will think to ask whether power was low (the answer is usually yes), and how many comparisons were done.

Here are some examples of successes and failures, i.e., situations where we honestly reported what we found and were either summarily rejected or (perhaps surprisingly) accepted.

For example, in the following paper,

Shravan Vasishth, Daniela Mertzen, Lena A. Jäger, and Andrew Gelman. The statistical significance filter leads to overoptimistic expectations of replicability. Journal of Memory and Language, 103:151-175, 2018.

I wrote the following conclusion:

"In conclusion, in this 100-participant study we don’t see any grounds for claiming an interaction between Load and Distance. The most that we can conclude is that the data are consistent with memory-based accounts such as the Dependency Locality Theory (Gibson, 2000), which predict increased processing difficulty when subject-verb distance is increased. However, this Distance effect yields estimates that are also consistent with our posited null region; so the evidence for the Distance effect cannot be considered convincing."

Normally, such a tentative statement would lead to a rejection. E.g., here is a statement in another paper that led to a desk rejection (same editor) in the same journal where the above paper was published:

"In sum, taken together, Experiment 1 and 2 furnish some weak evidence for an interference effect, and only at the embedded auxiliary verb."

We published the above (rejected) paper in Cognitive Science instead.

In another example, both the key effects discussed in this paper would have technically been non-significant had we done a frequentist analysis. The fact that we interpreted the Bayesian credible intervals with reference to a model's quantitative predictions doesn't change that detail. However, the paper was accepted:

Lena A. Jäger, Daniela Mertzen, Julie A. Van Dyke, and Shravan Vasishth. Interference patterns in subject-verb agreement and reflexives revisited: A large-sample study. Journal of Memory and Language, 111, 2020.

In the above paper, we were pretty clear about the fact that we didn't manage to achieve high enough power even in our large-sample study: Table A1 shows that for the critical effect we were studying, we probably had power between 25 and 69 percent, which is not dramatically high.

There are many other such examples from my lab, of papers accepted despite tentative claims, and papers rejected because of tentative claims. In spite of the rejections, my plan is to continue telling the story like it is, with a limitations section. My hope is that editors will eventually understand the following point:

Almost no paper in psycholinguistics is going to give you a decisive result (it doesn't matter what the p-values are). So, rejecting a paper on the grounds that it isn't reporting a conclusive result is based on a misunderstanding about what we learnt from that paper. We almost never have conclusive results, even when we claim we do. Once people realize that, they will become more comfortable accepting more realistic conclusions from data.