Monday, August 09, 2021

A common mistake in psychology and psycholinguistic papers: Subsetting data to carry out an analysis

A Common Mistake in Data Analysis (in Psychology/Linguistics): Subsetting data to carry out nested analyses (Part 1 of 2)

tl;dr

If you subset the data to analyze effects within one level of a two- or three-level factor, you will usually get misleading results in your null hypothesis significance test. The reason: by subsetting data, you are artificially reducing and/or misestimating the different sources of variance.

To understand how to do these kinds of analyses correctly, read:

Daniel J. Schad, Shravan Vasishth, Sven Hohenstein, and Reinhold Kliegl. How to capitalize on a priori contrasts in linear (mixed) models: A tutorial. Journal of Memory and Language, 110, 2020. Code: https://osf.io/7ukf6/

Introduction

A very common mistake I see in psycholinguistics and psychology papers is subsetting the data to carry out an analysis. The reason people do this is so that they can use canned repeated measures ANOVA functions. However, such subsetting has some very interesting consequences: effects that may not actually be statistically significant will become significant. This mistake has the potential to seriously mislead people (and that’s the majority of psychologists and psycholinguists) who develop theories exclusively based on whether an effect is statistically significant or not.

Of course, using significance as a criterion for developing theory is usually a nonsensical thing to do in the first place, but let’s ignore that issue for now and buy into the fiction that finding significance is a meaningful activity.

I will discuss two examples; the first in this post, and the second in the next post (coming soon). In both examples, I should stress that there is no implication that the authors did anything dishonest; they did their analyses in good faith. The broader problem is that in psychology and linguistics, we are rarely taught much about data analysis. We usually learn a canned cookbook style of analysis. As a consequence, we often end up ignoring model assumptions, with fatal consequences. Ten years ago, I would probably have made the same mistakes as in the two data sets below.

To the credit of the authors, they released all their data into the public domain; that is a huge thing. My experience is that only about 25% of researchers release their data; most people outright refuse (sometimes very rudely! :) to make the data available.

Example 1: Swets et al 2008, in Memory and Cognition

The paper we consider first is:

Swets, B., Desmet, T., Clifton, C., & Ferreira, F. (2008). Underspecification of syntactic ambiguities: Evidence from self-paced reading. Memory & Cognition, 36(1), 201-216.

This paper is an influential and important one in psycholinguistics. It has been cited some 263 times according to Google Scholar. The central claim that the paper makes is that when a sentence has a globally ambiguous syntactic attachment, reading times (measured using the self-paced reading method) are shorter than in unambiguous baseline conditions when the language comprehension task is superficial. When the comprehension task involves deep processing, this ambiguity advantage disappears. The experiment design is as follows:

There are three syntactic attachment types (a within subjects factor):

Ambiguous The maid of the princess who scratched herself in public was terribly humiliated.
N1 attachment The son of the princess who scratched himself in public was terribly humiliated.
N2 attachment The son of the princess who scratched herself in public was terribly humiliated.

The region where the interesting action happens is the post-critical region: the phrase "in public" that follows the reflexive (himself/herself).

The second factor, question type (qtype), is a between-subjects factor that also has three levels. After reading each sentence, different groups of subjects either answered questions about the relative clause (RC questions; this is the deep processing condition), answered superficial questions, or were asked questions only occasionally.

Thus, this is a 3x3 factorial design, with one within-subjects factor (called attachment), and one between-subjects factor (called qtype).
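As a minimal sketch of this design, the nine cells can be enumerated in base R. The level labels are the ones that appear in the data frame shown later in this post; the object name design is just an illustrative choice.

```r
## Enumerate the 3x3 design: one row per combination of the
## within-subjects factor (attachment) and the between-subjects
## factor (qtype).
design <- expand.grid(
  attachment = c("ambiguous", "N1 attachment", "N2 attachment"),
  qtype = c("RC questions", "superficial", "occasional"),
  stringsAsFactors = FALSE
)
## Each subject sees all three attachment levels,
## but only one of the three question types.
design$manipulated <- "attachment: within subjects; qtype: between subjects"
nrow(design)
xtabs(~ attachment + qtype, design)
```

Every cell of the cross-tabulation contains exactly one row, which is what a fully crossed 3x3 design looks like before subjects and items are added.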

We expect an interaction between the attachment and qtype factors. Let’s see how the evidence for this interaction was reported in the paper, and where things go wrong.

First, load the data:

## install from: https://github.com/bnicenboim/bcogsci as follows:
## # install.packages("devtools")
## devtools::install_github("bnicenboim/bcogsci")
library(bcogsci)
data("df_swets08")

The data frame for the post-critical region looks like this:

head(df_swets08)

##       item subj resp.RT        qtype    attachment   RT
## 41473    1    6    2089 RC questions N2 attachment 2379
## 41474    1  104    1831   occasional     ambiguous  946
## 41475    1   94    2252 RC questions N1 attachment 1083
## 41476    1  150    4941 RC questions N1 attachment 1342
## 41477    1  132    6954 RC questions N1 attachment 1489
## 41478    1  103     472   occasional     ambiguous 1400

The dependent measure is RT (reading time); resp.RT is the question response time. We will ignore the latter measure here.

A barplot shows the expected interaction pattern:

means<-round(with(df_swets08,tapply(RT,
                                    IND=list(attachment,qtype),mean)))
barplot(means,beside=TRUE)

It does look like the qtype x attachment interaction will hold up: the relative heights of the bars seem to differ between the three groups of bars (one group per question type).

In preparation for a linear mixed models analysis, we set up orthogonal contrast coding (Helmert contrasts). The idea here is to compare the following groups of conditions:

The ambiguous vs the unambiguous conditions (ambig)
The two unambiguous conditions (att)
The deep vs the shallow question types (depth)
The two shallow question types (shallow)

## helmert coding for attachment:
df_swets08$ambig<-ifelse(df_swets08$attachment=="ambiguous",2,-1)
df_swets08$att<-ifelse(df_swets08$attachment=="N2 attachment",-1,
                 ifelse(df_swets08$attachment=="N1 attachment",1,
                        0))
## helmert coding for depth of processing:
df_swets08$depth<-ifelse(df_swets08$qtype=="RC questions",2,-1)
df_swets08$shallow<-ifelse(df_swets08$qtype=="occasional",-1,
                     ifelse(df_swets08$qtype=="superficial",1,0))

This gives us several new columns, which will be used to fit a linear mixed model:

head(df_swets08)

##       item subj resp.RT        qtype    attachment   RT ambig att depth shallow
## 41473    1    6    2089 RC questions N2 attachment 2379    -1  -1     2       0
## 41474    1  104    1831   occasional     ambiguous  946     2   0    -1      -1
## 41475    1   94    2252 RC questions N1 attachment 1083    -1   1     2       0
## 41476    1  150    4941 RC questions N1 attachment 1342    -1   1     2       0
## 41477    1  132    6954 RC questions N1 attachment 1489    -1   1     2       0
## 41478    1  103     472   occasional     ambiguous 1400     2   0    -1      -1

## sanity check: is the contrast coding correct?
xtabs(~attachment+ambig,df_swets08)

##                ambig
## attachment        -1    2
##   ambiguous        0 1728
##   N1 attachment 1728    0
##   N2 attachment 1728    0

xtabs(~attachment+att,df_swets08)

##                att
## attachment        -1    0    1
##   ambiguous        0 1728    0
##   N1 attachment    0    0 1728
##   N2 attachment 1728    0    0

xtabs(~qtype+depth,df_swets08)

##               depth
## qtype            -1    2
##   occasional   1728    0
##   RC questions    0 1728
##   superficial  1728    0

xtabs(~qtype+shallow,df_swets08)

##               shallow
## qtype            -1    0    1
##   occasional   1728    0    0
##   RC questions    0 1728    0
##   superficial     0    0 1728

We will use this coding below.
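A quick way to see that this coding is orthogonal is to write the contrasts as a small matrix and check that the columns sum to zero and that their cross-product has zeros off the diagonal. The following is only a sketch for the attachment factor, using the values assigned in the code above; the matrix name helmert_attachment is an illustrative choice, and the same check works for depth and shallow.

```r
## Rows: levels of attachment; columns: the two contrasts.
helmert_attachment <- rbind(
  "ambiguous"     = c(ambig =  2, att =  0),
  "N1 attachment" = c(ambig = -1, att =  1),
  "N2 attachment" = c(ambig = -1, att = -1)
)
## Centered contrasts: each column sums to zero.
colSums(helmert_attachment)
## Orthogonal contrasts: off-diagonal entries of X'X are zero.
crossprod(helmert_attachment)
```

Because the design is balanced (1728 rows per level in the xtabs output above), orthogonality of the contrast matrix carries over to the corresponding columns in the data frame.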

OK, now we are ready to go. First, the standard ANOVA analysis, then the LMM analysis.

Investigating the higher-order interaction using ANOVA vs LMMs

Next, we use a repeated measures ANOVA and then fit a linear mixed model, looking at main effects and interactions. First, we fit a model with raw reading times (this is obviously the wrong thing to do, but that’s the dependent measure used in the published paper).

ANOVA analysis for the higher order interaction

bysubjdf_swets08<-aggregate(RT~subj+attachment + 
                        qtype,mean,data=df_swets08)
library(rstatix)

## 
## Attaching package: 'rstatix'

## The following object is masked from 'package:stats':
## 
##     filter

res_anova<-anova_test(data = bysubjdf_swets08, 
           dv = RT, 
           wid = subj,
           between = qtype, 
           within = attachment
  )
get_anova_table(res_anova)

## ANOVA Table (type II tests)
## 
##             Effect  DFn    DFd     F        p p<.05   ges
## 1            qtype 2.00 141.00 5.290 0.006000     * 0.054
## 2       attachment 1.80 253.18 8.496 0.000458     * 0.014
## 3 qtype:attachment 3.59 253.18 2.972 0.024000     * 0.010

This looks great; we have the expected interaction. But if we log-transform the aggregated data, the interaction is gone!!!

bysubjdf_swets08$logrt<-log(bysubjdf_swets08$RT)

res_anovalog<-anova_test(data = bysubjdf_swets08, 
           dv = logrt, 
           wid = subj,
           between = qtype, 
           within = attachment
  )
get_anova_table(res_anovalog)

## ANOVA Table (type II tests)
## 
##             Effect DFn DFd      F        p p<.05   ges
## 1            qtype   2 141  5.163 7.00e-03     * 0.057
## 2       attachment   2 282 10.158 5.49e-05     * 0.013
## 3 qtype:attachment   4 282  2.148 7.50e-02       0.005

The effect disappears because the significant interaction is due to a few extreme values, which the log transform down-weights.
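To illustrate what down-weighting means here, consider a toy vector of reading times with one extreme value. The numbers below are made up for illustration and are not taken from the Swets et al data.

```r
## Hypothetical reading times in ms, with one extreme value.
rt <- c(300, 350, 400, 450, 5000)
## The mean on the raw scale is pulled strongly towards the extreme value.
mean_raw <- mean(rt)
## The mean on the log scale, back-transformed to ms (the geometric mean),
## is much less affected by the extreme value.
mean_log_backtransformed <- exp(mean(log(rt)))
c(raw = mean_raw, log_scale = mean_log_backtransformed, median = median(rt))
```

On the raw scale, a handful of such values in one condition can be enough to produce an apparent difference between condition means; on the log scale their influence shrinks considerably.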

This is really bad news, because it means that there is really no evidence in this paper for an ambiguity advantage.

Now, if you are a psychologist, you are probably feeling outraged: “Hey, cognition happens on the millisecond scale!!! You cannot log-transform the data!”. To which I would respond: (a) the Normal likelihood model you assume will predict negative reading times; are you OK with that prediction?, and (b) try explaining your logic to a real statistician (good luck, you will need it). For me, it’s amusing to watch people hold forth confidently on the importance of not log-transforming reading time data.
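Point (a) can be made concrete with a small sketch. The mean and standard deviation below are hypothetical values chosen only for illustration (they are not estimates from these data), as is the log-scale standard deviation used for the log-normal comparison.

```r
## Hypothetical mean and standard deviation of reading times (ms).
mu <- 600
sigma <- 400
## Under a Normal likelihood, the probability of a reading time below 0 ms:
p_negative_normal <- pnorm(0, mean = mu, sd = sigma)
## Under a log-normal likelihood (i.e., a Normal on the log scale),
## negative or zero reading times have probability zero.
## sdlog = 0.5 is an arbitrary illustrative value.
p_negative_lognormal <- plnorm(0, meanlog = log(mu), sdlog = 0.5)
c(normal = p_negative_normal, lognormal = p_negative_lognormal)
```

With these particular values, the Normal model puts a non-trivial amount of probability mass on impossible negative reading times, whereas the log-normal model puts none there.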

Linear mixed models analysis for the higher order interaction

Next, we fit a linear mixed model. For the Swets et al claim to hold up, there would have to be an interaction between ambig (whether the RC attachment is ambiguous or not) and depth (whether the question type was deep or not).

There is no such interaction, even when one fits the simplest linear mixed models of all (varying intercepts only).

library(lme4)

## Loading required package: Matrix

m1<-lmer(RT ~ (ambig+att)*(depth + shallow) + (1|subj)+
          (1|item),df_swets08)

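## the same model written out term by term (note that this version
## omits the main effect of att, so it has one fewer fixed effect):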
m1<-lmer(RT~ambig+depth + ambig:depth +att:depth + shallow+ ambig:shallow + att:shallow + (1|subj)+
          (1|item),df_swets08)

m1NULL<-lmer(RT~ambig+depth + #ambig:depth 
             att:depth + shallow+ ambig:shallow + att:shallow + (1|subj)+
          (1|item),df_swets08)

anova(m1,m1NULL)

## refitting model(s) with ML (instead of REML)

## Data: df_swets08
## Models:
## m1NULL: RT ~ ambig + depth + att:depth + shallow + ambig:shallow + att:shallow + (1 | subj) + (1 | item)
## m1: RT ~ ambig + depth + ambig:depth + att:depth + shallow + ambig:shallow + att:shallow + (1 | subj) + (1 | item)
##        npar   AIC   BIC logLik deviance  Chisq Df Pr(>Chisq)
## m1NULL   10 81342 81408 -40661    81322                     
## m1       11 81344 81416 -40661    81322 0.2263  1     0.6343
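The p-value in the last column comes from a likelihood ratio test: the difference in deviance between the two models (the Chisq column) is compared to a chi-squared distribution whose degrees of freedom equal the difference in the number of parameters. As a sketch, the reported p-value can be reproduced by hand from the printed values using base R:

```r
## Values copied from the anova() output above.
chisq <- 0.2263       # difference in deviance between m1NULL and m1
df_diff <- 11 - 10    # difference in number of parameters (npar)
## Upper tail probability of the chi-squared distribution:
p_lrt <- pchisq(chisq, df = df_diff, lower.tail = FALSE)
p_lrt   # agrees with the 0.6343 printed above, up to rounding of Chisq
```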

There is a better analysis, on the log scale, but there is still no evidence for an interaction. I skip that analysis here.

So, even with a raw RT analysis, there is no evidence for an ambig:depth interaction in these data. This is what usually happens to me when I analyze published data; I can only rarely get to the same conclusion as in the published paper.

But this was just a sanity check, let’s get to the subset analysis next. That’s the main issue I want to discuss here.

Subset analyses

The next thing to look at is whether there is an effect of ambiguity nested within the question types: within RC questions vs the non-RC questions, is there an effect of ambiguity?

In the paper, the authors make the following claims:

“…in the superficial question conditions, participants read ambiguous sentences faster than disambiguated sentences, and no reading time differences were observed for N1 versus N2 disambiguation.”

For this we need a nested contrast coding: within RC questions, the effect of ambiguity and attachment, and within each of the other question types, the effect of ambiguity and attachment.

Question type:     RC    RC    RC Super Super Super   Occ   Occ   Occ
Sentence type:      A    N1    N2     A    N1    N2     A    N1    N2
RCambig             2    -1    -1     0     0     0     0     0     0
RCatt               0     1    -1     0     0     0     0     0     0
Sambig              0     0     0     2    -1    -1     0     0     0
Satt                0     0     0     0     1    -1     0     0     0
Oambig              0     0     0     0     0     0     2    -1    -1
Oatt                0     0     0     0     0     0     0     1    -1
RC                  2     2     2    -1    -1    -1    -1    -1    -1
NonRC               0     0     0     1     1     1    -1    -1    -1

Here, we have three pairs of nested comparisons, one pair for each of the three question types (RC (relative clause) questions, S(uperficial), O(ccasional)): the ambiguity effect (the ambiguous condition vs the mean of N1/N2 attachment), and the N1 vs. N2 attachment effect. The contrast RC compares the RC question type with the average of the other two question types, and NonRC compares the superficial and occasional question type conditions.

Here is the nested coding:

## ambiguity within RC questions: ambiguous = 2, N1/N2 attachment = -1
df_swets08$RCambig<-ifelse(df_swets08$qtype=="RC questions" & df_swets08$attachment=="ambiguous", 2,
             ifelse(df_swets08$qtype=="RC questions" & 
                      df_swets08$attachment!="ambiguous", -1,0))
## attachment within RC questions: N1 attachment = 1, N2 attachment = -1
df_swets08$RCatt<-ifelse(df_swets08$qtype=="RC questions" & df_swets08$attachment=="N1 attachment", 1,
             ifelse(df_swets08$qtype=="RC questions" & 
                      df_swets08$attachment=="N2 attachment", -1,0))

## the same two contrasts within superficial questions:
df_swets08$Sambig<-ifelse(df_swets08$qtype=="superficial" & df_swets08$attachment=="ambiguous", 2,
             ifelse(df_swets08$qtype=="superficial" & 
                      df_swets08$attachment!="ambiguous", -1,0))
df_swets08$Satt<-ifelse(df_swets08$qtype=="superficial" & df_swets08$attachment=="N1 attachment", 1,
             ifelse(df_swets08$qtype=="superficial" & 
                      df_swets08$attachment=="N2 attachment", -1,0))

## the same two contrasts within occasional questions:
df_swets08$Oambig<-ifelse(df_swets08$qtype=="occasional" & df_swets08$attachment=="ambiguous", 2,
             ifelse(df_swets08$qtype=="occasional" & 
                      df_swets08$attachment!="ambiguous", -1,0))
df_swets08$Oatt<-ifelse(df_swets08$qtype=="occasional" & df_swets08$attachment=="N1 attachment", 1,
             ifelse(df_swets08$qtype=="occasional" & 
                      df_swets08$attachment=="N2 attachment", -1,0))
## question type contrasts:
df_swets08$RC<-ifelse(df_swets08$qtype=="RC questions",2,-1)
df_swets08$NonRC<-ifelse(df_swets08$qtype=="superficial",1,
                ifelse(df_swets08$qtype=="occasional",-1,0))
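As a sanity check on the intended nested coding (ambiguous = 2 and N1/N2 = -1 for the ambiguity contrasts; N1 = 1 and N2 = -1 for the attachment contrasts), the eight contrasts can be built on a nine-row table of design cells and checked for orthogonality. This is only a sketch: the helper functions ambig_within and att_within, and the objects cells and X, are hypothetical names introduced here for illustration.

```r
## One row per design cell.
cells <- expand.grid(
  attachment = c("ambiguous", "N1 attachment", "N2 attachment"),
  qtype = c("RC questions", "superficial", "occasional"),
  stringsAsFactors = FALSE
)
## Ambiguity contrast nested within question type q (0 elsewhere).
ambig_within <- function(q) {
  ifelse(cells$qtype == q,
         ifelse(cells$attachment == "ambiguous", 2, -1), 0)
}
## N1 vs N2 attachment contrast nested within question type q (0 elsewhere).
att_within <- function(q) {
  ifelse(cells$qtype == q & cells$attachment == "N1 attachment", 1,
         ifelse(cells$qtype == q & cells$attachment == "N2 attachment", -1, 0))
}
X <- cbind(
  RCambig = ambig_within("RC questions"), RCatt = att_within("RC questions"),
  Sambig  = ambig_within("superficial"),  Satt  = att_within("superficial"),
  Oambig  = ambig_within("occasional"),   Oatt  = att_within("occasional"),
  RC      = ifelse(cells$qtype == "RC questions", 2, -1),
  NonRC   = ifelse(cells$qtype == "superficial", 1,
                   ifelse(cells$qtype == "occasional", -1, 0))
)
## Centered: every contrast sums to zero over the nine cells.
colSums(X)
## Orthogonal: all off-diagonal entries of X'X are zero.
crossprod(X)
```

The same check can be run on the real data with xtabs(), as was done for the Helmert coding earlier; a contrast column that never takes the value -1 for N2 attachment would show up immediately.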

ANOVA analysis (incorrect)

The way Swets et al analyzed the data was by subsetting the data to the superficial-questions condition. But this approach drastically changes the amount of data available for computing the most important variance component: the standard deviation estimate of the residuals. The aggregation also wipes out by-item variance (although the authors did do a by-item analysis, that is still not good enough, because we need both variance components, by subject and by item, in the model simultaneously; otherwise we will underestimate the variance).

## note the ==; writing qtype="superficial" would silently keep all rows
superficial<-subset(df_swets08,qtype=="superficial")

bysubjsup<-aggregate(RT~subj+attachment,mean,
                     data=superficial)
res_anovasup<-anova_test(data = bysubjsup, 
           dv = RT, 
           wid = subj,
           within = attachment
  )
get_anova_table(res_anovasup)

## ANOVA Table (type III tests)
## 
##       Effect  DFn    DFd     F        p p<.05   ges
## 1 attachment 1.77 253.04 8.268 0.000595     * 0.013

Here, we get a significant effect of attachment in the superficial conditions. Looks good, right? Wrong.

Analysis using LMMs: subsetted vs full data comparison

Here is the analysis with the full data using nested coding. I fit the most complex model that converged.

m_nested<-lmer(RT~RCambig+RCatt+Sambig+Satt+Oambig+Oatt+
                 RC+NonRC+(1+RCambig+RCatt||subj)+
                 (1+RCambig+RCatt||item),df_swets08)
#summary(m_nested)

## ANOVA test on the overall effect of ambiguity in Superficial:
m_nestedNULL<-lmer(RT~RCambig+RCatt+Satt+Oambig+Oatt+
                 RC+NonRC+(1+RCambig+RCatt||subj)+
                   (1+RCambig+RCatt||item),df_swets08)
anova(m_nested,m_nestedNULL)

## refitting model(s) with ML (instead of REML)

## Data: df_swets08
## Models:
## m_nestedNULL: RT ~ RCambig + RCatt + Satt + Oambig + Oatt + RC + NonRC + ((1 | subj) + (0 + RCambig | subj) + (0 + RCatt | subj)) + ((1 | item) + (0 + RCambig | item) + (0 + RCatt | item))
## m_nested: RT ~ RCambig + RCatt + Sambig + Satt + Oambig + Oatt + RC + NonRC + ((1 | subj) + (0 + RCambig | subj) + (0 + RCatt | subj)) + ((1 | item) + (0 + RCambig | item) + (0 + RCatt | item))
##              npar   AIC   BIC logLik deviance  Chisq Df Pr(>Chisq)
## m_nestedNULL   15 81305 81404 -40638    81275                     
## m_nested       16 81305 81410 -40636    81273 2.5734  1     0.1087

We get a p-value of 0.11!! The effect of ambiguity within superficial conditions is no longer significant!!

Now, suppose we had subset the data to superficial questions only. Let’s redo the above analysis, but subsetting the data:

m_nestedsubset<-lmer(RT~Sambig+Satt+(1|subj)+
                 (1|item),subset(df_swets08,qtype=="superficial"))

## ANOVA test on the overall effect of ambiguity in Superficial:
m_nestedsubsetNULL<-lmer(RT~Satt +(1|subj)+
                   (1|item),subset(df_swets08,qtype=="superficial"))
anova(m_nestedsubset,m_nestedsubsetNULL)

## refitting model(s) with ML (instead of REML)

## Data: subset(df_swets08, qtype == "superficial")
## Models:
## m_nestedsubsetNULL: RT ~ Satt + (1 | subj) + (1 | item)
## m_nestedsubset: RT ~ Sambig + Satt + (1 | subj) + (1 | item)
##                    npar   AIC   BIC logLik deviance  Chisq Df Pr(>Chisq)  
## m_nestedsubsetNULL    5 25827 25854 -12908    25817                       
## m_nestedsubset        6 25823 25856 -12906    25811 5.4424  1    0.01965 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

When we subset the data the way Swets et al did, now we get a significant p-value of 0.02!!
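One ingredient of this discrepancy can be seen in a deliberately simplified simulation. The sketch below uses simulated data with hypothetical values (mean 1000 ms, standard deviation 300 ms, 50 independent observations per cell) and ordinary linear models without any subject or item random effects, so it is not a re-analysis of the Swets et al data. It only shows that the nested contrast estimate is the same in the full and the subsetted model, while the residual standard deviation and the residual degrees of freedom used to test it are not.

```r
set.seed(123)
cells <- expand.grid(
  attachment = c("ambiguous", "N1 attachment", "N2 attachment"),
  qtype = c("RC questions", "superficial", "occasional"),
  stringsAsFactors = FALSE
)
n_per_cell <- 50  # hypothetical number of observations per design cell
sim <- cells[rep(seq_len(nrow(cells)), each = n_per_cell), ]
sim$RT <- rnorm(nrow(sim), mean = 1000, sd = 300)  # no true effects

## Nested contrasts, coded as in the contrast table above.
nested_ambig <- function(d, q) {
  ifelse(d$qtype == q, ifelse(d$attachment == "ambiguous", 2, -1), 0)
}
nested_att <- function(d, q) {
  ifelse(d$qtype == q & d$attachment == "N1 attachment", 1,
         ifelse(d$qtype == q & d$attachment == "N2 attachment", -1, 0))
}
sim$RCambig <- nested_ambig(sim, "RC questions")
sim$RCatt   <- nested_att(sim, "RC questions")
sim$Sambig  <- nested_ambig(sim, "superficial")
sim$Satt    <- nested_att(sim, "superficial")
sim$Oambig  <- nested_ambig(sim, "occasional")
sim$Oatt    <- nested_att(sim, "occasional")
sim$RC      <- ifelse(sim$qtype == "RC questions", 2, -1)
sim$NonRC   <- ifelse(sim$qtype == "superficial", 1,
                      ifelse(sim$qtype == "occasional", -1, 0))

## Full data with nested contrasts vs. superficial subset only.
m_full <- lm(RT ~ RCambig + RCatt + Sambig + Satt + Oambig + Oatt +
               RC + NonRC, data = sim)
m_sub <- lm(RT ~ Sambig + Satt, data = subset(sim, qtype == "superficial"))

## Same estimate of the ambiguity effect within superficial questions...
c(full = coef(m_full)[["Sambig"]], subset = coef(m_sub)[["Sambig"]])
## ...but different residual degrees of freedom and residual SD estimates.
c(full = df.residual(m_full), subset = df.residual(m_sub))
c(full = sigma(m_full), subset = sigma(m_sub))
```

In a linear mixed model the situation is more involved, because the subject and item variance components are also estimated from a different amount of data, but the basic point is the same: the test of the same contrast rests on different variance estimates depending on whether the data were subsetted.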

Conclusion

If you subset the data to analyze effects within one level of a two- or three-level factor, you will usually get misleading results. The reason: by subsetting data, you are artificially reducing/misestimating the different sources of variance.

The scientific consequence of this subsetting error is that we have now drawn a misleading conclusion—we think we have evidence for underspecification, but there is no evidence here of such an effect. This does not mean that there is no underspecification. There might well be underspecification happening—we just don’t know from these data.

Sunday, August 08, 2021

Podcast interview with me in "Betancourting disaster"

Michael Betancourt is a major force in applied Bayesian statistics. Over the years, he has written a huge number of case studies and tutorials relating to practical aspects of Bayesian modeling using Stan. He has also lectured at our summer school on statistics, which is held annually at Potsdam. He also has a large collection of publicly available talks that are worth watching.

We have collaborated with Michael to produce two really important papers for cognitive scientists:

1. Daniel J. Schad, Michael Betancourt, and Shravan Vasishth. Toward a principled Bayesian workflow: A tutorial for cognitive science. Psychological Methods, 2020. Download here: https://arxiv.org/abs/1904.12765.

2. Daniel J. Schad, Bruno Nicenboim, Paul-Christian Bürkner, Michael Betancourt, and Shravan Vasishth. Workflow Techniques for the Robust Use of Bayes Factors. Available from arXiv:2103.08744v2, 2021. Download here: https://arxiv.org/abs/2103.08744.

He has a podcast, called Betancourting disaster. Michael recently interviewed me, and we talked about the challenges associated with modeling cognitive processes (e.g., reading processes and their interaction with sentence comprehension). You can listen to the whole thing here (it's about an hour-long conversation):

https://www.patreon.com/posts/50550798

Tuesday, June 15, 2021

New paper (Vasishth and Gelman): How to embrace variation and accept uncertainty in linguistic and psycholinguistic data analysis

A new paper, just accepted in the journal Linguistics:

Download: https://psyarxiv.com/zcf8s/

Title: How to embrace variation and accept uncertainty in linguistic and psycholinguistic data analysis

Abstract: The use of statistical inference in linguistics and related areas like psychology typically involves a binary decision: either reject or accept some null hypothesis using statistical significance testing. When statistical power is low, this frequentist data-analytic approach breaks down: null results are uninformative, and effect size estimates associated with significant results are overestimated. Using an example from psycholinguistics, several alternative approaches are demonstrated for reporting inconsistencies between the data and a theoretical prediction. The key here is to focus on committing to a falsifiable prediction, on quantifying uncertainty statistically, and learning to accept the fact that—in almost all practical data analysis situations—we can only draw uncertain conclusions from data, regardless of whether we manage to obtain statistical significance or not. A focus on uncertainty quantification is likely to lead to fewer excessively bold claims that, on closer investigation, may turn out to be not supported by the data.

Friday, May 14, 2021

New Psych Review paper by Max Rabe et al: A Bayesian approach to dynamical modeling of eye-movement control in reading of normal, mirrored, and scrambled texts

An important new paper by Max Rabe, a PhD student in the psychology department at Potsdam:

Open access pdf download: https://psyarxiv.com/nw2pb/

Reproducible code and data: https://osf.io/t9sbf/

Title: A Bayesian approach to dynamical modeling of eye-movement control in reading of normal, mirrored, and scrambled texts

Abstract: In eye-movement control during reading, advanced process-oriented models have been developed to reproduce behavioral data. So far, model complexity and large numbers of model parameters prevented rigorous statistical inference and modeling of interindividual differences. Here we propose a Bayesian approach to both problems for one representative computational model of sentence reading (SWIFT; Engbert et al., Psychological Review, 112, 2005, pp. 777–813). We used experimental data from 36 subjects who read the text in a normal and one of four manipulated text layouts (e.g., mirrored and scrambled letters). The SWIFT model was fitted to subjects and experimental conditions individually to investigate between-subject variability. Based on posterior distributions of model parameters, fixation probabilities and durations are reliably recovered from simulated data and reproduced for withheld empirical data, at both the experimental condition and subject levels. A subsequent statistical analysis of model parameters across reading conditions generates model-driven explanations for observable effects between conditions.

Sunday, May 09, 2021

Two important new papers from my lab on lossy compression, encoding, and retrieval interference

My student Himanshu Yadav is on a roll; he has written two very interesting papers investigating alternative models of similarity-based interference.

The first one will appear in the Cognitive Science proceedings:

Download: https://psyarxiv.com/76aex/

Title: Feature encoding modulates cue-based retrieval: Modeling interference effects in both grammatical and ungrammatical sentences

Abstract: Studies on similarity-based interference in subject-verb number agreement dependencies have found a consistent facilitatory effect in ungrammatical sentences but no conclusive effect in grammatical sentences. Existing models propose that interference is caused either by a faulty representation of the input (encoding-based models) or by difficulty in retrieving the subject based on cues at the verb (retrieval-based models). Neither class of model captures the observed patterns in human reading time data. We propose a new model that integrates a feature encoding mechanism into an existing cue-based retrieval model. Our model outperforms the cue-based retrieval model in explaining interference effect data from both grammatical and ungrammatical sentences. These modeling results yield a new insight into sentence processing, encoding modulates retrieval. Nouns stored in memory undergo feature distortion, which in turn affects how retrieval unfolds during dependency completion.

The second paper will appear in the International Conference on Cognitive Modeling (ICCM) proceedings:

Download: https://psyarxiv.com/3et95/

Title: Is similarity-based interference caused by lossy compression or cue-based retrieval? A computational evaluation

Abstract: The similarity-based interference paradigm has been widely used to investigate the factors subserving subject-verb agreement processing. A consistent finding is facilitatory interference effects in ungrammatical sentences but inconclusive results in grammatical sentences. Existing models propose that interference is caused either by misrepresentation of the input (representation distortion-based models) or by mis-retrieval of the interfering noun phrase based on cues at the verb (retrieval-based models). These models fail to fully capture the observed interference patterns in the experimental data. We implement two new models under the assumption that a comprehender utilizes a lossy memory representation of the intended message when processing subject-verb agreement dependencies. Our models outperform the existing cue-based retrieval model in capturing the observed patterns in the data for both grammatical and ungrammatical sentences. Lossy compression models under different constraints can be useful in understanding the role of representation distortion in sentence comprehension.

Wednesday, April 21, 2021

Video recording of my talk at Stanford (April 20, 2021)

I gave a talk at Stanford on April 20, 2021. Here is the recording:

Tuesday, April 20, 2021

New paper in Cognitive Science (open access): A Computational Evaluation of Two Models of Retrieval Processes in Sentence Processing in Aphasia

An exciting new paper by my PhD student Paula Lissón

Download from here: https://onlinelibrary.wiley.com/doi/10.1111/cogs.12956

Code and data: https://osf.io/kdjqz/

Title: A Computational Evaluation of Two Models of Retrieval Processes in Sentence Processing in Aphasia

Authors: Paula Lissón, Dorothea Pregla, Bruno Nicenboim, Dario Paape, Mick L. van het Nederend, Frank Burchert, Nicole Stadie, David Caplan, Shravan Vasishth

Abstract:

Can sentence comprehension impairments in aphasia be explained by difficulties arising from dependency completion processes in parsing? Two distinct models of dependency completion difficulty are investigated, the Lewis and Vasishth (2005) activation‐based model and the direct‐access model (DA; McElree, 2000). These models' predictive performance is compared using data from individuals with aphasia (IWAs) and control participants. The data are from a self‐paced listening task involving subject and object relative clauses. The relative predictive performance of the models is evaluated using k‐fold cross‐validation. For both IWAs and controls, the activation‐based model furnishes a somewhat better quantitative fit to the data than the DA model. Model comparisons using Bayes factors show that, assuming an activation‐based model, intermittent deficiencies may be the best explanation for the cause of impairments in IWAs, although slowed syntax and lexical delayed access may also play a role. This is the first computational evaluation of different models of dependency completion using data from impaired and unimpaired individuals. This evaluation develops a systematic approach that can be used to quantitatively compare the predictions of competing models of language processing.

Sunday, April 18, 2021

New paper (to appear in Open Mind):

A postdoc in our lab, Dario Paape, has had a paper accepted in the MIT Press open access journal Open Mind, which is one of the few serious open access journals available as an outlet for psycholinguists (another is Glossa Psycholinguistics). Unlike many of the so-called open access journals out there, Open Mind is a credible journal, not least because of its editorial board (the editor in chief is none other than Ted Gibson). The review process was at least as thoughtful and thorough as what I have experienced at journals like Journal of Memory and Language (and definitely a notch above Cognition). I am hopeful that we as a community can break free from these for-profit publishers and move towards open access journals like Open Mind and Glossa Psycholinguistics.

Download preprint from here: https://psyarxiv.com/2ztgw/

Title: Does local coherence lead to targeted regressions and illusions of grammaticality?

Authors: Dario Paape, Shravan Vasishth, and Ralf Engbert

Abstract: Local coherence effects arise when the human sentence processor is temporarily misled by a locally grammatical but globally ungrammatical analysis ("The coach smiled at THE PLAYER TOSSED A FRISBEE by the opposing team"). It has been suggested that such effects occur either because sentence processing occurs in a bottom-up, self-organized manner rather than being under constant grammatical supervision (Tabor, Galantucci, & Richardson, 2004), or because local coherence can disrupt processing due to readers maintaining uncertainty about previous input (Levy, 2008). We report the results of an eye-tracking study in which subjects read German grammatical and ungrammatical sentences that either contained a locally coherent substring or not and gave binary grammaticality judgments. In our data, local coherence affected on-line processing immediately at the point of the manipulation. There was, however, no indication that local coherence led to illusions of grammaticality (a prediction of self-organization), and only weak, inconclusive support for local coherence leading to targeted regressions to critical context words (a prediction of the uncertain-input approach). We discuss implications for self-organized and noisy-channel models of local coherence.

New paper: Individual differences in cue-weighting in sentence comprehension: An evaluation using Approximate Bayesian Computation

My PhD student Himanshu Yadav has recently submitted this amazing paper for review to a journal. This is the first in a series of papers that we are working on relating to the important topic of individual-level variability in sentence processing, a topic of central concern in our Collaborative Research Center on variability at Potsdam.

Download the preprint from here: https://psyarxiv.com/4jdu5/

Title: Individual differences in cue-weighting in sentence comprehension: An evaluation using Approximate Bayesian Computation

Authors: Himanshu Yadav, Dario Paape, Garrett Smith, Brian Dillon, and Shravan Vasishth

Abstract: Cue-based retrieval theories of sentence processing assume that syntactic dependencies are resolved through a content-addressable search process. An important recent claim is that in certain dependency types, the retrieval cues are weighted such that one cue dominates. This cue-weighting proposal aims to explain the observed average behavior, but here we show that there is systematic individual-level variation in cue weighting. Using the Lewis and Vasishth cue-based retrieval model, we estimated individual-level parameters for processing speed and cue weighting using 13 published datasets; hierarchical Approximate Bayesian Computation (ABC) was used to estimate the parameters. The modeling reveals a nuanced picture of cue weighting: we find support for the idea that some participants weight cues differentially, but not all participants do. Only fast readers tend to have the higher weighting for structural cues, suggesting that reading proficiency might be associated with cue weighting. A broader achievement of the work is to demonstrate how individual differences can be investigated in computational models of sentence processing without compromising the complexity of the model.

Wednesday, March 31, 2021

New paper: The benefits of preregistration for hypothesis-driven bilingualism research

Download from: here

The benefits of preregistration for hypothesis-driven bilingualism research

Daniela Mertzen, Sol Lago and Shravan Vasishth

Preregistration is an open science practice that requires the specification of research hypotheses and analysis plans before the data are inspected. Here, we discuss the benefits of preregistration for hypothesis-driven, confirmatory bilingualism research. Using examples from psycholinguistics and bilingualism, we illustrate how non-peer reviewed preregistrations can serve to implement a clean distinction between hypothesis testing and data exploration. This distinction helps researchers avoid casting post-hoc hypotheses and analyses as confirmatory ones. We argue that, in keeping with current best practices in the experimental sciences, preregistration, along with sharing data and code, should be an integral part of hypothesis-driven bilingualism research.

Friday, March 26, 2021

Freshly minted professor from our lab: Prof. Dr. Titus von der Malsburg

One of my first PhD students, Titus von der Malsburg, has just been sworn in as a Professor of Psycholinguistics and Cognitive Modeling (tenure track assistant professor) at the Institute of Linguistics, University of Stuttgart in Germany. Stuttgart is one of the most exciting places to be in Germany for computationally oriented scientists.

Titus is the eighth professor coming out of my lab. He does very exciting work in psycholinguistics; check out his work here.

Wednesday, March 17, 2021

New paper: Workflow Techniques for the Robust Use of Bayes Factors

Workflow Techniques for the Robust Use of Bayes Factors

Download from: https://arxiv.org/abs/2103.08744

Daniel J. Schad, Bruno Nicenboim, Paul-Christian Bürkner, Michael Betancourt, Shravan Vasishth

Inferences about hypotheses are ubiquitous in the cognitive sciences. Bayes factors provide one general way to compare different hypotheses by their compatibility with the observed data. Those quantifications can then also be used to choose between hypotheses. While Bayes factors provide an immediate approach to hypothesis testing, they are highly sensitive to details of the data/model assumptions. Moreover it's not clear how straightforwardly this approach can be implemented in practice, and in particular how sensitive it is to the details of the computational implementation. Here, we investigate these questions for Bayes factor analyses in the cognitive sciences. We explain the statistics underlying Bayes factors as a tool for Bayesian inferences and discuss that utility functions are needed for principled decisions on hypotheses. Next, we study how Bayes factors misbehave under different conditions. This includes a study of errors in the estimation of Bayes factors. Importantly, it is unknown whether Bayes factor estimates based on bridge sampling are unbiased for complex analyses. We are the first to use simulation-based calibration as a tool to test the accuracy of Bayes factor estimates. Moreover, we study how stable Bayes factors are against different MCMC draws. We moreover study how Bayes factors depend on variation in the data. We also look at variability of decisions based on Bayes factors and how to optimize decisions using a utility function. We outline a Bayes factor workflow that researchers can use to study whether Bayes factors are robust for their individual analysis, and we illustrate this workflow using an example from the cognitive sciences. We hope that this study will provide a workflow to test the strengths and limitations of Bayes factors as a way to quantify evidence in support of scientific hypotheses. Reproducible code is available from this https URL.

Also see this interesting twitter thread on this paper by Michael Betancourt:

I believe this paper was initiated towards the end of drafting the Bayesian workflow in cognitive science paper with Daniel and @ShravanVasishth when I mentioned that many of the workflow ideas could be generalized to Bayes factor implementations with a little bit of work.
— \mathfrak{Michael "Shapes Dude" Betancourt} (@betanalpha) March 17, 2021

Monday, March 15, 2021

New paper: Is reanalysis selective when regressions are consciously controlled?

A new paper by Dr. Dario Paape; download from here: https://psyarxiv.com/gnehs

Abstract

The selective reanalysis hypothesis of Frazier and Rayner (1982) states that readers direct their eyes towards critical words in the sentence when faced with garden-path structures (e.g., Since Jay always jogs a mile seems like a short distance to him). Given the mixed evidence for this proposal in the literature, we investigated the possibility that selective reanalysis is tied to conscious awareness of the garden-path effect. To this end, we adapted the well-known self-paced reading paradigm to allow for regressive as well as progressive key presses. Assuming that regressions in such a paradigm are consciously controlled, we found no evidence for selective reanalysis, but rather for occasional extensive, heterogeneous rereading of garden-path sentences. We discuss the implications of our findings for the selective reanalysis hypothesis, the role of awareness in sentence processing, as well as the usefulness of the bidirectional self-paced reading method for sentence processing research.

Tuesday, March 09, 2021

Talk at Stanford (April 20, 2021): Dependency completion in sentence processing: Some recent computational and empirical investigations

Title: Dependency completion in sentence processing: Some recent computational and empirical investigations

When: April 20, 2021, 9PM German time

Where: zoom.

How to watch: https://linguistics.stanford.edu/events/dependency-completion-sentence-processing-some-recent-computational-and-empirical

Shravan Vasishth (vasishth.github.io)

Abstract:

Dependency completion processes in sentence processing have been intensively studied in psycholinguistics (e.g., Gibson, 2000). I will discuss some recent work (e.g., Yadav et al., 2021) on computational models of dependency completion as they relate to a class of effects, so-called interference effects (Jäger et al., 2017). Using antecedent-reflexive and subject-verb number dependencies as a case study (Jäger et al., 2020), I will discuss the evidence base for some of the competing theoretical claims relating to these phenomena. A common thread running through the talk will be that the well-known replication and statistical crisis in psychology and other areas (Nosek et al., 2015; Gelman and Carlin, 2014) is also unfolding in psycholinguistics and needs to be taken seriously (e.g., Vasishth et al., 2018).

References

Andrew Gelman and John Carlin (2014). Beyond power calculations: Assessing type S (sign) and type M (magnitude) errors. Perspectives on Psychological Science, 9(6), 641-651.

Edward Gibson (2000). The dependency locality theory: A distance-based theory of linguistic complexity. Image, Language, Brain, 2000, 95-126.

Lena A. Jäger, Felix Engelmann, and Shravan Vasishth (2017). Similarity-based interference in sentence comprehension: Literature review and Bayesian meta-analysis. Journal of Memory and Language, 94:316-339.

Lena A. Jäger, Daniela Mertzen, Julie A. Van Dyke, and Shravan Vasishth (2020). Interference patterns in subject-verb agreement and reflexives revisited: A large-sample study. Journal of Memory and Language, 111.

Brian A. Nosek and Open Science Collaboration (2015). Estimating the reproducibility of psychological science. Science, 349(6251), aac4716-aac4716.

Shravan Vasishth, Daniela Mertzen, Lena A. Jäger, and Andrew Gelman (2018). The statistical significance filter leads to overoptimistic expectations of replicability. Journal of Memory and Language, 103:151-175.

Shravan Vasishth and Felix Engelmann (2021). Sentence comprehension as a cognitive process: A computational approach. Cambridge University Press. In Press.

Himanshu Yadav, Garrett Smith, and Shravan Vasishth (2021). Feature encoding modulates cue-based retrieval: Modeling interference effects in both grammatical and ungrammatical sentences. Submitted.

Wednesday, March 03, 2021

Talk at Hong Kong Virtual Psycholinguistics Forum (VPF, 心理语言学线上论坛)

I'll be giving a talk at the Chinese University of Hong Kong.
When: 10 March 2021, 10AM Berlin time
Where: Zoom:
https://cuhk.zoom.us/j/779556638
https://cuhk.zoom.cn/j/779556638 (mainland China)
Title: Case and Agreement Attraction in Armenian: Experimental and Computational Investigations
Abstract: https://osf.io/3wn79/

Monday, February 22, 2021

Video recording of talk at Tuebingen: Individual differences in sentence processing

Here is the video recording of my talk from Feb 22, 2021:

Thursday, February 11, 2021

Talk in Tuebingen: Individual differences in cue-weighting in sentence comprehension: An evaluation using Approximate Bayesian Computation

When: Feb 22 2021
Where: Universität Tübingen, Seminar für Sprachwissenschaft
How: Zoom

[This is part of the PhD work of Himanshu Yadav, and the project is led by him. Co-authors: Dario Paape, Garrett Smith, and Brian Dillon.]

Abstract
Cue-based retrieval theories of sentence processing assume that syntactic dependencies are resolved through a content-addressable search process. An important recent claim is that in certain dependency types, the retrieval cues are weighted such that one cue dominates. This cue-weighting proposal aims to explain the observed average behavior. We show that there is systematic individual-level variation in cue weighting. Using the Lewis and Vasishth cue-based retrieval model, we estimated individual-level parameters for processing speed and cue weighting using data from 13 published reading studies; hierarchical Approximate Bayesian Computation (ABC) with Gibbs sampling was used to estimate the parameters. The modeling reveals a nuanced picture about cue-weighting: we find support for the idea that some participants weight cues, but not all do; and only fast readers tend to have the predicted cue weighting, suggesting that reading proficiency might be associated with cue weighting. A broader achievement of the work is to demonstrate how individual differences can be investigated in computational models of sentence processing using hierarchical ABC.

Tuesday, February 02, 2021

Bayesian statistics: A tutorial taught at Experimental Methods for Language Acquisition research (EMLAR XVII 2021)

Bayesian statistics

Taught by Shravan Vasishth (vasishth.github.io)

When: Sometime between 13 and 15 April 2021
Where: https://emlar.wp.hum.uu.nl/tutorial/bayesian-statistics/

Bayesian methods are increasingly becoming mainstream in psychology and psycholinguistics. However, finding an entry point into using these methods is often difficult for researchers. In this tutorial, I will provide an informal introduction to the fundamental ideas behind Bayesian statistics, using examples illustrating applications to psycholinguistics. I will also illustrate some of the advantages of the Bayesian approach over the standardly used frequentist paradigms: uncertainty quantification, robust estimates, the ability to incorporate expert and/or prior knowledge into the data analysis, and the ability to flexibly define the generative process and thereby to directly address the actual research question (as opposed to a straw-man null hypothesis). Suggestions for further readings will be provided.

References

Bruno Nicenboim, Daniel Schad, and Shravan Vasishth. Introduction to Bayesian Data Analysis for Cognitive Science. 2021. Under contract with Chapman and Hall/CRC Statistics in the Social and Behavioral Sciences Series. https://vasishth.github.io/bayescogsci/

Daniel J. Schad, Michael Betancourt, and Shravan Vasishth. Towards a principled Bayesian workflow: A tutorial for cognitive science. Psychological Methods, 2020. In Press. https://osf.io/b2vx9/

Shravan Vasishth, Daniela Mertzen, Lena A. Jäger, and Andrew Gelman. The statistical significance filter leads to overoptimistic expectations of replicability. Journal of Memory and Language, 103:151-175, 2018. https://www.sciencedirect.com/science/article/pii/S0749596X18300640?via%3Dihub

Shravan Vasishth, Bruno Nicenboim, Mary E. Beckman, Fangfang Li, and Eun Jong Kong. Bayesian data analysis in the phonetic sciences: A tutorial introduction. Journal of Phonetics, 71:141-161, 2018. https://osf.io/g4zpv/

Bruno Nicenboim and Shravan Vasishth. Statistical methods for linguistic research: Foundational Ideas - Part II. Language and Linguistics Compass, 10:591-613, 2016. https://onlinelibrary.wiley.com/doi/abs/10.1111/lnc3.12207

Saturday, January 16, 2021

Applications are open for the fifth summer school in statistical methods for linguistics and psychology (SMLP)

The annual summer school, now in its fifth edition, will happen 6-10 Sept 2021, and will be conducted virtually over zoom. The summer school is free and is funded by the DFG through SFB 1287.
Instructors: Doug Bates, Reinhold Kliegl, Phillip Alday, Bruno Nicenboim, Daniel Schad, Anna Laurinavichyute, Paula Lisson, Audrey Buerki, Shravan Vasishth.
There will be four streams running in parallel: introductory and advanced courses on frequentist and Bayesian statistics. Details, including how to apply, are here.

Saturday, January 02, 2021

Should statistical data analysis in psychology be like defecating?

There was an interesting thread on twitter about linear mixed models (LMMs) that someone made me aware of recently. (I stopped following twitter because of its general inanity, but this thread is worth commenting on.) Below I give the gist of the complaints, recreated from memory; my list is an amalgamation of comments from different people. I think that the thread started here:

Inspired by @IrisVanRooij, I want to express some concerns that may be controversial and even outrageous to some but I feel we at least should have a discussion. I'm wondering if statistics in psycholinguistics could use a rethink. It feels like the tail now wags the dog.
— Fernanda Ferreira (@fernandaedi) December 20, 2020

To summarize the complaints:

- LMMs take too long to fit (cf. repeated measures ANOVA). This slows down student output.

- Too much time is spent on thinking about what the right analysis is.

- The interpretation of LMMs can change dramatically depending on which model you fit.

- Reviewers will always object to whatever analysis one does and demand a different one. Often which analysis one does doesn't matter as regards interpretation.

- The lme4 package exhibits all kinds of weird and unstable behavior. Should we trust its output?

- The focus has shifted away from substantive theoretical issues within psych* to statistical methods, but psych* people cannot be statisticians and can never know enough. This led to the colorful comment that doing statistics should be like taking a crap: it shouldn't become the center of your entire existence.

Indeed, a mathematical psychologist I know, someone who knows what they're doing, once told me that if you cannot answer your question with a paired t-test, you are asking the wrong question. In fact, if I go back to the data sets that I published between 2002 and 2020, almost all of them can be reasonably analyzed using a series of paired t-tests.

There is a presupposition that lies behind the above complaints: the purpose of data analysis is to find out whether an effect is significant or not. Once one understands that that's not the primary purpose of a statistical analysis, things start to make more sense. The problem is that it's just very hard to comprehend this point; this is because the idea of null hypothesis significance testing is very deeply entrenched in our minds. Walking away from it feels impossible.

Here are some thoughts about the above objections.

1. If you want the simplicity of paired t-tests and repeated measures ANOVA, absolutely go for it. But release your data and code, and be open to others analyzing your data differently. I think it's perfectly fine to spend your entire life doing just paired t-tests and publishing the resulting t and p-values. Of course, you are still fitting linear mixed models, but heavily simplified ones. Sometimes it won't matter whether you fit a complicated model or a simple one, but sometimes it will. It has happened to me that a paired t-test was exactly the wrong thing to do, and I spent a lot of time trying to model the data differently. Should one care about these edge cases? I think this is a subjective decision that each one of us has to make individually. Here is another example of a simple two-condition study where a complicated model that took forever to fit gave new insight into the underlying process generating the data. The problem here comes down to the goal of a statistical analysis. If we accept the premise that statistical significance is the goal, then we should just go ahead and fit that paired t-test. If, instead, the goal is to model the generative process, then you will start losing time. What position you take really depends on what you want to achieve.

2. There is no one right analysis, and reviewers will always object to whatever analysis you present. The reason that reviewers propose alternative analyses has nothing to do with the inherent flexibility of statistical methods. It has to do with academics being contrarians. I notice this in my own behavior: if my student does X, I want them to do Y!=X. If they do Y, I want them to do X!=Y. I suspect that academics are a self-selected lot, and one thing they are good at is objecting to whatever someone else says or does. So, the fact that reviewers keep asking for different analyses is just the price one has to pay for dealing with academics; it's not an inherent problem with statistics per se. Notice that reviewers also object to the logic of a paper, and to the writing. We are so used to dealing with those things that we don't realize it's the same type of reaction we are seeing to the statistical analyses.

3. If you want speed and still want to fit linear mixed models, use the right tools. There are plenty of ways to fit linear mixed models fast: rstanarm, LMMs in Julia, etc. E.g., Doug Bates, Phillip Alday, and Reinhold Kliegl taught a one-week course on fitting LMMs super fast in Julia: see here.

4. The interpretation of linear mixed models depends on model specification. This surprises many people, but the surprise is due to the fact that people have a very incomplete understanding of what they are doing. If you cannot be bothered to study linear mixed modeling theory (understandable, life is short), stick to paired t-tests.

5. lme4's unstable and weird behavior is problematic, but this is not enough reason to abandon linear mixed models. One has to admit that the weirdness of lme4's messages and its inconsistencies are really frustrating. Perhaps this is the price one has to pay for free software (although, having used non-free software like Word, SPSS, Excel, I'm not so sure there is any advantage). But the fact is that LMMs give you the power to incorporate variance components in a sensible way, and lme4 does the job, if you know what you are doing. Like any other instrument one thinks about using as a professional, if you can't be bothered to learn to use it, then just use some simpler method you do know how to use. E.g., I can't use fMRI; I don't have access to the equipment. I'm forced to work with simpler methods, and I have to live with that. If you want more control over your hierarchical models than lme4 provides, learn Stan. E.g., see our chapter on hierarchical models here.
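As a footnote to point 1, here is a minimal sketch of one sense in which a paired t-test is a heavily simplified model. It is written in Python using only the standard library, and the reading times, the by-subject offsets, and the function name paired_t are all made up for illustration. The paired t-test is just a one-sample t-test on the per-subject differences, and taking differences cancels any constant by-subject offset (the "slow reader" versus "fast reader" component).

```python
import math
import statistics


def paired_t(cond_a, cond_b):
    """Return (t, df) for a paired t-test, computed as a one-sample
    t-test on the per-subject differences cond_a - cond_b."""
    diffs = [a - b for a, b in zip(cond_a, cond_b)]
    n = len(diffs)
    mean_d = statistics.mean(diffs)
    # Standard error of the mean difference (sample SD, n - 1 denominator)
    se_d = statistics.stdev(diffs) / math.sqrt(n)
    return mean_d / se_d, n - 1


# Hypothetical data: six subjects, one mean reading time (ms) per condition.
# Each subject has their own overall speed (a constant offset).
subject_offsets = [0, 50, -30, 120, -80, 10]
noise_a = [5, -3, 8, 0, -6, 2]
noise_b = [-4, 6, 1, -7, 3, 5]
cond_a = [400 + s + e for s, e in zip(subject_offsets, noise_a)]
cond_b = [380 + s + e for s, e in zip(subject_offsets, noise_b)]

# t statistic on the data as observed
t_with_offsets, df = paired_t(cond_a, cond_b)

# t statistic after removing each subject's offset from both conditions
a_centered = [x - s for x, s in zip(cond_a, subject_offsets)]
b_centered = [x - s for x, s in zip(cond_b, subject_offsets)]
t_without_offsets, _ = paired_t(a_centered, b_centered)

print(f"t({df}) with subject offsets:    {t_with_offsets:.3f}")
print(f"t({df}) without subject offsets: {t_without_offsets:.3f}")
```

Both calls give the same t statistic, because the by-subject offsets drop out of the differences. What this simplified analysis does not do is separate out by-item variability or use the trial-level data, which is one place where a full linear mixed model can give a different answer.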

Personally, I think that it is possible to learn enough statistics to be able to use linear mixed models competently; one doesn't need to become a statistician. The curriculum I think one needs in psych and related areas is encapsulated in our summer school on statistical methods, which we run annually at Potsdam. It's a time commitment, but it's worth it. I have seen many people go from zero knowledge to fitting sophisticated hierarchical models, so I know that people can learn all this without it taking over their entire life.

Probably the biggest problem behind all these complaints is the misunderstanding surrounding null hypothesis significance testing. Unfortunately, p-values will rarely tell you anything useful, significant or not, unless you are willing to put in serious time and effort (the very thing people want to avoid doing). So it's really not going to matter much whether you compute them using paired t-tests or linear mixed models.
