This blog is a repository of cool things relating to statistical computing, simulation and stochastic modeling.
Thursday, September 28, 2023
Prof. Himanshu Yadav: Potsdam -> IIT Kanpur
Saturday, March 11, 2023
New paper: SEAM: An Integrated Activation-Coupled Model of Sentence Processing and Eye Movements in Reading.
Michael Betancourt, a giant in the field of Bayesian statistical modeling, once indirectly pointed out to me (in a podcast interview) that one should not try to model latent cognitive processes in reading by computing summary statistics like the mean difference between conditions and then fitting the model on those summary statistics. But that is exactly what we do in psycholinguistics. Most models (including those from my lab) evaluate model performance on summary statistics from the data (usually, a mean difference), abstracting away quite dramatically from the complex processes that resulted in those reading times and regressive eye movements.
What Michael wanted instead was a detailed process model of how the observed fixations and eye movement patterns arise. Obviously, such a model would be extremely complicated, because one would have to specify the full details of oculomotor processes and their impact on eye movements, as well as a model of language comprehension, and specify how these components interact to produce eye movements at the single trial level. This kind of model will quickly become computationally intractable if one tries to estimate the model parameters using data. So that's a major barrier to building such a model.
Interestingly, both eye movement control models and models of sentence comprehension exist, but these live in parallel universes. Psychologists have almost always focused on eye movement control, ignoring the impact of sentence comprehension processes (I once heard a talk in which a psychologist publicly called out psycholinguists, labeling them as "crazy" for studying language processing in reading :)). Similarly, most psycholinguists simply ignore the lower-level processes unfolding in reading, and assume that language processing events are responsible for differences in fixation durations or in leftward eye movements (regressions). The most that psycholinguists like me are willing to do is add word frequency and the like as co-predictors of reading time or other dependent measures when investigating reading. But in most cases even that would go too far :).
What is missing is a model that brings these two lines of work into one integrated reading model that co-determines where we move our eyes to and for how long.
Max Rabe, who is wrapping up his PhD work in psychology at Potsdam in Germany, demonstrates how this could be done: he takes a fully specified model of eye movement control in reading (SWIFT) and integrates into it linguistic dependency completion processes, following the principles of the cognitive architecture ACT-R. One key achievement is that the activation of a word being read is co-determined by both oculomotor processes, as specified in SWIFT, and cue-based retrieval processes, as specified in the activation-based model of retrieval. Another is to show how regressive eye movements are triggered when sentence processing difficulty (here, similarity-based interference) arises during reading.
What made the model fitting possible was Bayesian parameter estimation: Max Rabe shows in an earlier (2021) Psychological Review paper (preprint here) how parameter estimation can be carried out in complex models where the likelihood function may not be easy to work out.
Download the paper from arXiv.
Thursday, June 02, 2022
New paper in Journal of Memory and Language: Share the code, not just the data
Here is an important paper for the field of psycholinguistics that just came out in JML. It is led by Dr. Anna Laurinavichyute and was commissioned by the editor of JML (Prof. Kathy Rastle).
Share the code, not just the data: A case study of the reproducibility of articles published in the Journal of Memory and Language under the open data policy
Download here: https://doi.org/10.1016/j.jml.2022.104332
Friday, May 27, 2022
Summer School “Methods in Language Sciences” (16-20 August 2022, Ghent, Belgium): Registrations open
Saturday, April 16, 2022
Ever wondered how the probability of the null hypothesis being true changes given a significant result?
TRIGGER WARNING: These simulations might fundamentally shake your belief system. USE WITH CARE.
In a recently accepted paper in the open access journal Quantitative Methods for Psychology, led by Daniel Schad, we discuss how, using Bayes' rule, one can explore the change in the probability of the null hypothesis being true (call it theta) when you get a significant effect. The paper, which was inspired by a short comment in McElreath's book (first edition), shows that theta does not necessarily change much even if you get a significant result. The probability theta can change dramatically under certain conditions, but those conditions are either so stringent or so trivial that they render many of the significance-based conclusions in psychology and psycholinguistics questionable at the very least.
You can do your own simulations, under assumptions that you consider more appropriate for your own research problem, using this shiny app (below), or play with the source code: here.
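The underlying calculation is a one-line application of Bayes' rule. Here is a minimal sketch in Python (the 50-50 prior, Type I error rate, and power values below are illustrative assumptions, not the specific settings explored in the paper):

```python
def prob_null_given_sig(theta, alpha=0.05, power=0.80):
    """P(H0 is true | significant result), by Bayes' rule.

    theta: prior probability that H0 is true
    alpha: Type I error rate, i.e., P(significant | H0)
    power: P(significant | H1)
    """
    p_sig = alpha * theta + power * (1 - theta)
    return alpha * theta / p_sig

# With a 50-50 prior and a well-powered study, a significant
# result makes H0 quite improbable (about 0.06):
print(prob_null_given_sig(0.5))

# With only 10% power, the same significant result still leaves
# H0 with probability 1/3:
print(prob_null_given_sig(0.5, power=0.10))
```

The point of the paper is precisely this: under realistic (low-power) conditions, the posterior probability of the null can remain substantial even after a significant result.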
Thursday, March 31, 2022
New(ish) paper: Share the code, not just the data: A case study of the reproducibility of JML articles published under the open data policy
Here's an important new paper led by Dr. Anna Laurinavichyute on the reproducibility of published analyses. This paper was commissioned by the editor-in-chief of the Journal of Memory and Language, Kathy Rastle.
Title: Share the code, not just the data: A case study of the reproducibility of JML articles published under the open data policy
Abstract:
In 2019 the Journal of Memory and Language instituted an open data and code policy; this policy requires that, as a rule, code and data be released at the latest upon publication. How effective is this policy? We compared 59 papers published before, and 59 papers published after, the policy took effect. After the policy was in place, the rate of data sharing increased by more than 50%. We further looked at whether papers published under the open data policy were reproducible, in the sense that the published results should be possible to regenerate given the data, and given the code, when code was provided. For 8 out of the 59 papers, data sets were inaccessible. The reproducibility rate ranged from 34% to 56%, depending on the reproducibility criteria. The strongest predictor of whether an attempt to reproduce would be successful is the presence of the analysis code: it increases the probability of reproducing reported results by almost 40%. We propose two simple steps that can increase the reproducibility of published papers: share the analysis code, and attempt to reproduce one’s own analysis using only the shared materials.
PDF: here.
Wednesday, March 23, 2022
Short course and keynote on statistical methods at Ghent Summer School on Methods in Language Sciences
I will be teaching an in-person course on linear mixed modeling at the summer school at Ghent (below) August 2022.
The summer school home page: https://www.mils.ugent.be/
1. 2.5-day course: Introduction to linear mixed modelling for linguists
When and where: August 18, 19, 20, 2022 in Ghent.
Prerequisites and target audience
The target audience is graduate students in linguistics.
I assume familiarity with graphical descriptive summaries of data of the type encountered in linguistics; the most important theoretical distributions (normal, t, binomial, chi-squared); description of univariate and bivariate data (mean, variance, standard deviation, correlation, cross-tabulations); graphical presentation of univariate and bivariate/multivariate data (bar chart, histogram, boxplot, qq-plot, etc.); point estimators and confidence intervals for population averages with normal data or large samples; null hypothesis significance testing; and the t-test, chi-squared test, and simple linear regression. A basic knowledge of R is assumed.
Curriculum:
I will cover some important ideas relating to linear mixed models
and how they can be used in linguistics research. I will loosely follow
my textbook draft: https://vasishth.github.io/Freq_CogSci/
Topics to be covered:
- Linear mixed models: basic theory and applications
- Contrast coding
- Generalized Linear Mixed Models (binomial link)
- Using simulation for power analysis and for understanding one’s model
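The last topic on this list, using simulation for power analysis, can be sketched in a few lines. This toy example (in Python rather than the R used in the course; the 20 ms effect, 150 ms noise, and n = 40 are made-up numbers) estimates power for a simple two-condition comparison by brute force:

```python
import random
import statistics

def simulate_power(effect, sd, n, nsim=2000, crit=1.96, seed=1):
    """Estimate power for a two-condition design by simulation.

    Simulates nsim experiments; in each, draws n observations per
    condition, computes a z-statistic for the difference in means
    (treating sd as known), and counts rejections at |z| > crit.
    """
    rng = random.Random(seed)
    rejections = 0
    for _ in range(nsim):
        a = [rng.gauss(0.0, sd) for _ in range(n)]
        b = [rng.gauss(effect, sd) for _ in range(n)]
        se = sd * (2.0 / n) ** 0.5
        z = (statistics.mean(b) - statistics.mean(a)) / se
        if abs(z) > crit:
            rejections += 1
    return rejections / nsim

# A 20 ms effect with 150 ms noise and 40 subjects per condition:
# power comes out well below 50%
print(simulate_power(effect=20, sd=150, n=40))
```

Running such a simulation before collecting data makes it obvious when a planned design is hopelessly underpowered for the effect size one realistically expects.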
2. Keynote lecture
Using Bayesian Data Analysis in Language Research
Shravan Vasishth

Bayesian methods are becoming a standard part of the toolkit for psycholinguists, linguists, and psychologists. This transition has been sped up by the arrival of easy-to-use software like brms, a front-end for the probabilistic programming language Stan. In this talk, I will show how Bayesian analyses differ from frequentist analogues, focusing on the linear mixed model. I will illustrate the main advantages of Bayes: a direct, nuanced, and conservative answer to the research question at hand, flexible model specification, the ability to incorporate prior knowledge in the model, and a focus on uncertainty quantification.
References

Daniel J. Schad, Bruno Nicenboim, Paul-Christian Bürkner, Michael Betancourt, and Shravan Vasishth. Workflow Techniques for the Robust Use of Bayes Factors. Psychological Methods, 2022. https://doi.apa.org/doiLanding

Shravan Vasishth and Andrew Gelman. How to embrace variation and accept uncertainty in linguistic and psycholinguistic data analysis. Linguistics, 59:1311-1342, 2021. https://www.degruyter.com/docu
2019-0051/html

Shravan Vasishth. Some right ways to analyze (psycho)linguistic data. Submitted, 2022. https://osf.io/5wzyg/
New paper: Some right ways to analyze (psycho)linguistic data
New paper (under review):
Title: Some right ways to analyze (psycho)linguistic data
Abstract:
Much has been written on the abuse and misuse of statistical methods, including p-values, statistical significance, etc. I present some of the best practices in statistics using a running example data analysis. Focusing primarily on frequentist and Bayesian linear mixed models, I illustrate some defensible ways in which statistical inference—specifically, hypothesis testing using Bayes factors vs. estimation or uncertainty quantification—can be carried out. The key is to not overstate the evidence and to not expect too much from statistics. Along the way, I demonstrate some powerful ideas, the most important ones being using simulation to understand the design properties of one’s experiment before running it, visualizing data before carrying out a formal analysis, and simulating data from the fitted model to understand the model’s behavior.
PDF: https://psyarxiv.com/y54va/
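The last idea mentioned in the abstract, simulating data from the fitted model to understand the model's behavior, can be illustrated with a deliberately simple sketch: ordinary least squares instead of the much richer mixed models discussed in the paper, and entirely made-up numbers.

```python
import random
import statistics

random.seed(42)

# Hypothetical "observed" reading-time data: y = 300 + 25*x + noise
x = [i / 10 for i in range(50)]
y = [300 + 25 * xi + random.gauss(0, 30) for xi in x]

# Fit y = a + b*x by ordinary least squares
xbar, ybar = statistics.mean(x), statistics.mean(y)
b = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / \
    sum((xi - xbar) ** 2 for xi in x)
a = ybar - b * xbar
resid = [yi - (a + b * xi) for xi, yi in zip(x, y)]
resid_sd = statistics.stdev(resid)

# Simulate replicated datasets from the fitted model and compare a
# summary statistic (here, the maximum) of observed vs simulated data;
# a large mismatch would flag a model that misrepresents the data
sim_max = [max(a + b * xi + random.gauss(0, resid_sd) for xi in x)
           for _ in range(200)]
print("observed max:", round(max(y), 1))
print("mean simulated max:", round(statistics.mean(sim_max), 1))
```

If the observed statistic looks typical of the simulated ones, the model captures at least that aspect of the data; systematic discrepancies point to where the model fails.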
Summer School on Statistical Methods for Linguistics and Psychology, Sept. 12-16, 2022 (applications close April 1)
The application form closes April 1, 2022. We will announce the decisions on or around April 15, 2022.
Course fee: There is no fee because the summer school is funded by the Collaborative Research Center (Sonderforschungsbereich 1287). However, we will charge 40 Euros to cover costs for coffee and snacks during the breaks and social hours. And participants will have to pay for their own accommodation.
For details, see: https://vasishth.github.io/
Curriculum:
1. Introduction to Bayesian data analysis (maximum 30 participants). Taught by Shravan Vasishth, assisted by Anna Laurinavichyute, and Paula Lissón
This course is an introduction to Bayesian modeling, oriented towards linguists and psychologists. Topics to be covered: Introduction to Bayesian data analysis, Linear Modeling, Hierarchical Models. We will cover these topics within the context of an applied Bayesian workflow that includes exploratory data analysis, model fitting, and model checking using simulation. Participants are expected to be familiar with R, and must have some experience in data analysis, particularly with the R library lme4.

Course Materials: Previous year's course web page: all materials (videos etc.) from the previous year are available here.
Textbook: here. We will work through the first six chapters.
This course assumes that participants have some experience in Bayesian modeling already using brms and want to transition to Stan to learn more advanced methods and start building simple computational cognitive models. Participants should have worked through or be familiar with the material in the first five chapters of our book draft: Introduction to Bayesian Data Analysis for Cognitive Science. In this course, we will cover Parts III to V of our book draft: model comparison using Bayes factors and k-fold cross validation, introduction and relatively advanced models with Stan, and simple computational cognitive models.
Participants will be expected to have used linear mixed models before, to the level of the textbook by Winter (2019, Statistics for Linguists), and to want to acquire a deeper knowledge of frequentist foundations and a deeper understanding of the linear mixed modeling framework. Participants are also expected to have fit multiple regressions. We will cover model selection and contrast coding, with a heavy emphasis on simulations to compute power and to understand what the model implies. We will work on (at least some of) the participants' own datasets. This course is not appropriate for researchers new to R or to frequentist statistics.
Applicants must have experience with linear mixed models and be interested in learning how to carry out such analyses with the Julia-based MixedModels.jl package (i.e., the analogue of the R-based lme4 package). MixedModels.jl has some significant advantages, among them: (a) a new and more efficient computational implementation, (b) speed, needed for, e.g., complex designs and power simulations, (c) more flexibility in the selection of parsimonious mixed models, and (d) more flexibility in taking into account autocorrelations or other dependencies, e.g., in typical EEG- and fMRI-based time series (under development). We do not expect profound knowledge of Julia from participants; the necessary subset will be taught on the first day of the course. We do expect a readiness to install Julia and the confidence that, with some basic instruction, participants will be able to adapt prepared Julia scripts for their own data or to translate some of their own lme4 commands into the equivalent MixedModels.jl commands. The course will be taught in a hybrid IDE. There is already the option to execute R chunks from within Julia, meaning one needs Julia primarily for the execution of MixedModels.jl commands as a replacement for lme4. There is also an option to call MixedModels.jl from within R and process the resulting object like an lme4 object. Thus, much of the pre- and postprocessing (e.g., data simulation for complex experimental designs; visualization of partial-effect interactions or shrinkage effects) can be carried out in R.
Course Materials Github repo: here.
New paper in Computational Brain and Behavior: Sample size determination in Bayesian Linear Mixed Models
We've just had a paper accepted in Computational Brain and Behavior, an open access journal of the Society for Mathematical Psychology.
Even though I am not a psychologist, I feel an increasing affinity for this field compared to psycholinguistics proper. I will be submitting more of my papers to this journal and to other open access journals (Glossa Psycholinguistics and Open Mind in particular) in the future.
Some things I liked about this journal:
- A fast and well-informed, intelligent, useful set of reviews. The reviewers actually understand what they are talking about! It's refreshing to find people out there who speak my language (and I don't mean English or Hindi). Also, the reviewers signed their reviews. This doesn't usually happen.
- Free availability of the paper after publication; I didn't have to do anything to make this happen. By contrast, I don't even have copies of my own articles published in APA journals. The same goes for Elsevier journals like the Journal of Memory and Language. Either I shell out $$$ to make the paper open access, or I learn to live with the arXiv version of my paper.
- The proofing was *excellent*. By contrast, the Journal of Memory and Language introduces approximately 500 mistakes into my papers every time it publishes one (which we then have to correct, if we catch them at all). E.g., in this paper we had to issue a correction about a German example; the error was added by the proofer! Another surprising example of JML actually destroying our paper's formatting is this one; here, the arXiv version has better formatting than the published paper, which cost several thousand Euros!
- LaTeX is encouraged. By contrast, APA journals demand that papers be submitted in W**d.
Here is the paper itself: here. We present an approach, adapted from the work of two statisticians (Wang and Gelfand), for determining the approximate sample size needed for drawing meaningful inferences using Bayes factors in hierarchical models (aka linear mixed models). The example comes from a psycholinguistic study, but the method is general. Code and data are of course available online.
The pdf: https://link.springer.com/article/10.1007/s42113-021-00125-y
Thursday, February 03, 2022
EMLAR 2022 tutorial on Bayesian methods
At EMLAR 2022 I will teach two sessions that will introduce Bayesian methods. Here is the abstract for the two sessions:
EMLAR 2022: An introduction to Bayesian data analysis
Taught by Shravan Vasishth (vasishth.github.io)
Session 1. Tuesday 19 April 2022, 1-3PM (Zoom link will be provided)
Modern probabilistic programming languages like Stan (mc-stan.org) have made Bayesian methods increasingly accessible to researchers in linguistics and psychology. However, finding an entry point into these methods is often difficult for researchers. In this tutorial, I will provide an informal introduction to the fundamental ideas behind Bayesian statistics, using examples that illustrate applications to psycholinguistics. I will also discuss some of the advantages of the Bayesian approach over the standardly used frequentist paradigms: uncertainty quantification, robust estimates through regularization, the ability to incorporate expert and/or prior knowledge into the data analysis, and the ability to flexibly define the generative process and thereby to directly address the actual research question (as opposed to a straw-man null hypothesis).

Suggestions for further reading will be provided. In this tutorial, I presuppose that the audience is familiar with linear mixed models (as used in R with the package lme4).
Session 2. Thursday 21 April 2022, 9:30-11:30 (Zoom link will be provided)
This session presupposes that the participant has attended Session 1. I will show some case studies using brms and Stan code that will demonstrate the major applications of Bayesian methods in psycholinguistics. I will reference/use some of the material described in this online textbook (in progress):
Thursday, January 20, 2022
New opinion paper in Trends in Cognitive Sciences: Data Assimilation in Dynamical Cognitive Science (Engbert et al.)
Here's a new opinion paper in Trends in Cognitive Sciences, by Ralf Engbert, Max Rabe, et al.
Link to paper: here
Tuesday, December 14, 2021
New paper: Syntactic and semantic interference in sentence comprehension: Support from English and German eye-tracking data
A long-standing debate in the sentence processing literature concerns the time course of syntactic and semantic information in online sentence comprehension. The default assumption in cue-based models of parsing is that syntactic and semantic retrieval cues simultaneously guide dependency resolution. When retrieval cues match multiple items in memory, this leads to similarity-based interference. Both semantic and syntactic interference have been shown to occur in English. However, the relative timing of syntactic vs. semantic interference remains unclear. In this first-ever cross-linguistic investigation of the time course of syntactic vs. semantic interference, the data from two eye-tracking reading experiments (English and German) suggest that the two types of interference can in principle arise simultaneously during retrieval. However, the data also indicate that semantic cues may be evaluated with a small timing lag in German compared to English. This suggests that there may be cross-linguistic variation in how syntactic and semantic cues are used to resolve linguistic dependencies in real-time.
Download pdf from here: https://psyarxiv.com/ua9yv
New paper in Computational Brain and Behavior: Sample size determination for Bayesian hierarchical models commonly used in psycholinguistics
van Doorn, J., Aust, F., Haaf, J.M. et al. Bayes Factors for Mixed Models. Computational Brain and Behavior (2021). https://doi.org/10.1007/s42113-021-00113-2
There are quite a few papers in that special issue, all worth reading, but I especially liked the contribution by Singmann et al.: Statistics in the Service of Science: Don't Let the Tail Wag the Dog (https://psyarxiv.com/kxhfu/). They make some very good points in reaction to van Doorn et al.'s paper.
Abstract: We discuss an important issue that is not directly related to the main theses of the van Doorn et al. (2021) paper, but which frequently comes up when using Bayesian linear mixed models: how to determine sample size in advance of running a study when planning a Bayes factor analysis. We adapt a simulation-based method proposed by Wang and Gelfand (2002) for a Bayes-factor based design analysis, and demonstrate how relatively complex hierarchical models can be used to determine approximate sample sizes for planning experiments.
Code and data: https://osf.io/hjgrm/
pdf: here
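The simulation-based logic described in the abstract above (adapted from Wang and Gelfand, 2002) can be sketched for a far simpler setting than the hierarchical models in the paper: a normal mean with known standard deviation, a point null versus a normal prior on the effect, where the Bayes factor has a closed form. The effect size, prior scale, and BF threshold below are illustrative assumptions, not values from the paper:

```python
import math
import random

def normal_pdf(x, sd):
    """Density of a Normal(0, sd^2) distribution at x."""
    return math.exp(-0.5 * (x / sd) ** 2) / (sd * math.sqrt(2 * math.pi))

def bf10(mean_obs, n, sigma=1.0, tau=0.5):
    """Bayes factor for H1: mu ~ Normal(0, tau^2) vs H0: mu = 0,
    given the observed mean of n observations with known sigma.
    Under H0 the sample mean is Normal(0, se^2); under H1 it is
    marginally Normal(0, tau^2 + se^2)."""
    se = sigma / math.sqrt(n)
    return normal_pdf(mean_obs, math.sqrt(tau ** 2 + se ** 2)) / \
        normal_pdf(mean_obs, se)

def prop_conclusive(n, true_effect, nsim=2000, threshold=10, seed=1):
    """Simulate nsim experiments of size n under a true effect and
    return the proportion yielding BF10 > threshold."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(nsim):
        m = rng.gauss(true_effect, 1.0 / math.sqrt(n))
        if bf10(m, n) > threshold:
            hits += 1
    return hits / nsim

# Increase n until, say, 80% of simulated studies give BF10 > 10
for n in (25, 50, 100, 200, 400):
    print(n, prop_conclusive(n, true_effect=0.3))
```

One then chooses the smallest n at which the proportion of "conclusive" simulated studies reaches the desired level; the paper does the analogous computation for hierarchical models, where the Bayes factor must itself be estimated by simulation.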
Tuesday, December 07, 2021
New paper accepted in MIT Press Journal Open Mind: Individual differences in cue weighting in sentence comprehension: An evaluation using Approximate Bayesian Computation
The reviews from Open Mind were of very high quality, certainly as high as, or higher than, the reviews I have received from many top closed-access journals over the last 20 years. The journal has a top-notch editorial board, led by none other than Ted Gibson. This is our second paper in Open Mind; the first was this one. I plan to publish more of our papers in this journal (along with the other open access journal, Glossa Psycholinguistics, also led by a stellar set of editors, Fernanda Ferreira and Brian Dillon). I hope that these open access journals can become the norm for our field. I wonder what it will take for that to happen.
Himanshu Yadav, Dario Paape, Garrett Smith, Brian W. Dillon, and Shravan Vasishth. Individual differences in cue weighting in sentence comprehension: An evaluation using Approximate Bayesian Computation. Open Mind, 2021. Provisionally accepted.
The pdf is here.
Monday, December 06, 2021
New paper: Similarity-based interference in sentence comprehension in aphasia: A computational evaluation of two models of cue-based retrieval.
Title: Similarity-based interference in sentence comprehension in aphasia: A computational evaluation of two models of cue-based retrieval.
Abstract: Sentence comprehension requires the listener to link incoming words with short-term memory representations in order to build linguistic dependencies. The cue-based retrieval theory of sentence processing predicts that the retrieval of these memory representations is affected by similarity-based interference. We present the first large-scale computational evaluation of interference effects in two models of sentence processing – the activation-based model, and a modification of the direct-access model – in individuals with aphasia (IWA) and control participants in German. The parameters of the models are linked to prominent theories of processing deficits in aphasia, and the models are tested against two linguistic constructions in German: Pronoun resolution and relative clauses. The data come from a visual-world eye-tracking experiment combined with a sentence-picture matching task. The results show that both control participants and IWA are susceptible to retrieval interference, and that a combination of theoretical explanations (intermittent deficiencies, slow syntax, and resource reduction) can explain IWA’s deficits in sentence processing. Model comparisons reveal that both models have a similar predictive performance in pronoun resolution, but the activation-based model outperforms the direct-access model in relative clauses.
Download: here. Paula also has another paper, modeling English data from unimpaired controls and individuals with aphasia, in Cognitive Science.
Friday, November 12, 2021
Book: Sentence comprehension as a cognitive process: A computational approach (Vasishth and Engelmann)
Sunday, October 10, 2021
New paper: When nothing goes right, go left: A large-scale evaluation of bidirectional self-paced reading
Here's an interesting and important new paper led by the inimitable Dario Paape:
Title: When nothing goes right, go left: A large-scale evaluation of bidirectional self-paced reading
Download from: here.
Abstract:
In two web-based experiments, we evaluated the bidirectional self-paced reading (BSPR) paradigm recently proposed by Paape and Vasishth (2021). We used four sentence types: NP/Z garden-path sentences, RRC garden-path sentences, sentences containing inconsistent discourse continuations, and sentences containing reflexive anaphors with feature-matching but grammatically unavailable antecedents. Our results show that regressions in BSPR are associated with a decrease in positive acceptability judgments. Across all sentence types, we observed online reading patterns that are consistent with the existing eye-tracking literature. NP/Z but not RRC garden-path sentences also showed some indication of selective rereading, as predicted by the selective reanalysis hypothesis of Frazier and Rayner (1982). However, selective rereading was associated with decreased rather than increased sentence acceptability, which is not in line with the selective reanalysis hypothesis. We discuss the implications regarding the connection between selective rereading and conscious awareness, and for the use of BSPR in general.
Thursday, September 30, 2021
New paper on the reproducibility of JML articles (2019-21) after the open data policy was introduced
New paper by Anna Laurinavichyute and me:
The (ir)reproducibility of published analyses: A case study of 57 JML articles published between 2019 and 2021
Download from: https://psyarxiv.com/hf297/