Sunday, December 19, 2021

Generating data from a uniform distribution using R, without using R's runif function


One can easily generate data from a uniform(0,1) using the runif function in R:

runif(10)

##  [1] 0.25873184 0.06723362 0.07725857 0.65281945 0.43817895 0.35372059
##  [7] 0.14399150 0.16840633 0.24538047 0.95230596

But what if one doesn't have this function and needs to generate samples from a uniform(0,1)? Rejection sampling, for example, requires access to samples from a uniform(0,1).

Here is one way to generate uniform data.

Generating samples from a uniform(0,1)

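Samples from a uniform can be generated using the linear congruential generator algorithm (https://en.wikipedia.org/wiki/Linear_congruential_generator). Starting from a seed, the algorithm repeatedly replaces an integer state x with (mult*x + c) modulo a large number mod; dividing each x by mod gives a pseudo-random number between 0 and 1.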

Here is the code in R.

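# Linear congruential generator: x <- (x*mult+1) %% mod, rescaled to [0,1).
pseudo_unif<-function(mult=16807,
                      mod=(2^31)-1,
                      seed=123456789,
                      size=100000){
  U<-rep(NA_real_,size)
  x<-seed
  # seq_len() is safe for size=0 or 1; 2:size would count backwards there
  for(i in seq_len(size)){
    x<-(x*mult+1)%%mod   # advance the state (the increment is fixed at 1)
    U[i]<-x/mod          # rescale the state to the unit interval
  }
  return(U)
}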

u<-pseudo_unif()
hist(u,freq=FALSE)
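Besides looking at the histogram, one can run a few numerical sanity checks on the generated values. The following is only a rough sketch, not a formal test of randomness: the helper name lcg_unif is made up for this example (it restates the same recurrence so that the block runs on its own), and the particular checks (mean, variance, bin proportions, lag-1 correlation) are my choice.

```r
# Hypothetical helper: a compact restatement of the recurrence used above,
# x <- (x * mult + 1) %% mod, scaled to the unit interval by dividing by mod.
lcg_unif <- function(size, seed = 123456789, mult = 16807, mod = 2^31 - 1) {
  U <- numeric(size)
  x <- seed
  for (i in seq_len(size)) {
    x <- (x * mult + 1) %% mod
    U[i] <- x / mod
  }
  U
}

u <- lcg_unif(100000)

# A uniform(0,1) has mean 1/2 and variance 1/12.
c(mean = mean(u), var = var(u), expected_var = 1 / 12)

# Proportion of draws in each of 10 equal-width bins; each should be near 0.1.
bins <- cut(u, breaks = seq(0, 1, by = 0.1), include.lowest = TRUE)
bin_props <- as.vector(table(bins)) / length(u)
round(bin_props, 3)

# The correlation between successive draws should be close to 0.
lag1_cor <- cor(u[-1], u[-length(u)])
lag1_cor
```

Passing these checks does not show that the generator is good enough for serious simulation work; it only rules out gross problems.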

To generate data from a uniform on any interval going from low to high, rescale the uniform(0,1) samples:

gen_unif<-function(low=0,high=100,seed=987654321,
                   size=10000){
  low + (high-low)*pseudo_unif(seed=seed,size=size)
}

hist(gen_unif(),freq=FALSE)

The above code is based on: https://towardsdatascience.com/how-to-generate-random-variables-from-scratch-no-library-used-4b71eb3c8dc7
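To connect this back to the rejection sampling use case mentioned at the start, here is a minimal sketch that draws from a Beta(2,2) distribution using only hand-made uniforms. Everything in it is an assumption made for illustration: the helper lcg_unif (a restatement of the recurrence above so that the block is self-contained), the choice of Beta(2,2) as the target, and the use of alternating values from a single stream as the proposal and the accept/reject uniforms.

```r
# Hypothetical helper: same recurrence as above, restated so this block
# runs on its own.
lcg_unif <- function(size, seed = 123456789, mult = 16807, mod = 2^31 - 1) {
  U <- numeric(size)
  x <- seed
  for (i in seq_len(size)) {
    x <- (x * mult + 1) %% mod
    U[i] <- x / mod
  }
  U
}

# Target density: Beta(2,2), f(x) = 6 x (1 - x) on (0,1).
# Its maximum is 1.5 at x = 0.5, so M = 1.5 bounds f under a uniform proposal.
target <- function(x) 6 * x * (1 - x)
M <- 1.5

n_pairs <- 50000
u <- lcg_unif(2 * n_pairs)
proposal <- u[seq(1, 2 * n_pairs, by = 2)]  # candidate values
accept_u <- u[seq(2, 2 * n_pairs, by = 2)]  # uniforms for the accept/reject step

# Keep a candidate when a uniform(0, M) draw falls under the target density.
samples <- proposal[accept_u * M <= target(proposal)]

accept_rate <- length(samples) / n_pairs  # in theory 1/M = 2/3
c(accept_rate = accept_rate, mean = mean(samples), var = var(samples))
```

A Beta(2,2) has mean 0.5 and variance 0.05, so the last line gives a quick check; hist(samples,freq=FALSE) should show the symmetric hump of the target density.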

Tuesday, December 14, 2021

New paper: Syntactic and semantic interference in sentence comprehension: Support from English and German eye-tracking data

This paper is part of a larger project that has been running for 4-5 years, on the predictions of cue-based retrieval theories. This paper revisits Van Dyke 2007's design, using eye-tracking (the data are from comparable designs in English and German). The reading time patterns are consistent with syntactic interference at the moment of retrieval in both English and German. Semantic interference shows interesting differences between English and German---in English, semantic interference seems to happen simultaneously with syntactic interference, but in German, semantic interference is delayed (it appears in the post-critical region). The morphosyntactic properties of German could be driving the lag in semantic interference. We also discuss the data in the context of the quantitative predictions from the Lewis & Vasishth cue-based retrieval model.

One striking fact about psycholinguistics in general and interference effects in particular is that most of the data tend to come from English. Very few people work on non-English languages. I bet there are a lot of surprises in store for us once we step out of the narrow confines of English. I bet that most theories of sentence processing are overfitted to English and will not scale. And if you submit a paper to a journal using data from a non-English language, there will always be a reviewer or editor who asks you to explain why you chose language X!=English, and not English. Nobody ever questions you if you study English. A bizarre world.

Title: Syntactic and semantic interference in sentence comprehension: Support from English and German eye-tracking data

Abstract:

A long-standing debate in the sentence processing literature concerns the time course of syntactic and semantic information in online sentence comprehension. The default assumption in cue-based models of parsing is that syntactic and semantic retrieval cues simultaneously guide dependency resolution. When retrieval cues match multiple items in memory, this leads to similarity-based interference. Both semantic and syntactic interference have been shown to occur in English. However, the relative timing of syntactic vs. semantic interference remains unclear. In this first-ever cross-linguistic investigation of the time course of syntactic vs. semantic interference, the data from two eye-tracking reading experiments (English and German) suggest that the two types of interference can in principle arise simultaneously during retrieval. However, the data also indicate that semantic cues may be evaluated with a small timing lag in German compared to English. This suggests that there may be cross-linguistic variation in how syntactic and semantic cues are used to resolve linguistic dependencies in real-time.

Download pdf from here: https://psyarxiv.com/ua9yv

New paper in Computational Brain and Behavior: Sample size determination for Bayesian hierarchical models commonly used in psycholinguistics

We have just had a paper accepted in the journal Computational Brain and Behavior. This is part of a special issue that responds to the following paper on linear mixed models:
van Doorn, J., Aust, F., Haaf, J.M. et al. Bayes Factors for Mixed Models. Computational Brain and Behavior (2021). https://doi.org/10.1007/s42113-021-00113-2
There are quite a few papers in that special issue, all worth reading, but I especially liked the contribution by Singmann et al.: Statistics in the Service of Science: Don't let the Tail Wag the Dog (https://psyarxiv.com/kxhfu/). They make some very good points in reaction to van Doorn et al.'s paper.

Our paper: Shravan Vasishth, Himanshu Yadav, Daniel J. Schad, and Bruno Nicenboim. Sample size determination for Bayesian hierarchical models commonly used in psycholinguistics. Computational Brain and Behavior, 2021.
Abstract: We discuss an important issue that is not directly related to the main theses of the van Doorn et al. (2021) paper, but which frequently comes up when using Bayesian linear mixed models: how to determine sample size in advance of running a study when planning a Bayes factor analysis. We adapt a simulation-based method proposed by Wang and Gelfand (2002) for a Bayes-factor based design analysis, and demonstrate how relatively complex hierarchical models can be used to determine approximate sample sizes for planning experiments.
Code and data: https://osf.io/hjgrm/
pdf: here

Tuesday, December 07, 2021

New paper accepted in MIT Press Journal Open Mind: Individual differences in cue weighting in sentence comprehension: An evaluation using Approximate Bayesian Computation

My PhD student Himanshu Yadav has just had an important paper on modeling individual differences provisionally accepted in the open access journal Open Mind. One reason that this paper is important is that it demonstrates why it is crucial to understand systematic individual-level behavior in the data, and what this observed data implies for computational models of sentence processing. As Blastland and Spiegelhalter put it, "The average is an abstraction. The reality is variation." Our focus should be on understanding and explaining the variation, not just average behavior. More exciting papers on this topic are coming soon from Himanshu!

The reviews from Open Mind were of very high quality, certainly as high as or higher than those I have received from many top closed-access journals over the last 20 years. The journal has a top-notch editorial board, led by none other than Ted Gibson. This is our second paper in Open Mind; the first was this one. I plan to publish more of our papers in this journal (along with the other open access journal, Glossa Psycholinguistics, also led by a stellar set of editors, Fernanda Ferreira and Brian Dillon). I hope that these open access journals can become the norm for our field. I wonder what it will take for that to happen.

Himanshu Yadav, Dario Paape, Garrett Smith, Brian W. Dillon, and Shravan Vasishth. Individual differences in cue weighting in sentence comprehension: An evaluation using Approximate Bayesian Computation. Open Mind, 2021. Provisionally accepted.

The pdf is here.

Monday, December 06, 2021

New paper: Similarity-based interference in sentence comprehension in aphasia: A computational evaluation of two models of cue-based retrieval.

My PhD student Paula Lissón has just submitted this important new paper for review to a journal. This paper is important for several reasons but the most important one is that it's the first to quantitatively compare two competing computational models of retrieval in German sentence processing using data from unimpaired controls and individuals with aphasia. The work is the culmination of four years of hard work involving collecting a relatively large data-set (this amazing feat was achieved by Dorothea Pregla, and documented in a series of papers she has written, for example see this one in Brain and Language), and then developing computational models in Stan to systematically evaluate competing theoretical claims. This line of work should raise the bar in psycholinguistics when it comes to testing predictions of different theories. It is pretty common in psycholinguistics to wildly wave one's hands and say things like "sentence processing in individuals with aphasia is just noisy", and be satisfied with that statement and then publish it as a big insight into sentence processing difficulty. An important achievement of Paula's work, which builds on Bruno Nicenboim's research on Bayesian cognitive modeling, is to demonstrate how to nail down the claim and how to test it quantitatively. It seems kind of obvious that one should do that, but surprisingly, this kind of quantitative evaluation of models is still relatively rare in the field.

Title: Similarity-based interference in sentence comprehension in aphasia: A computational evaluation of two models of cue-based retrieval.

Abstract: Sentence comprehension requires the listener to link incoming words with short-term memory representations in order to build linguistic dependencies. The cue-based retrieval theory of sentence processing predicts that the retrieval of these memory representations is affected by similarity-based interference. We present the first large-scale computational evaluation of interference effects in two models of sentence processing – the activation-based model, and a modification of the direct-access model – in individuals with aphasia (IWA) and control participants in German. The parameters of the models are linked to prominent theories of processing deficits in aphasia, and the models are tested against two linguistic constructions in German: Pronoun resolution and relative clauses. The data come from a visual-world eye-tracking experiment combined with a sentence-picture matching task. The results show that both control participants and IWA are susceptible to retrieval interference, and that a combination of theoretical explanations (intermittent deficiencies, slow syntax, and resource reduction) can explain IWA’s deficits in sentence processing. Model comparisons reveal that both models have a similar predictive performance in pronoun resolution, but the activation-based model outperforms the direct-access model in relative clauses.

Download: here. Paula also has another paper modeling English data from unimpaired controls and individuals with aphasia, in Cognitive Science.

Monday, November 22, 2021

A confusing tweet on (not) transforming data keeps reappearing on the internet

I keep seeing this misleading comment on the internet over and over again:

Non-normality is relatively unimportant; at worst you just may lose a bit of power. I strongly recommend @StatModeling & Hill (2007, pp. 45-47)'s summary of key regression model assumptions. Normality of errors literally gets LOWEST priority. My experience supports this. 3/3 pic.twitter.com/R0BfQCoxdK
— Roger Levy (@roger_p_levy) December 8, 2018

Gelman is cited above, but Gelman himself has spoken out on this point and directly contradicts the above tweet: https://statmodeling.stat.columbia.edu/2019/08/21/you-should-usually-log-transform-your-positive-data/
Even the quoted part from the Gelman and Hill 2007 book is highly misleading because it is most definitely not about null hypothesis significance testing.
Non-normality is relatively unimportant in statistical data analysis the same way that a cricket ball is relatively unimportant in a cricket match. The players, the pitch, the bat, are much more important, but everyone would look pretty silly on the cricket field without that ball.
I guess if we really, really need a slogan to be able to do data analysis, it should be what one should call the MAM principle: model assumptions matter.

Friday, November 12, 2021

Book: Sentence comprehension as a cognitive process: A computational approach (Vasishth and Engelmann)

My book with Felix Engelmann has just been published. It puts together in one place 20 years of research on retrieval models, carried out by my students, colleagues, and myself.

Sunday, October 10, 2021

New paper: When nothing goes right, go left: A large-scale evaluation of bidirectional self-paced reading

Here's an interesting and important new paper led by the inimitable Dario Paape:

Title: When nothing goes right, go left: A large-scale evaluation of bidirectional self-paced reading

Download from: here.

Abstract:

In two web-based experiments, we evaluated the bidirectional self-paced reading (BSPR) paradigm recently proposed by Paape and Vasishth (2021). We used four sentence types: NP/Z garden-path sentences, RRC garden-path sentences, sentences containing inconsistent discourse continuations, and sentences containing reflexive anaphors with feature-matching but grammatically unavailable antecedents. Our results show that regressions in BSPR are associated with a decrease in positive acceptability judgments. Across all sentence types, we observed online reading patterns that are consistent with the existing eye-tracking literature. NP/Z but not RRC garden-path sentences also showed some indication of selective rereading, as predicted by the selective reanalysis hypothesis of Frazier and Rayner (1982). However, selective rereading was associated with decreased rather than increased sentence acceptability, which is not in line with the selective reanalysis hypothesis. We discuss the implications regarding the connection between selective rereading and conscious awareness, and for the use of BSPR in general.

Thursday, September 30, 2021

New paper on the reproducibility of JML articles (2019-21) after the open data policy was introduced

New paper by Anna Laurinavichyute and me:

The (ir)reproducibility of published analyses: A case study of 57 JML articles published between 2019 and 2021

Download from: https://psyarxiv.com/hf297/

Monday, September 20, 2021

Special issue in the journal Linguistics on the Replication Crisis

Friday, September 17, 2021

Applications are open: 2022 summer school on stats methods for ling and psych

Applications are now open for the sixth SMLP summer school, to be held in person (hopefully) at the Griebnitzsee campus of the University of Potsdam, Germany, 12-16 Sept 2022.

Apply here: https://vasishth.github.io/smlp2022/

Saturday, August 14, 2021

SAFAL 2: The Second South Asian Forum on the Acquisition and Processing of Language (30-31 August, 2021)


Details: https://sites.google.com/view/safal2021/home

The first South Asian Forum on the Acquisition and Processing of Language (SAFAL) highlighted the need to provide a platform for showcasing and discussing acquisition and processing research in the context of South Asian languages. The second edition aims to build on this endeavour.

Following the first edition, the Second South Asian Forum on the Acquisition and Processing of Language (SAFAL) aims to provide a platform to exchange research on sentence processing, verbal semantics, computational modeling, corpus-based psycholinguistics, neurobiology of language, and child language acquisition, among others, in the context of the subcontinent's linguistic landscape.

Invited speakers:

Sakshi Bhatia is an Assistant Professor of Linguistics at the Central University of Rajasthan. Her research areas include syntax, psycholinguistics and the syntax-psycholinguistics interface.

Kamal Choudhary is an Assistant Professor of Linguistics at the Indian Institute of Technology Ropar. His research areas include neurobiology of language and syntactic typology.

Shravan Vasishth's Slog (Statistics blog)
