
Monday, January 30, 2023

Introduction to Bayesian Data Analysis: Video lectures now available on YouTube

These recordings are part of a set of videos that are available from the free four-week online course Introduction to Bayesian Data Analysis, taught over the openhpi.de portal.

Tuesday, October 04, 2022

Applications open: The Seventh Summer School on Statistical Methods for Linguistics and Psychology, 11-15 September 2023

Applications are open (till 1st April 2023) for the seventh summer school on statistical methods for linguistics and psychology, to be held in Potsdam, Germany.

Summer school website: https://vasishth.github.io/smlp2023/

Some of the highlights:

1. Four parallel courses on frequentist and Bayesian methods (introductory/intermediate and advanced)

2. A special short course on Bayesian meta-analysis by Dr. Robert Grant of bayescamp.

3. You can also do this free, completely online four-week course on Introduction to Bayesian Data Analysis (starts Jan 2023): https://open.hpi.de/courses/bayesian-statistics2023

Thursday, September 08, 2022

Free MOOC course at openHPI.de: Introduction to Bayesian Data Analysis (starts 25 Jan 2023)

Details here: https://open.hpi.de/courses/bayesian-statistics2023

Friday, May 27, 2022

Summer School “Methods in Language Sciences” (16-20 August 2022, Ghent, Belgium): Registrations open

I was asked to advertise this summer school (I will be teaching a 2.5 day course on linear mixed modeling, and will give a keynote lecture on the use of Bayesian methods in linguistics/psychology). The text below is from the organizers.

Summer School “Methods in Language Sciences” 2022:

Registrations are open

Top quality research requires outstanding methodological skills. That is why the Department of Linguistics and the Department of Translation, Interpreting and Communication of Ghent University will jointly organize the (second edition of the) Summer School “Methods in Language Sciences” on 16-20 August 2022.

This Summer School is targeted at both junior and senior researchers and offers nine multi-day modules on various topics, ranging from quantitative to qualitative methods and covering introductory and advanced statistical analysis, Natural Language Processing (NLP), eye-tracking, survey design, ethnographic methods, as well as specific tools such as PRAAT and ELAN. In 2022 we have a new module on Linear Mixed Models. All lecturers are internationally recognized experts with a strong research and teaching background.

Because the modules will partly be held in parallel sessions, participants have to choose one or two modules to follow (see the Programme for details). No prerequisite knowledge or experience is required, except for Modules 2 and 9, which deal with advanced statistical data analysis.

We are proud to welcome two keynote speakers at this year’s summer school: Shravan Vasishth and Crispin Thurlow, who both also act as lecturers.

This is your opportunity to take your methodological skills for research in (applied) linguistics, translation or interpreting studies to the next level. We are looking forward to meeting you in Ghent!

Saturday, April 16, 2022

Ever wondered how the probability of the null hypothesis being true changes given a significant result?

TRIGGER WARNING: These simulations might fundamentally shake your belief system. USE WITH CARE.

In a recently accepted paper in the open access journal Quantitative Methods for Psychology that Daniel Schad led, we discuss how, using Bayes' rule, one can explore the change in the probability of a null hypothesis being true (call it theta) when you get a significant effect. The paper, which was inspired by a short comment in McElreath's book (first edition), shows that theta does not necessarily change much even if you get a significant result. The probability theta can change dramatically under certain conditions, but those conditions are either so stringent or so trivial that it renders many of the significance-based conclusions in psychology and psycholinguistics questionable at the very least.
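The Bayes' rule calculation described above can be sketched in a few lines of R. This is a minimal illustration, not the code from the paper or the shiny app: the function name post_null_given_sig and the values chosen for the Type I error rate (alpha) and power are assumptions made for this example, and the sketch assumes a simple setup in which the null is either true or false and a result is either significant or not.

```r
# theta: prior probability that the null hypothesis is true
# alpha: P(significant | null true); power: P(significant | null false)
# (function name and default values are hypothetical, for illustration only)
post_null_given_sig <- function(theta, alpha = 0.05, power = 0.80) {
  (alpha * theta) / (alpha * theta + power * (1 - theta))
}

# High power and a 50-50 prior: the probability of the null drops sharply
post_high_power <- post_null_given_sig(theta = 0.5, power = 0.80)

# Low power and a null that is a priori likely: theta barely moves from 0.9
post_low_power <- post_null_given_sig(theta = 0.9, power = 0.10)

round(c(high_power = post_high_power, low_power = post_low_power), 3)
```

Under these made-up inputs, a significant result takes theta from 0.5 to about 0.06 in the high-power case, but only from 0.9 to about 0.82 in the low-power case.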

You can do your own simulations, under assumptions that you consider more appropriate for your own research problem, using this shiny app (below), or play with the source code: here.

Wednesday, March 23, 2022

New paper: Some right ways to analyze (psycho)linguistic data

New paper (under review):

Title: Some right ways to analyze (psycho)linguistic data

Abstract:

Much has been written on the abuse and misuse of statistical methods, including p-values, statistical significance, etc. I present some of the best practices in statistics using a running example data analysis. Focusing primarily on frequentist and Bayesian linear mixed models, I illustrate some defensible ways in which statistical inference—specifically, hypothesis testing using Bayes factors vs. estimation or uncertainty quantification—can be carried out. The key is to not overstate the evidence and to not expect too much from statistics. Along the way, I demonstrate some powerful ideas, the most important ones being using simulation to understand the design properties of one’s experiment before running it, visualizing data before carrying out a formal analysis, and simulating data from the fitted model to understand the model’s behavior.

PDF: https://psyarxiv.com/y54va/

Summer School on Statistical Methods for Linguistics and Psychology, Sept. 12-16, 2022 (applications close April 1)

The Sixth Summer School on Statistical Methods for Linguistics and Psychology will be held in Potsdam, Germany, September 12-16, 2022. Like the previous editions of the summer school, this edition will have two frequentist and two Bayesian streams. Currently, this summer school is being planned as an in-person event.

The application form closes April 1, 2022. We will announce the decisions on or around April 15, 2022.

Course fee: There is no fee because the summer school is funded by the Collaborative Research Center (Sonderforschungsbereich 1287). However, we will charge 40 Euros to cover costs for coffee and snacks during the breaks and social hours. And participants will have to pay for their own accommodation.

For details, see: https://vasishth.github.io/smlp2022/

Curriculum:

1. Introduction to Bayesian data analysis (maximum 30 participants). Taught by Shravan Vasishth, assisted by Anna Laurinavichyute and Paula Lissón

This course is an introduction to Bayesian modeling, oriented towards linguists and psychologists. Topics to be covered: Introduction to Bayesian data analysis, Linear Modeling, Hierarchical Models. We will cover these topics within the context of an applied Bayesian workflow that includes exploratory data analysis, model fitting, and model checking using simulation. Participants are expected to be familiar with R, and must have some experience in data analysis, particularly with the R library lme4.
Course Materials Previous year's course web page: all materials (videos etc.) from the previous year are available here.
Textbook: here. We will work through the first six chapters.

2. Advanced Bayesian data analysis (maximum 30 participants). Taught by Bruno Nicenboim, assisted by Himanshu Yadav

This course assumes that participants have some experience in Bayesian modeling already using brms and want to transition to Stan to learn more advanced methods and start building simple computational cognitive models. Participants should have worked through or be familiar with the material in the first five chapters of our book draft: Introduction to Bayesian Data Analysis for Cognitive Science. In this course, we will cover Parts III to V of our book draft: model comparison using Bayes factors and k-fold cross validation, introduction and relatively advanced models with Stan, and simple computational cognitive models.

Course Materials Textbook here. We will start from Part III of the book (Advanced models with Stan). Participants are expected to be familiar with the first five chapters.

3. Foundational methods in frequentist statistics (maximum 30 participants). Taught by Audrey Buerki, Daniel Schad, and João Veríssimo.

Participants will be expected to have used linear mixed models before, to the level of the textbook by Winter (2019, Statistics for Linguists), and want to acquire a deeper knowledge of frequentist foundations, and understand the linear mixed modeling framework more deeply. Participants are also expected to have fit multiple regressions. We will cover model selection, contrast coding, with a heavy emphasis on simulations to compute power and to understand what the model implies. We will work on (at least some of) the participants' own datasets. This course is not appropriate for researchers new to R or to frequentist statistics.

Course Materials Textbook draft here.

4. Advanced methods in frequentist statistics with Julia (maximum 30 participants). Taught by Reinhold Kliegl, Phillip Alday, Julius Krumbiegel, and Doug Bates.
Applicants must have experience with linear mixed models and be interested in learning how to carry out such analyses with the Julia-based MixedModels.jl package (i.e., the analogue of the R-based lme4 package). MixedModels.jl has some significant advantages. Some of them are: (a) new and more efficient computational implementation, (b) speed (needed for, e.g., complex designs and power simulations), (c) more flexibility for selection of parsimonious mixed models, and (d) more flexibility in taking into account autocorrelations or other dependencies, as are typical of EEG- and fMRI-based time series (under development). We do not expect profound knowledge of Julia from participants; the necessary subset of knowledge will be taught on the first day of the course. We do expect a readiness to install Julia and the confidence that with some basic instruction participants will be able to adapt prepared Julia scripts for their own data or to adapt some of their own lme4 commands to the equivalent MixedModels.jl commands. The course will be taught in a hybrid IDE. There is already the option to execute R chunks from within Julia, meaning one needs Julia primarily for execution of MixedModels.jl commands as a replacement for lme4. There is also an option to call MixedModels.jl from within R and process the resulting object like an lme4 object. Thus, much of the pre- and postprocessing (e.g., data simulation for complex experimental designs; visualization of partial-effect interactions or shrinkage effects) can be carried out in R.
Course Materials Github repo: here.

Thursday, February 03, 2022

EMLAR 2022 tutorial on Bayesian methods

At EMLAR 2022 I will teach two sessions that will introduce Bayesian methods. Here is the abstract for the two sessions:

EMLAR 2022: An introduction to Bayesian data analysis

Taught by Shravan Vasishth (vasishth.github.io)

Session 1. Tuesday 19 April 2022, 1-3PM (Zoom link will be provided)

Modern probabilistic programming languages like Stan (mc-stan.org) have made Bayesian methods increasingly accessible to researchers in linguistics and psychology. However, finding an entry point into these methods is often difficult for researchers. In this tutorial, I will provide an informal introduction to the fundamental ideas behind Bayesian statistics, using examples that illustrate applications to psycholinguistics. I will also discuss some of the advantages of the Bayesian approach over the standardly used frequentist paradigms: uncertainty quantification, robust estimates through regularization, the ability to incorporate expert and/or prior knowledge into the data analysis, and the ability to flexibly define the generative process and thereby to directly address the actual research question (as opposed to a straw-man null hypothesis). Suggestions for further reading will be provided. In this tutorial, I presuppose that the audience is familiar with linear mixed models (as used in R with the package lme4).

Session 2. Thursday 21 April 2022, 9:30-11:30 (Zoom link will be provided)

This session presupposes that the participant has attended Session 1. I will show some case studies using brms and Stan code that will demonstrate the major applications of Bayesian methods in psycholinguistics. I will reference/use some of the material described in this online textbook (in progress):

https://vasishth.github.io/bayescogsci/book/

Saturday, January 22, 2022

Review of Writing Science by Joshua Schimel: Good advice on writing, but ignore his advice on statistical inference because it's just plain wrong

These days I am quite obsessed with figuring out how to improve the writing that is coming out of my lab. My postdocs generally produce solid writing, but my students struggle, just as I struggled when I was a student. So I bought a bunch of books written by experts to try to figure out what the best advice is out there on writing scientific articles. One of the very best books I have read is by Schimel:

Schimel, Joshua. Writing Science. Oxford University Press. Kindle Edition.

Schimel seems to be a heavy-weight in his field:

What I like most about his book is that he takes the attitude that the goal of writing is that people actually read your paper. He puts it differently: people should cite your paper. But I think he means that people should want to read your paper (it's unfortunately pretty common to cite someone's work without reading it, just because someone else cited it; one just copies over the citations).

His book treats writing as storytelling. A clear storyline has to be planned out before one puts pen to paper (or fingers to keyboard). Several types of openings are suggested, but the most sensible one for standard technical writing that is addressed to an expert audience is what he calls the OCAR style (the text below is quoted directly from his excellent blog: https://schimelwritingscience.wordpress.com/):

1. Opening: This should identify the larger problem you are contributing to, give readers a sense of the direction your paper is going, and make it clear why it is important. It should engage the widest audience practical. The problem may be applied or purely conceptual and intellectual—this is the reason you’re doing the work.

2. Challenge: What is your specific question or hypothesis? You might have a few, but there is often one overarching question, which others flesh out.

3. Action: What are the key results of your work? Identify no more than 2-3 points.

4. Resolution: What is your central conclusion and take home message? What have you learned about nature? If readers remember only one thing from your work, this should be it. The resolution should show how the results (Action) answer the question in the Challenge, and how doing so helps solve the problem you identified in the Opening.

The book spends a lot of time unpacking these ideas; I won't repeat the details here.

One problem I had with his examples was that they all lie outside my area of expertise, so I couldn't really appreciate what a good vs bad style was when looking at his specific examples. I think such books really have to be written for people working in particular fields; the title should reflect that. There is an urgent need for such a book specifically for psycholinguistics, with examples from our own field. I don't think that a student of psycholinguistics can pick up this book and learn anything much from the examples. The high-level advice is great, but it's hard to translate into actionable things in one's own field.

I have one major complaint about this book: Schimel gives absurdly incorrect advice to the reader about how to present and draw inferences from statistical results. To me it is quite surprising that you can become such a senior and well-cited scientist in an empirically driven field and have absolutely zero understanding of basic statistical concepts. Schimel would fail my intro stats 1 class.

Here is what he has to say (p 78 in my Kindle edition) about how to present statistical results. I bold-face the most egregious statements.

"As an example, consider figure 8.3. In panel A there is a large difference (the treatment is 2.3 x the control) that is unquestionably statistically significant. Panel B shows data with the same statistical significance ( p = 0.02), but the difference between the treatments is smaller. You could describe both of these graphs by saying, “The treatment significantly increased the response ( p = 0.02).” That would be true, but the stories in panels A and B are different — in panel A, there is a strong effect and in panel B, a weak one. I would describe panel A by saying, “The treatment increased the response by a factor of 2.3 ( p = 0.02)”; for panel B, I might write, “The treatment increased the response by only 30 percent, but this increase was statistically significant ( p = 0.02).”

Well, panel A is probably a Type M error (just look at the uncertainty of the estimates compared to panel B), and what he calls a weak effect in panel B is more likely to be the accurate estimate (again, just look at those uncertainty intervals). So it is very misleading to call A a strong effect and B a weak effect. If given data like in panels A and B, I would take panel B more seriously. I have ranted extensively about this point in a 2018 paper. And of course, others have long complained about this kind of misunderstanding (Gelman and Carlin, 2014).
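To make the Type M (magnitude) error point concrete, here is a small simulation sketch in the spirit of Gelman and Carlin (2014). The true effect, the standard error, and all variable names are hypothetical values chosen for illustration; they are not taken from Schimel's figure or from any real data set.

```r
set.seed(42)
true_effect <- 0.1   # hypothetical small true effect
se <- 0.5            # hypothetical standard error (a noisy, low-power study)
n_sim <- 100000

# Repeatedly "run" the study: each estimate is the true effect plus noise
est <- rnorm(n_sim, mean = true_effect, sd = se)

# Keep only the estimates that reach two-sided significance at 0.05
significant <- abs(est / se) > qnorm(0.975)

power <- mean(significant)
# Type M error: how much the significant estimates overstate the true effect
exaggeration <- mean(abs(est[significant])) / true_effect

c(power = power, exaggeration_ratio = exaggeration)
```

In a low-power setting like this one, an estimate has to be far from the true value just to cross the significance threshold, so the significant estimates are exaggerated by construction. A large, noisy, significant effect is therefore not necessarily a strong effect.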

But it gets worse. Here is what Schimel has to say about panel C. Again, I highlight the absurd part of his comments/advice:

"The tricky question is what to write about panel C. The difference between treatment and control is the same as in panel A (a factor of 2.3), but the data are more variable and so the statistics are weaker, in this case above the threshold that many use to distinguish whether there is a “significant” difference at all. Many would describe this panel by writing, “There was no significant effect of the treatment (p > 0.05).” Such a description, however, has several problems. The first problem is that many readers would infer that there was no difference between treatment and control. In fact though, they differed by a factor of 2.3. That is never the “same.” Also, with a p value of 0.07, the probability that the effect was due to the experimental treatment is still greater than 90 percent. Thus, a statement like this is probably making a Type II error — rejecting a real effect. The second problem is that just saying there was no significant effect mixes results and interpretation. When you do a statistical test, the F and p values are results. Deciding whether the test is significant is interpretation. When you describe the data solely in terms of whether the difference was significant, you present an interpretation of the data as the data, which violates an important principle of science. Any specific threshold for significance is an arbitrary choice with no fundamental basis in either science or statistics."

It is kind of fascinating, in a horrifying kind of way, to think that even today there are people out there who think that a p-value of 0.07 implies that the probability of the null being true is 0.07; he thinks that a p-value of 0.07 means that there is a 93% chance that the null is false, i.e., that the effect is real. To support the last sentence in the quote above, Schimel cites an introductory textbook written by statisticians: An introduction to the practice of statistics, by Moore and McCabe (who seem to be professional statisticians). I wanted to read this book to see what they say about p-values there, but it's not available as a Kindle edition and I can't be bothered to spend 80 Euros to get a hard copy.
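A quick simulation shows why the p-value cannot be read as the probability that the null is true. The sketch below rests on assumptions made up for this example: half of all studies have a true null, the other half have a true effect of one standard error, and the test is a two-sided z-test. Different assumptions give different numbers, which is exactly the point.

```r
set.seed(7)
n_sim <- 200000

# Hypothetical scenario: the null is true in half of all studies
null_true <- rbinom(n_sim, size = 1, prob = 0.5) == 1
# Hypothetical effect size (in standard-error units) when the null is false
delta <- ifelse(null_true, 0, 1)

# One z-statistic per simulated study, and its two-sided p-value
z <- rnorm(n_sim, mean = delta, sd = 1)
p <- 2 * pnorm(-abs(z))

# Among studies with a p-value close to 0.07, how often is the null true?
near_007 <- p > 0.06 & p < 0.08
prop_null <- mean(null_true[near_007])
prop_null
```

In this scenario, the proportion of true nulls among studies with p close to 0.07 is far larger than 0.07, so the claim that the effect is real with more than 90 percent probability does not follow from the p-value.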

Could it be that Schimel got his statistical education, such as it is, through misleading textbooks written by professional statisticians? Or did he just misunderstand what he read? I have no idea, but I find it depressing that such misleading and outright wrong recommendations can appear in a section on how to report one's results, and that this was written not by some obscure guy who knows nothing about nothing, but by a leading scientist in his field.

Anyway, despite my complaints, overall the book is great and worth reading. One can get a lot out of his other advice on writing. Just ignore everything he says about statistics and consult someone else who actually knows what they are talking about; maybe someone like Andrew Gelman. Gelman has written plenty on the topic of presenting one's data analyses and on statistical inference.

As mentioned above, Schimel also has a very cool blog (seems not to be currently in use) that has a lot of interesting and very readable posts: https://schimelwritingscience.wordpress.com/.

Sunday, December 19, 2021

Generating data from a uniform distribution using R, without using R's runif function


One can easily generate data from a uniform(0,1) using the runif function in R:

runif(10)

##  [1] 0.25873184 0.06723362 0.07725857 0.65281945 0.43817895 0.35372059
##  [7] 0.14399150 0.16840633 0.24538047 0.95230596

But what if one doesn’t have this function and needs to generate samples from a uniform(0,1)? Rejection sampling, for example, requires access to uniform(0,1) samples.

Here is one way to generate uniform data.

Generating samples from a uniform(0,1)

Samples from a uniform can be generated using the linear congruential generator algorithm (https://en.wikipedia.org/wiki/Linear_congruential_generator).

Here is the code in R.

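## Linear congruential generator: x_new = (mult * x_old + 1) mod `mod`;
## each state is rescaled to [0,1) by dividing by `mod`.
pseudo_unif<-function(mult=16807,
                      mod=(2^31)-1,
                      seed=123456789,
                      size=100000){
  U<-numeric(size)
  x<-seed
  ## seq_len() also handles size=1 correctly, unlike 2:size
  for(i in seq_len(size)){
    x<-(x*mult+1)%%mod
    U[i]<-x/mod
  }
  return(U)
}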

u<-pseudo_unif()
hist(u,freq=FALSE)
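Besides looking at the histogram, one rough check is to compare the sample mean and variance with the theoretical values for a uniform(0,1), namely 1/2 and 1/12, and to compute a Kolmogorov-Smirnov statistic against the uniform distribution. The sketch below redefines a compact version of the generator under the hypothetical name lcg_unif so that it runs on its own; it is only a sanity check, not a serious test of randomness.

```r
# Compact linear congruential generator (hypothetical helper name),
# using the same multiplier, increment, and modulus as the code above
lcg_unif <- function(size, seed = 123456789, mult = 16807, mod = 2^31 - 1) {
  u <- numeric(size)
  x <- seed
  for (i in seq_len(size)) {
    x <- (x * mult + 1) %% mod
    u[i] <- x / mod
  }
  u
}

u_check <- lcg_unif(100000)

# Theoretical values for a uniform(0,1): mean 1/2, variance 1/12
c(mean = mean(u_check), var = var(u_check), var_theory = 1 / 12)

# Kolmogorov-Smirnov distance between the sample and the uniform(0,1) CDF
ks <- ks.test(u_check, "punif")
ks$statistic
```

A small Kolmogorov-Smirnov statistic and moments close to the theoretical values are consistent with uniformity, although they do not detect every defect a generator can have (e.g., serial correlation).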

For generating data from any range going from low to high:

gen_unif<-function(low=0,high=100,seed=987654321,
                   size=10000){
  low + (high-low)*pseudo_unif(seed=seed,size=size)
}

hist(gen_unif(),freq=FALSE)

The above code is based on: https://towardsdatascience.com/how-to-generate-random-variables-from-scratch-no-library-used-4b71eb3c8dc7
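As mentioned at the start of this post, rejection sampling needs uniform(0,1) samples. Here is a minimal sketch of how a hand-rolled generator could be used for that purpose. The target density (a Beta(2,2)), the helper name lcg_unif, and the number of proposals are all choices made for this illustration; the generator is redefined in compact form so that the block runs on its own.

```r
# Compact linear congruential generator (hypothetical helper name)
lcg_unif <- function(size, seed = 123456789, mult = 16807, mod = 2^31 - 1) {
  u <- numeric(size)
  x <- seed
  for (i in seq_len(size)) {
    x <- (x * mult + 1) %% mod
    u[i] <- x / mod
  }
  u
}

# Target: Beta(2,2), with density f(x) = 6 * x * (1 - x) on (0,1).
# Proposal: uniform(0,1), with density 1. The bound M = 1.5 is max f(x),
# so the acceptance probability is f(x) / M = 4 * x * (1 - x).
n_prop <- 50000
draws <- lcg_unif(2 * n_prop)
x <- draws[seq(1, 2 * n_prop, by = 2)]  # proposed values
u <- draws[seq(2, 2 * n_prop, by = 2)]  # uniforms for the accept/reject step

accept <- u < 4 * x * (1 - x)
samples <- x[accept]

# Theory: acceptance rate 1/M = 2/3; Beta(2,2) has mean 0.5, variance 0.05
c(accept_rate = mean(accept), mean = mean(samples), var = var(samples))
```

The accepted values can be plotted with hist(samples, freq=FALSE) and compared against curve(dbeta(x, 2, 2), add=TRUE), assuming one allows oneself dbeta for the comparison.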

Shravan Vasishth's Slog (Statistics blog)
