Friday, January 02, 2015

A weird and unintended consequence of Barr et al's Keep It Maximal paper

Barr et al's well-intentioned paper is starting to lead to some seriously weird behavior in psycholinguistics! As a reviewer, I'm seeing submissions where people take the following approach:

1. Try to fit a "maximal" linear mixed model.  If you get a convergence failure (this happens a lot since we routinely run low power studies!), move to step 2.

By the way, the word maximal is ambiguous here, because you can have a "maximal" model with no correlation parameters estimated, or have one with correlations estimated. For a 2x2 design, the difference would look like:

correlations estimated: (1+factor1+factor2+interaction|subject) etc.

no correlations estimated: (factor1+factor2+interaction || subject) etc.

Both options can be considered maximal.]

2. Fit a repeated measures ANOVA. This means that you average over items to get F1 scores in the by-subject ANOVA. But this is cheating and amounts to p-value hacking. This effectively changes the between items variance to 0 because we aggregated over items for each subject in each condition. That is the whole reason why linear mixed models are so important; we can take both between item and between subject variance into account simultaneously. People mistakenly think that the linear mixed model and rmANOVA are exactly identical. If your experiment design calls for crossed varying intercepts and varying slopes (and it always does in psycholinguistics), an rmANOVA is not identical to the LMM, for the reason I give above. In the old days we used to compute minF.  In 2014, I mean, 2015, it makes no sense to do that if you have a tool like lmer.

As always, I'm happy to get comments on this.

Sunday, November 30, 2014

Misunderstanding p-values

These researchers did a small between-patient study with low power to compare people on 24 hours of dialysis vs 12 hours of dialysis a week. They found that patients in the 24 hour arm had improved blood pressure (reduced intake of BP meds in the 24 hour arm), improved potassium and phosphate levels, and found no significant differences in a quality of life questionnaire given to the two arms. From this, the main conclusion they present is that (italics mine) "extending weekly dialysis hours for 12 months did not improve quality of life, but was associated with improvement of some laboratory parameters and reduced blood pressure requirement."

If medical researchers can't even figure out what they can conclude from a null result from a low powered study, they should not be allowed to do such studies. I also looked at the quality of life questionnaire they used. This questionnaire doesn't even begin to address important indicators of the quality of life of a patient on hemodialysis. A lot depends on the type of life the patient on dialysis was leading before he/she got into the study; what he/she does for a living (if anything), what other health problems he/she has,... These are the things that the questionnaire would measure; the questionnaire doesn't even tackle relevant quality of life variables associated with increased dialysis.

So, not only did they draw the wrong conclusion from their null result, the instrument they are using is not even the appropriate one. It would still have been just fine if they had not written "extending weekly dialysis hours for 12 months did not improve quality of life."

What a waste of money and time this is. It is really disappointing that such poor research passes the rigorous peer review of the Journal of the American Society of Nephrology. Here is what they say in their abstracts book:

"Abstract submissions were rigorously reviewed and graded by multiple experts."

What the journal needs is statisticians reading and vetting these abstracts.


Response to John Kruschke

I wanted to post this reply to John Kruschke's blog post, but the blog comment box does not allow such a long response, so I posted it on my own blog and will link it in the comment box:

Hi John,

thanks for the detailed responses, and for the friendly tone of your response, I appreciate it.

I will try to write a more detailed review of the book to give some suggestions for the next edition, but I just wanted to respond to your comments:

1. Price: I agree that it's relative. But your argument assumes a US audience; people are often willing to pay outrageous amounts for things that are priced much more reasonably (and realistically) in Europe. Is the book primarily targeted to the US population? If not, the price is unreasonable. I cannot ask my students to buy this book when much cheaper ones exist. Even Gelman et al release slides that cover the entire or a substantial part of the  BDA book. The analogy with calculus book is not valid either; Gilbert Strang's calculus book is available free on the internet, and there are many other free textbooks of very high quality. For statistics, there's Kerns, Michael Lavine's book, and for probability there are several great books available for free.

This book is more accessible than BDA and could become the standard text in psycholinguistics/psychology/linguistics. Why not halve the price and make it easier to get hold of? Even better, release a free version on the web. I could then even set it as a textbook in my courses, and I would.

2. Regarding the frequentist discussion, you wrote: "The vast majority of users of traditional frequentist statistics don't know why they should bother with taking the effort to learn Bayesian methods." 


"Again, I think it's important for beginners to see the contrast with frequentist methods, so that they know why to bother with Bayesian methods."

My objection is that the criticism of frequentist methods is not the primary motivation for using Bayesian methods. I agree that people don't understand p-values and CIs. But the solution to that is to educate them so they understand them, the motivation for using Bayes cannot be that people don't understand frequentist methods
and/or abuse them. The next step would be to not use Bayesian methods because people who use it don't understand them and/or abuse them.

The primary motivation for me for using Bayes is the astonishing flexibility of Bayesian tools. It's not the only motivation, but this one thing outweighs everything else for me.

Also, even if the user of frequentist statistics realizes the problems inherent in the abuse of frequentist tools, this alone won't be sufficient to motivate them to move to Bayesian statistics. A more inclusive philosophy would be more effective: for some things a frequentist method is just fine (used properly). For other things you really need Bayes. You don't always need a laser gun; there are times when a hammer would do just fine (my last sentence does not do justice to frequentist tools, which are often really sophisticated).

3. "If anything, I find that adherence to frequentist methods require more blind faith than Bayesian methods, which to me just make rational sense. To the extent there is any tone of zealotry in my writing, it's only because the criticisms of p values and confidence intervals can come as a bit of a revelation after years of using p values without really understanding them."

I understand where you are coming from; I have also taken the same path of slowly coming to understand what the methodology was really saying, and initially I also fell into the trap of getting annoyed with frequentist methods and rejecting them outright.

But I have reconsidered my position and I think Bayes should be presented on its own merits. I can see that relating Bayes and freq. methods is necessary to clarify the differences, but this shouldn't run out of control. In my future courses that is the line I am going to take.

When I read material attacking frequentist methods *as a way to get to Bayes*, I am strongly reminded of the gurus in India who use a similar  strategy to make their new converts believe in them and drive out any loyalty to the old guru.   That is where my analogy to religion is coming from.  It's an old method, and I have seen religious zealots espousing "the one right way" using it.

4. "Well, yes, that is a major problem. But I don't think it's the only major problem. I think most users of frequentist methods don't understand what a p value and confidence interval really are. "

Often, these are the same thing. They are abused by many people because they don't understand them. An example is psycholinguistics, where we routinely publish null results in low power experiments as positive findings. The people who do that are not abusing statistics deliberately, they just don't know that a null result is not informative in their particular settings. Journal editors (top journals) think that a lower p-value gives you more evidence in favor of the specific alternative. They just don't understand it, but they are not involved in deception.

The set of people who understand the method and deliberately abuse it is probably nearly the empty set.  I don't know anyone in psycholinguistics who understands p-values and CIs and still abuses the method.

I'll write more later (and I have many positive comments!) once I've finished reading your 700+ page book! :)

Tuesday, November 25, 2014

Should we fit maximal linear mixed models?

Recently, Barr et al published a paper in the Journal of Memory and Language, arguing that we should fit maximal linear mixed models, i.e., fit models that have a full variance-covariance matrix specification for subject and for items. I suggest here that the recommendation should not be to fit maximal models, the recommendation should be to run high power studies.

I released a simulation on this blog some time ago arguing that the correlation parameters are pretty meaningless.  Dale Barr and Jake Westfall replied to my post, raising some interesting points. I have to agree with Dale's point that we should reflect the design of the experiment in the analysis; after all, our goal is to specify how we think the data were generated. But my main point is that given the fact that the culture in psycholinguistics is to run low power studies (we routinely publish null results with low power studies and present them as positive findings), fitting maximal models without asking oneself whether the various parameters are reasonably estimable will lead us to miss effects.

For me, the only useful recommendation to psycholinguists should be to run high power studies.

Consider two cases:

1. Run a low power study (the norm in psycholinguistics) where the null hypothesis is false.

If you blindly fit a maximal model, you are going to miss detecting the effect more often compared to when you fit a minimal model (varying intercepts only). For my specific example below, the proportions of false negatives is 38% (maximal) vs 9% (minimal).

In the top figure, we see that under repeated sampling, lmer is failing to estimate the true correlations for items (it's doing a better job for subjects because there is more data for subjects). Even though these are nuisance parameters, trying to estimate them for items in this dataset is a meaningless exercise (and the fact that the parameterization is going to influence the correlations is not the key issue here---that decision is made based on the hypotheses to be tested).

The lower figure shows that under repeated sampling, the effect (\mu is positive here, see my earlier post for details) is being missed much more often with a maximal model (black lines, 95% CIs) than with a varying intercepts model (red lines). The difference is in the miss probability is 38% (maximal) vs 9% (minimal).

2. Run a high power study.

Now, it doesn't really matter whether you fit a maximal model or not. You're going to detect the effect either way. The upper plot shows that under repeated sampling, lmer will tend to detect the true correlations correctly. The lower plot shows that in almost 100% of the cases, the effect is detected regardless of whether we fit a maximal model (black lines) or not (red lines).

My conclusion is that if we want to send a message regarding best practice to psycholinguistics, it should not be to fit maximal models. It should be to run high power studies. To borrow a phrase from Andrew Gelman's blog (or from Rob Weiss's), if you are running low power studies, you are leaving money on the table.

Here's my code to back up what I'm saying here. I'm happy to be corrected!

Saturday, November 22, 2014

Simulating scientists doing experiments

Following a discussion on Gelman's blog, I was playing around with simulating scientists looking for significant effects. Suppose each of 1000 scientists run 200 experiments in their lifetime, and suppose that 20% of the experiments are such that the null is true. Assume a low power experiment (standard in psycholinguistics; eyetracking studies even in journals like JML can easily have something like 20 subjects). E.g., with a sample size of 1000, delta of 2, and sd of 50, we have power around 15%. We will add the stringent condition that the scientist has to get one replication of a significant effect before they publish it.

What is the proportion of scientists that will publish at least one false positive in their lifetime? That was the question. Here's my simulation. You can increase the effect_size to 10 from 2 to see what happens in high power situations.

Comments and/or corrections are welcome.

Saturday, August 23, 2014

An adverse consequence of fitting "maximal" linear mixed models

Distribution of intercept-slope correlation estimates with 37 subjects, 15 items

Distribution of intercept-slope correlation estimates with 50 subjects, 30 items
Should one always fit a full variance covariance matrix (a "maximal" model) when one analyzes repeated measures data-sets using linear mixed models? Here, I present one reason why blindly fitting ''maximal'' models does not make much sense.

Let's create a repeated measures data-set that has two conditions (we want to keep this example simple), and the following underlying generative distribution, which is estimated from the Gibson and Wu 2012 (Language and Cognitive Processes) data-set. The dependent variable is reading time (rt).

rt_{i} = \beta_0 + u_{0j} + w_{0k} + (\beta_1 + u_{1j} + w_{1k}) \hbox{x}_i + \epsilon_i

  u_{0j} \\
  0 \\
  w_{0k} \\
  w_{1k} \\
N \left(
  0 \\

\Sigma_u =
\left[ \begin{array}{cc}
\sigma_{\mathrm{u0}}^2 & \rho_u \, \sigma_{u0} \sigma_{u1}  \\
\rho_u \, \sigma_{u0} \sigma_{u1} & \sigma_{u1}^2\end{array} \right]

\Sigma_w =
\left[ \begin{array}{cc}
\sigma_{\mathrm{w0}}^2 & \rho_w \, \sigma_{w0} \sigma_{w1}  \\
\rho_w \, \sigma_{w0} \sigma_{w1} & \sigma_{w1}^2\end{array} \right]

\epsilon_i \sim N(0,\sigma^2)

One difference from the Gibson and Wu data-set is that each subject is assumed to see each instance of each item (like in the old days of ERP research), but nothing hinges on this simplification; the results presented will hold regardless of whether we do a Latin square or not (I tested this).

The  parameters and sample sizes are assumed to have the following values:

* $\beta_1$=487
* $\beta_2$= 61.5

* $\sigma$=544
* $\sigma_{u0}$=160
* $\sigma_{u1}$=195
* $\sigma_{w0}$=154
* $\sigma_{w1}$=142
* $\rho_u=\rho_w$=0.6
* 37 subjects
* 15 items

Next, we generate data 100 times using the above parameter and model specification, and estimate (from lmer) the parameters each time. With the kind of sample size we have above, a maximal model does a terrible job of estimating the correlation parameters $\rho_u=\rho_w$=0.6.

However, if we generate data 100 times using 50 subjects instead of 37, and 30 items instead of 15, lmer is able to estimate the correlations reasonably well.

In both cases we fit ''maximal'' models; in the first case, it makes no sense to fit a "maximal" model because the correlation estimates tend to be over-estimated. The classical method (the generalized likelihood ratio test (the anova function in lme4) to find the ''best'' model) for determining which model is appropriate is discussed in the Pinheiro and Bates book, and would lead us to adopt a simpler model in the first case.

 Douglas Bates himself has something to say on this topic:

As Bates puts it:

"Estimation of variance and covariance components requires a large number of groups. It is important to realize this. It is also important to realize that in most cases you are not terribly interested in precise estimates of variance components. Sometimes you are but a substantial portion of the time you are using random effects to model subject-to-subject variability, etc. and if the data don't provide sufficient subject-to-subject variability to support the model then drop down to a simpler model. "

Here is the code I used:

Tuesday, December 17, 2013

lmer vs Stan for a somewhat involved dataset.

Here is a comparison of lmer vs Stan output on a mildly complicated dataset from a psychology expt. (Kliegl et al 2011). The data are here:

The data and paper available from:

I should say that datasets from psychology and psycholinguistic can be much more complicated than this. So this was only a modest test of Stan.

The basic result is that I was able to recover in Stan the parameter estimates (fixed effects) that were primarily of interest, compared to the lmer output. The sds of the variance components all come out pretty much the same in Stan vs lmer. The correlations estimated in Stan are much smaller than lmer, but this is normal: the bayesian models seem to be more conservative when it comes to estimating correlations between random effects.

Traceplots are here:

They look generally fine to me.

One very important fact about lmer vs Stan is that lmer took 23 seconds to return an answer, but Stan took 18,814 seconds (about 5 hours), running 500 iterations and 2 chains.

One caveat is that I do have to try to figure out how to speed up Stan so that we get the best performance out of it that is possible.

Monday, December 16, 2013

The most common linear mixed models in psycholinguistics, using JAGS and Stan

As part of my course in bayesian data analysis, I have put up some common linear mixed models that we fit in psycholinguistics. These are written in JAGS and Stan. Comments and suggestions for improvement are most welcome.