Tuesday, November 25, 2014

Should we fit maximal linear mixed models?

Recently, Barr et al published a paper in the Journal of Memory and Language, arguing that we should fit maximal linear mixed models, i.e., fit models that have a full variance-covariance matrix specification for subject and for items. I suggest here that the recommendation should not be to fit maximal models, the recommendation should be to run high power studies.

I released a simulation on this blog some time ago arguing that the correlation parameters are pretty meaningless.  Dale Barr and Jake Westfall replied to my post, raising some interesting points. I have to agree with Dale's point that we should reflect the design of the experiment in the analysis; after all, our goal is to specify how we think the data were generated. But my main point is that given the fact that the culture in psycholinguistics is to run low power studies (we routinely publish null results with low power studies and present them as positive findings), fitting maximal models without asking oneself whether the various parameters are reasonably estimable will lead us to miss effects.

For me, the only useful recommendation to psycholinguists should be to run high power studies.

Consider two cases:

1. Run a low power study (the norm in psycholinguistics) where the null hypothesis is false.

If you blindly fit a maximal model, you are going to miss detecting the effect more often compared to when you fit a minimal model (varying intercepts only). For my specific example below, the proportions of false negatives is 38% (maximal) vs 9% (minimal).







In the top figure, we see that under repeated sampling, lmer is failing to estimate the true correlations for items (it's doing a better job for subjects because there is more data for subjects). Even though these are nuisance parameters, trying to estimate them for items in this dataset is a meaningless exercise (and the fact that the parameterization is going to influence the correlations is not the key issue here---that decision is made based on the hypotheses to be tested).

The lower figure shows that under repeated sampling, the effect (\mu is positive here, see my earlier post for details) is being missed much more often with a maximal model (black lines, 95% CIs) than with a varying intercepts model (red lines). The difference is in the miss probability is 38% (maximal) vs 9% (minimal).



2. Run a high power study.




Now, it doesn't really matter whether you fit a maximal model or not. You're going to detect the effect either way. The upper plot shows that under repeated sampling, lmer will tend to detect the true correlations correctly. The lower plot shows that in almost 100% of the cases, the effect is detected regardless of whether we fit a maximal model (black lines) or not (red lines).

My conclusion is that if we want to send a message regarding best practice to psycholinguistics, it should not be to fit maximal models. It should be to run high power studies. To borrow a phrase from Andrew Gelman's blog (or from Rob Weiss's), if you are running low power studies, you are leaving money on the table.

Here's my code to back up what I'm saying here. I'm happy to be corrected!

https://gist.github.com/vasishth/42e3254c9a97cbacd490

Saturday, November 22, 2014

Simulating scientists doing experiments

Following a discussion on Gelman's blog, I was playing around with simulating scientists looking for significant effects. Suppose each of 1000 scientists run 200 experiments in their lifetime, and suppose that 20% of the experiments are such that the null is true. Assume a low power experiment (standard in psycholinguistics; eyetracking studies even in journals like JML can easily have something like 20 subjects). E.g., with a sample size of 1000, delta of 2, and sd of 50, we have power around 15%. We will add the stringent condition that the scientist has to get one replication of a significant effect before they publish it.

What is the proportion of scientists that will publish at least one false positive in their lifetime? That was the question. Here's my simulation. You can increase the effect_size to 10 from 2 to see what happens in high power situations.



Comments and/or corrections are welcome.


Saturday, August 23, 2014

An adverse consequence of fitting "maximal" linear mixed models

Distribution of intercept-slope correlation estimates with 37 subjects, 15 items

Distribution of intercept-slope correlation estimates with 50 subjects, 30 items
Should one always fit a full variance covariance matrix (a "maximal" model) when one analyzes repeated measures data-sets using linear mixed models? Here, I present one reason why blindly fitting ''maximal'' models does not make much sense.

Let's create a repeated measures data-set that has two conditions (we want to keep this example simple), and the following underlying generative distribution, which is estimated from the Gibson and Wu 2012 (Language and Cognitive Processes) data-set. The dependent variable is reading time (rt).

\begin{equation}\label{eq:ranslp2}
rt_{i} = \beta_0 + u_{0j} + w_{0k} + (\beta_1 + u_{1j} + w_{1k}) \hbox{x}_i + \epsilon_i
\end{equation}

\begin{equation}
\begin{pmatrix}
  u_{0j} \\
  u_{1j}
\end{pmatrix}
\sim
N\left(
\begin{pmatrix}
  0 \\
  0
\end{pmatrix},
\Sigma_{u}
\right)
\quad
\begin{pmatrix}
  w_{0k} \\
  w_{1k} \\
\end{pmatrix}
\sim
N \left(
\begin{pmatrix}
  0 \\
  0
\end{pmatrix},
\Sigma_{w}
\right)
\end{equation}


\begin{equation}\label{eq:sigmau}
\Sigma_u =
\left[ \begin{array}{cc}
\sigma_{\mathrm{u0}}^2 & \rho_u \, \sigma_{u0} \sigma_{u1}  \\
\rho_u \, \sigma_{u0} \sigma_{u1} & \sigma_{u1}^2\end{array} \right]
\end{equation}

\begin{equation}\label{eq:sigmaw}
\Sigma_w =
\left[ \begin{array}{cc}
\sigma_{\mathrm{w0}}^2 & \rho_w \, \sigma_{w0} \sigma_{w1}  \\
\rho_w \, \sigma_{w0} \sigma_{w1} & \sigma_{w1}^2\end{array} \right]
\end{equation}

\begin{equation}
\epsilon_i \sim N(0,\sigma^2)
\end{equation}

One difference from the Gibson and Wu data-set is that each subject is assumed to see each instance of each item (like in the old days of ERP research), but nothing hinges on this simplification; the results presented will hold regardless of whether we do a Latin square or not (I tested this).

The  parameters and sample sizes are assumed to have the following values:


* $\beta_1$=487
* $\beta_2$= 61.5

* $\sigma$=544
* $\sigma_{u0}$=160
* $\sigma_{u1}$=195
* $\sigma_{w0}$=154
* $\sigma_{w1}$=142
* $\rho_u=\rho_w$=0.6
* 37 subjects
* 15 items

Next, we generate data 100 times using the above parameter and model specification, and estimate (from lmer) the parameters each time. With the kind of sample size we have above, a maximal model does a terrible job of estimating the correlation parameters $\rho_u=\rho_w$=0.6.

However, if we generate data 100 times using 50 subjects instead of 37, and 30 items instead of 15, lmer is able to estimate the correlations reasonably well.

In both cases we fit ''maximal'' models; in the first case, it makes no sense to fit a "maximal" model because the correlation estimates tend to be over-estimated. The classical method (the generalized likelihood ratio test (the anova function in lme4) to find the ''best'' model) for determining which model is appropriate is discussed in the Pinheiro and Bates book, and would lead us to adopt a simpler model in the first case.

 Douglas Bates himself has something to say on this topic:

https://stat.ethz.ch/pipermail/r-sig-mixed-models/2014q3/022509.html

As Bates puts it:

"Estimation of variance and covariance components requires a large number of groups. It is important to realize this. It is also important to realize that in most cases you are not terribly interested in precise estimates of variance components. Sometimes you are but a substantial portion of the time you are using random effects to model subject-to-subject variability, etc. and if the data don't provide sufficient subject-to-subject variability to support the model then drop down to a simpler model. "

Here is the code I used:


Tuesday, December 17, 2013

lmer vs Stan for a somewhat involved dataset.

Here is a comparison of lmer vs Stan output on a mildly complicated dataset from a psychology expt. (Kliegl et al 2011). The data are here: https://www.dropbox.com/s/pwuz1g7rtwy17p1/KWDYZ_test.rda.

The data and paper available from: http://openscience.uni-leipzig.de/index.php/mr2

I should say that datasets from psychology and psycholinguistic can be much more complicated than this. So this was only a modest test of Stan.

The basic result is that I was able to recover in Stan the parameter estimates (fixed effects) that were primarily of interest, compared to the lmer output. The sds of the variance components all come out pretty much the same in Stan vs lmer. The correlations estimated in Stan are much smaller than lmer, but this is normal: the bayesian models seem to be more conservative when it comes to estimating correlations between random effects.

Traceplots are here: https://www.dropbox.com/s/91xhk7ywpvh9q24/traceplotkliegl2011.pdf

They look generally fine to me.

One very important fact about lmer vs Stan is that lmer took 23 seconds to return an answer, but Stan took 18,814 seconds (about 5 hours), running 500 iterations and 2 chains.

One caveat is that I do have to try to figure out how to speed up Stan so that we get the best performance out of it that is possible.



Monday, December 16, 2013

The most common linear mixed models in psycholinguistics, using JAGS and Stan

As part of my course in bayesian data analysis, I have put up some common linear mixed models that we fit in psycholinguistics. These are written in JAGS and Stan. Comments and suggestions for improvement are most welcome.

Code: http://www.ling.uni-potsdam.de/~vasishth/lmmexamplecode.txt
Data: http://www.ling.uni-potsdam.de/~vasishth/data/gibsonwu2012data.txt

Tuesday, October 08, 2013

New course on bayesian data analysis for psycholinguistics

I decided to teach a basic course on bayesian data analysis with a focus on psycholinguistics. Here is the course website (below). How could this possibly be a bad idea!

http://www.ling.uni-potsdam.de/~vasishth/advanceddataanalysis.html

Friday, March 15, 2013

How are the random effects (BLUPs) `predicted' in linear mixed models?




In linear mixed models, we fit models like these (the Ware-Laird formulation--see Pinheiro and Bates 2000, for example):

\begin{equation}
Y = X\beta + Zu + \epsilon
\end{equation}

Let $u\sim N(0,\sigma_u^2)$, and this is independent from $\epsilon\sim N(0,\sigma^2)$.

Given $Y$, the ``minimum mean square error predictor'' of $u$ is the conditional expectation:

\begin{equation}
\hat{u} = E(u\mid Y)
\end{equation}

We can find $E(u\mid Y)$ as follows. We write the joint distribution of $Y$ and $u$ as:

\begin{equation}
\begin{pmatrix}
Y \\
u
\end{pmatrix}
=
N\left(
\begin{pmatrix}
X\beta\\
0
\end{pmatrix},
\begin{pmatrix}
V_Y & C_{Y,u}\\
C_{u,Y} & V_u \\
\end{pmatrix}
\right)
\end{equation}

$V_Y, C_{Y,u}, C_{u,Y}, V_u$ are the various variance-covariance matrices.
It is a fact (need to track this down) that

\begin{equation}
u\mid Y \sim N(C_{u,Y}V_Y^{-1}(Y-X\beta)),
Y_u - C_{u,Y} V_Y^{-1} C_{Y,u})
\end{equation}

This apparently allows you to derive the BLUPs:

\begin{equation}
\hat{u}= C_{u,Y}V_Y^{-1}(Y-X\beta))
\end{equation}

Substituting $\hat{\beta}$ for $\beta$, we get:

\begin{equation}
BLUP(u)= \hat{u}(\hat{\beta})C_{u,Y}V_Y^{-1}(Y-X\hat{\beta}))
\end{equation}

Here is a working example:







Correlations of fixed effects in linear mixed models

Ever wondered what those correlations are in a linear mixed model? For example:


The estimated correlation between $\hat{\beta}_1$ and $\hat{\beta}_2$ is $0.988$.  Note that

$\hat{\beta}_1 = (Y_{1,1} + Y_{2,1} + \dots + Y_{10,1})/10=10.360$

and 

$\hat{\beta}_2 = (Y_{1,2} + Y_{2,2} + \dots + Y_{10,2})/10 = 11.040$

From this we can recover the correlation $0.988$ as follows:


By comparison, in the linear model version of the above:


because $Var(\hat{\beta}) = \hat{\sigma}^2 (X^T X)^{-1}$.