
Saturday, August 23, 2014

An adverse consequence of fitting "maximal" linear mixed models

[Figure: Distribution of intercept-slope correlation estimates with 37 subjects, 15 items]

[Figure: Distribution of intercept-slope correlation estimates with 50 subjects, 30 items]
Should one always fit a full variance-covariance matrix (a "maximal" model) when analyzing repeated measures datasets using linear mixed models? Here I present one reason why blindly fitting "maximal" models does not make much sense.

Let's create a repeated measures dataset with two conditions (to keep the example simple), generated from the following model, whose parameter values are estimated from the Gibson and Wu 2012 (Language and Cognitive Processes) dataset. The dependent variable is reading time (rt); $j$ indexes subjects and $k$ indexes items.

\begin{equation}\label{eq:ranslp2}
rt_{i} = \beta_0 + u_{0j} + w_{0k} + (\beta_1 + u_{1j} + w_{1k}) \hbox{x}_i + \epsilon_i
\end{equation}

\begin{equation}
\begin{pmatrix}
  u_{0j} \\
  u_{1j}
\end{pmatrix}
\sim
N\left(
\begin{pmatrix}
  0 \\
  0
\end{pmatrix},
\Sigma_{u}
\right)
\quad
\begin{pmatrix}
  w_{0k} \\
  w_{1k} \\
\end{pmatrix}
\sim
N \left(
\begin{pmatrix}
  0 \\
  0
\end{pmatrix},
\Sigma_{w}
\right)
\end{equation}


\begin{equation}\label{eq:sigmau}
\Sigma_u =
\left[ \begin{array}{cc}
\sigma_{u0}^2 & \rho_u \, \sigma_{u0} \sigma_{u1}  \\
\rho_u \, \sigma_{u0} \sigma_{u1} & \sigma_{u1}^2\end{array} \right]
\end{equation}

\begin{equation}\label{eq:sigmaw}
\Sigma_w =
\left[ \begin{array}{cc}
\sigma_{w0}^2 & \rho_w \, \sigma_{w0} \sigma_{w1}  \\
\rho_w \, \sigma_{w0} \sigma_{w1} & \sigma_{w1}^2\end{array} \right]
\end{equation}

\begin{equation}
\epsilon_i \sim N(0,\sigma^2)
\end{equation}

One difference from the Gibson and Wu dataset is that each subject is assumed to see each item in each condition (as in the old days of ERP research), but nothing hinges on this simplification; the results presented hold regardless of whether we use a Latin square design or not (I tested this).

The parameters and sample sizes are assumed to have the following values:


* $\beta_0$=487
* $\beta_1$=61.5
* $\sigma$=544
* $\sigma_{u0}$=160
* $\sigma_{u1}$=195
* $\sigma_{w0}$=154
* $\sigma_{w1}$=142
* $\rho_u=\rho_w$=0.6
* 37 subjects
* 15 items

Next, we generate data 100 times using the above parameters and model specification, and estimate the parameters with lmer each time. With this kind of sample size, a maximal model does a terrible job of estimating the correlation parameters $\rho_u=\rho_w=0.6$.

However, if we generate data 100 times using 50 subjects instead of 37, and 30 items instead of 15, lmer is able to estimate the correlations reasonably well.

In both cases we fit "maximal" models; in the first case, fitting a "maximal" model makes no sense because the correlation parameters tend to be overestimated. The classical method for determining which model is appropriate, the generalized likelihood ratio test (the anova function in lme4), is discussed in the Pinheiro and Bates (2000) book, and would lead us to adopt a simpler model in the first case.
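As a minimal sketch of that comparison (assuming a simulated data frame dat with columns rt, x, subj, and item, as produced by the code at the end of this post; lme4's double-bar syntax suppresses the correlation parameters):

## Maximal model vs. a model with the correlations suppressed:
m_max <- lmer(rt ~ x + (1 + x | subj) + (1 + x | item), data = dat)
m_nocorr <- lmer(rt ~ x + (1 + x || subj) + (1 + x || item), data = dat)
anova(m_nocorr, m_max)  # generalized likelihood ratio test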

Douglas Bates himself has something to say on this topic:

https://stat.ethz.ch/pipermail/r-sig-mixed-models/2014q3/022509.html

As Bates puts it:

"Estimation of variance and covariance components requires a large number of groups. It is important to realize this. It is also important to realize that in most cases you are not terribly interested in precise estimates of variance components. Sometimes you are but a substantial portion of the time you are using random effects to model subject-to-subject variability, etc. and if the data don't provide sufficient subject-to-subject variability to support the model then drop down to a simpler model. "

Here is the code I used:
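Below is a minimal sketch of the simulation, assuming lme4 for model fitting and MASS::mvrnorm for the correlated random effects; the plus/minus 0.5 contrast coding and the helper name gen_dat are illustrative choices, not fixed by the setup above.

library(MASS)   # mvrnorm, for the correlated random effects
library(lme4)

## Parameter values from the list above:
beta <- c(487, 61.5)
sigma_e <- 544
Sigma_u <- matrix(c(160^2, 0.6 * 160 * 195,
                    0.6 * 160 * 195, 195^2), nrow = 2)
Sigma_w <- matrix(c(154^2, 0.6 * 154 * 142,
                    0.6 * 154 * 142, 142^2), nrow = 2)

gen_dat <- function(nsubj = 37, nitem = 15) {
  ## Fully crossed: every subject sees every item in both conditions.
  dat <- expand.grid(subj = 1:nsubj, item = 1:nitem, x = c(-0.5, 0.5))
  u <- mvrnorm(nsubj, c(0, 0), Sigma_u)  # by-subject intercepts and slopes
  w <- mvrnorm(nitem, c(0, 0), Sigma_w)  # by-item intercepts and slopes
  dat$rt <- beta[1] + u[dat$subj, 1] + w[dat$item, 1] +
    (beta[2] + u[dat$subj, 2] + w[dat$item, 2]) * dat$x +
    rnorm(nrow(dat), 0, sigma_e)
  dat
}

## One simulation run; repeat 100 times and store the estimated
## correlations from VarCorr() to reproduce the plots above.
dat <- gen_dat()
m_max <- lmer(rt ~ x + (1 + x | subj) + (1 + x | item), data = dat)
VarCorr(m_max)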


Tuesday, December 17, 2013

lmer vs Stan for a somewhat involved dataset

Here is a comparison of lmer and Stan output on a mildly complicated dataset from a psychology experiment (Kliegl et al. 2011). The data are here: https://www.dropbox.com/s/pwuz1g7rtwy17p1/KWDYZ_test.rda.

The data and paper are available from: http://openscience.uni-leipzig.de/index.php/mr2

I should say that datasets from psychology and psycholinguistics can be much more complicated than this one, so this was only a modest test of Stan.

The basic result is that the fixed-effects estimates that were primarily of interest come out essentially the same in Stan as in lmer. The standard deviations of the variance components also come out pretty much the same in the two. The correlations estimated in Stan are much smaller than those from lmer, but this is normal: the Bayesian models seem to be more conservative when it comes to estimating correlations between random effects.

Traceplots are here: https://www.dropbox.com/s/91xhk7ywpvh9q24/traceplotkliegl2011.pdf

They look generally fine to me.

One very important fact about lmer vs Stan is that lmer took 23 seconds to return an answer, while Stan took 18,814 seconds (about 5 hours) to run 500 iterations with 2 chains.
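For reference, a minimal sketch of the kind of rstan call involved (the model file name kwdyz_lmm.stan and the data list stan_dat are placeholders, not the original code):

library(rstan)

## 500 iterations and 2 chains, as reported above:
fit <- stan(file = "kwdyz_lmm.stan", data = stan_dat,
            iter = 500, chains = 2)
print(fit)       # summary of posterior estimates
traceplot(fit)   # check mixing of the chains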

One caveat is that I still have to figure out how to speed up Stan so that we get the best possible performance out of it.



Monday, December 16, 2013

The most common linear mixed models in psycholinguistics, using JAGS and Stan

As part of my course in Bayesian data analysis, I have put up some common linear mixed models that we fit in psycholinguistics, written in JAGS and Stan. Comments and suggestions for improvement are most welcome.

Code: http://www.ling.uni-potsdam.de/~vasishth/lmmexamplecode.txt
Data: http://www.ling.uni-potsdam.de/~vasishth/data/gibsonwu2012data.txt

Tuesday, October 08, 2013

New course on bayesian data analysis for psycholinguistics

I decided to teach a basic course on Bayesian data analysis with a focus on psycholinguistics. The course website is below. How could this possibly be a bad idea!

http://www.ling.uni-potsdam.de/~vasishth/advanceddataanalysis.html

Friday, March 15, 2013

How are the random effects (BLUPs) 'predicted' in linear mixed models?




In linear mixed models, we fit models like the following (the Laird-Ware formulation; see, for example, Pinheiro and Bates 2000):

\begin{equation}
Y = X\beta + Zu + \epsilon
\end{equation}

Let $u\sim N(0,\Sigma_u)$, independent of $\epsilon\sim N(0,\sigma^2 I)$.

Given $Y$, the "minimum mean square error predictor" of $u$ is the conditional expectation:

\begin{equation}
\hat{u} = E(u\mid Y)
\end{equation}

We can find $E(u\mid Y)$ as follows. We write the joint distribution of $Y$ and $u$ as:

\begin{equation}
\begin{pmatrix}
Y \\
u
\end{pmatrix}
\sim
N\left(
\begin{pmatrix}
X\beta\\
0
\end{pmatrix},
\begin{pmatrix}
V_Y & C_{Y,u}\\
C_{u,Y} & V_u \\
\end{pmatrix}
\right)
\end{equation}

$V_Y, C_{Y,u}, C_{u,Y}, V_u$ are the various variance-covariance matrices.
By the standard formula for the conditional distribution of a multivariate normal,

\begin{equation}
u\mid Y \sim N(C_{u,Y}V_Y^{-1}(Y-X\beta),\;
V_u - C_{u,Y} V_Y^{-1} C_{Y,u})
\end{equation}

The conditional mean gives us the predictor:

\begin{equation}
\hat{u}= C_{u,Y}V_Y^{-1}(Y-X\beta)
\end{equation}

Substituting $\hat{\beta}$ for $\beta$, we get:

\begin{equation}
\mathrm{BLUP}(u) = \hat{u}(\hat{\beta}) = C_{u,Y}V_Y^{-1}(Y-X\hat{\beta})
\end{equation}

Here is a working example:
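Below is a minimal sketch of such an example (my own construction, assuming a simple random-intercepts model with simulated data; variable names are illustrative). It builds $C_{u,Y}$ and $V_Y$ from a fitted lme4 model and checks the formula above against ranef():

library(lme4)

## Simulated data: 10 subjects, 5 observations each, random intercepts only.
set.seed(1)
nsubj <- 10
nobs <- 5
subj <- factor(rep(1:nsubj, each = nobs))
x <- rnorm(nsubj * nobs)
u_true <- rnorm(nsubj, 0, 2)  # true random intercepts
y <- 100 + 3 * x + u_true[subj] + rnorm(nsubj * nobs, 0, 1)

m <- lmer(y ~ x + (1 | subj))

## Assemble the matrices in the formulas above from the fitted model:
X <- model.matrix(m)
Z <- t(as.matrix(getME(m, "Zt")))
sigma_u2 <- as.numeric(VarCorr(m)$subj)  # estimated Var(u)
sigma2 <- sigma(m)^2                     # estimated residual variance
Su <- sigma_u2 * diag(nsubj)                       # Sigma_u
VY <- Z %*% Su %*% t(Z) + sigma2 * diag(nrow(X))   # V_Y
CuY <- Su %*% t(Z)                                 # C_{u,Y}

## BLUP(u) = C_{u,Y} V_Y^{-1} (Y - X betahat):
uhat <- CuY %*% solve(VY, y - X %*% fixef(m))

## This agrees with lmer's conditional modes:
cbind(by_hand = as.numeric(uhat), lmer = ranef(m)$subj[, 1])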