Saturday, August 23, 2014

An adverse consequence of fitting "maximal" linear mixed models

[Figure: Distribution of intercept-slope correlation estimates with 37 subjects, 15 items]

[Figure: Distribution of intercept-slope correlation estimates with 50 subjects, 30 items]
Should one always fit a full variance-covariance matrix (a "maximal" model) when analyzing repeated measures data-sets using linear mixed models? Here I present one reason why blindly fitting "maximal" models does not make much sense.

Let's create a repeated measures data-set with two conditions (to keep this example simple) and the following underlying generative distribution, estimated from the Gibson and Wu 2012 (Language and Cognitive Processes) data-set. The dependent variable is reading time (rt).

rt_{i} = \beta_0 + u_{0j} + w_{0k} + (\beta_1 + u_{1j} + w_{1k}) \hbox{x}_i + \epsilon_i

\left[ \begin{array}{c} u_{0j} \\ u_{1j} \end{array} \right] \sim
N \left( \left[ \begin{array}{c} 0 \\ 0 \end{array} \right], \Sigma_u \right),
\quad
\left[ \begin{array}{c} w_{0k} \\ w_{1k} \end{array} \right] \sim
N \left( \left[ \begin{array}{c} 0 \\ 0 \end{array} \right], \Sigma_w \right)

\Sigma_u =
\left[ \begin{array}{cc}
\sigma_{u0}^2 & \rho_u \, \sigma_{u0} \sigma_{u1}  \\
\rho_u \, \sigma_{u0} \sigma_{u1} & \sigma_{u1}^2\end{array} \right]

\Sigma_w =
\left[ \begin{array}{cc}
\sigma_{w0}^2 & \rho_w \, \sigma_{w0} \sigma_{w1}  \\
\rho_w \, \sigma_{w0} \sigma_{w1} & \sigma_{w1}^2\end{array} \right]

\epsilon_i \sim N(0,\sigma^2)

One difference from the Gibson and Wu data-set is that each subject is assumed to see each item in every condition (as in the old days of ERP research), but nothing hinges on this simplification; the results presented hold regardless of whether we use a Latin square design (I tested this).

The parameters and sample sizes are assumed to have the following values:

* $\beta_0$=487
* $\beta_1$=61.5

* $\sigma$=544
* $\sigma_{u0}$=160
* $\sigma_{u1}$=195
* $\sigma_{w0}$=154
* $\sigma_{w1}$=142
* $\rho_u=\rho_w$=0.6
* 37 subjects
* 15 items

Next, we generate data 100 times using the above parameters and model specification, and estimate the model with lmer each time. With the kind of sample size we have above, a maximal model does a terrible job of estimating the correlation parameters $\rho_u=\rho_w=0.6$.

However, if we generate data 100 times using 50 subjects instead of 37, and 30 items instead of 15, lmer is able to estimate the correlations reasonably well.

In both cases we fit "maximal" models; in the first case, fitting a "maximal" model makes no sense because the correlation estimates tend to be over-estimated. The classical method for determining which model is appropriate, the generalized likelihood ratio test (the anova function in lme4) used to find the "best" model, is discussed in the Pinheiro and Bates book, and would lead us to adopt a simpler model in the first case.

Douglas Bates himself has something to say on this topic. As he puts it:

"Estimation of variance and covariance components requires a large number of groups. It is important to realize this. It is also important to realize that in most cases you are not terribly interested in precise estimates of variance components. Sometimes you are but a substantial portion of the time you are using random effects to model subject-to-subject variability, etc. and if the data don't provide sufficient subject-to-subject variability to support the model then drop down to a simpler model. "

Here is the code I used:
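The R code itself did not survive in this copy of the post. As a stand-in, here is a rough Python sketch of the generative part of such a simulation (my own reconstruction from the model above, not the original script; the fitting step would be done in R with lme4's lmer, which is not reproduced here, and names like `gen_dataset` are mine):

```python
import math
import random

# Parameters from the post.
b0, b1 = 487.0, 61.5          # fixed intercept and slope
sigma = 544.0                 # residual standard deviation
sigma_u0, sigma_u1, rho_u = 160.0, 195.0, 0.6   # by-subject
sigma_w0, sigma_w1, rho_w = 154.0, 142.0, 0.6   # by-item
n_subj, n_item = 37, 15

def bivariate(s0, s1, rho):
    """One (intercept, slope) draw from N(0, Sigma) via the Cholesky factor."""
    z1, z2 = random.gauss(0, 1), random.gauss(0, 1)
    return s0 * z1, s1 * (rho * z1 + math.sqrt(1 - rho**2) * z2)

def gen_dataset():
    """Every subject sees every item in both conditions (x = -0.5 / +0.5)."""
    subj_eff = [bivariate(sigma_u0, sigma_u1, rho_u) for _ in range(n_subj)]
    item_eff = [bivariate(sigma_w0, sigma_w1, rho_w) for _ in range(n_item)]
    rows = []
    for j, (u0, u1) in enumerate(subj_eff):
        for k, (w0, w1) in enumerate(item_eff):
            for x in (-0.5, 0.5):
                rt = (b0 + u0 + w0) + (b1 + u1 + w1) * x + random.gauss(0, sigma)
                rows.append((j, k, x, rt))
    return rows

random.seed(1)
data = gen_dataset()
print(len(data))  # 37 * 15 * 2 = 1110 rows
```

In the full simulation this generation step would be repeated 100 times, with each simulated data-set passed to lmer with a maximal random-effects specification.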


Dale Barr said...

I'm missing the part about how biased/unreliable estimates of variance components are an "adverse consequence" of maximal LMEMs. As the quote from Bates suggests, getting good estimates of variance components is rarely of primary interest. These are generally considered to be nuisance parameters, included in the model to improve our estimates of the standard errors of fixed effects.
I'm not sure how using model selection to decide whether we should estimate these parameters at all avoids the bias that you document. When we decide not to estimate such parameters, by whatever procedure, we effectively assume the parameter equals zero. In that sense, model selection will mostly introduce more "bias" in the estimate of any non-zero covariance parameter. Perhaps there is some advantage for model selection when the parameter value is very close to zero. What I'd like to see most, however, is evidence of adverse consequences of max LMEMs in terms of the standard error for the corresponding fixed effect, since that is usually of primary interest. The very many simulations we have done, both in our "keep it maximal" paper and here, failed to detect any advantage whatsoever for model selection.

Jake Westfall said...

Hi Shravan,

Interesting post. Your main point is (correct me if I'm wrong) that fitting the maximal model, which includes the random intercept-slope correlations, doesn't make sense here because we don't have enough data to get good estimates of the correlation parameters. Furthermore, likelihood ratio testing would almost certainly lead us to prefer a simpler model, presumably one that sets these correlations to 0.

I think Barr et al. (2013), the progenitors of the "maximal model" terminology, already cover this pretty well in their section on "Design-driven versus data-driven random effects specification" (pp. 262-263). They point out that such a backward-selection policy will almost certainly lead to inflated Type 1 error rates. They also point out that as long as the model converges without complaint, there really is no problem with estimating these additional parameters even if our estimates of them are rather poor.

One other response is that setting correlation parameters to 0 can have some unexpected deleterious consequences. Basically, the fixed effects part of such a model is no longer invariant to the coding scheme used for the predictors. The correlation between random intercepts and random slopes depends on the scale of the predictor, so constraining that correlation to 0 will lead to different, non-equivalent models depending on, e.g., whether you have contrast- or dummy-coded categorical predictors. Obviously this is an undesirable feature of mixed models that set correlations to 0 rather than estimating them.

Another consequence of the correlation parameters depending on the scale of the predictor is that the results of likelihood ratio tests (LRTs) of the correlation parameters will also depend on the scale of the predictors! So in your example, given the way the predictors are coded, the "true" correlation is 0.6, and whether we reject the null hypothesis that the correlation is 0 is largely a question of statistical power. But note that we could just as well rescale the predictor so that the true correlation is exactly 0, in which case LRTs *derived from the same exact model* would NOT reject the null, except in the case of Type 1 errors. Or we could rescale the predictor so that the true correlation is close to +1 or -1, in which case a LRT probably *would* reject the null of 0 correlation, even with small sample sizes. In all of these cases the data and model are fundamentally the same, but our decision about whether we need the correlation parameters depends on our choice of scaling for the predictor, which is ultimately arbitrary.

Given all these complexities, I think the advice given by Barr et al. (2013) about maximal random effects specification is basically correct and useful, and that in general we should be cautious about constraining ANY estimable random effects unless their inclusion leads to convergence problems. Which is not to say that we should "never" do it... but we should be cautious.


Shravan Vasishth said...

Thanks, Dale and Jake,

I didn't see these comments until now.

I would like to quote Douglas Bates:

"Nontheless, the title does advise "keep it maximal" which I take as an
indication that you should start by fitting the most complex model you can
imagine or otherwise bad things can happen. My approach is more pragmatic,
I advise trying to determine what level of model complexity your data will

Perhaps you'd like to respond to that?

Shravan Vasishth said...

I hit enter too soon. I will post a longer response at some point and will respond to all the points both of you raise, but the central problem seems to be the reliance on p-values to make binary decisions about whether a predictor has an effect or not.