
Tuesday, December 14, 2021

New paper: Syntactic and semantic interference in sentence comprehension: Support from English and German eye-tracking data

This paper is part of a larger project that has been running for 4-5 years, on the predictions of cue-based retrieval theories. It revisits the design of Van Dyke 2007, using eye-tracking (the data come from comparable designs in English and German). The reading time patterns are consistent with syntactic interference at the moment of retrieval in both languages. Semantic interference shows interesting differences between English and German: in English, semantic interference seems to happen simultaneously with syntactic interference, but in German, semantic interference is delayed (it appears in the post-critical region). The morphosyntactic properties of German could be driving this lag in semantic interference. We also discuss the data in the context of the quantitative predictions of the Lewis & Vasishth cue-based retrieval model.

One striking fact about psycholinguistics in general and interference effects in particular is that most of the data tend to come from English. Very few people work on non-English languages. I bet there are a lot of surprises in store for us once we step out of the narrow confines of English. I bet that most theories of sentence processing are overfitted to English and will not scale. And if you submit a paper to a journal using data from a non-English language, there will always be a reviewer or editor who asks you to explain why you chose language X!=English, and not English. Nobody ever questions you if you study English. A bizarre world.






Title: Syntactic and semantic interference in sentence comprehension: Support from English and German eye-tracking data

Abstract

A long-standing debate in the sentence processing literature concerns the time course of syntactic and semantic information in online sentence comprehension. The default assumption in cue-based models of parsing is that syntactic and semantic retrieval cues simultaneously guide dependency resolution. When retrieval cues match multiple items in memory, this leads to similarity-based interference. Both semantic and syntactic interference have been shown to occur in English. However, the relative timing of syntactic vs. semantic interference remains unclear. In this first-ever cross-linguistic investigation of the time course of syntactic vs. semantic interference, the data from two eye-tracking reading experiments (English and German) suggest that the two types of interference can in principle arise simultaneously during retrieval. However, the data also indicate that semantic cues may be evaluated with a small timing lag in German compared to English. This suggests that there may be cross-linguistic variation in how syntactic and semantic cues are used to resolve linguistic dependencies in real-time.

Download pdf from here: https://psyarxiv.com/ua9yv


New paper in Computational Brain and Behavior: Sample size determination for Bayesian hierarchical models commonly used in psycholinguistics

We have just had a paper accepted in the journal Computational Brain and Behavior. This is part of a special issue that responds to the following paper on linear mixed models:
van Doorn, J., Aust, F., Haaf, J.M. et al. Bayes Factors for Mixed Models. Computational Brain and Behavior (2021). https://doi.org/10.1007/s42113-021-00113-2
There are quite a few papers in that special issue, all worth reading, but I especially liked the contribution by Singmann et al., Statistics in the Service of Science: Don't Let the Tail Wag the Dog (https://psyarxiv.com/kxhfu/). They make some very good points in reaction to van Doorn et al.'s paper.


Our paper: Shravan Vasishth, Himanshu Yadav, Daniel J. Schad, and Bruno Nicenboim. Sample size determination for Bayesian hierarchical models commonly used in psycholinguistics. Computational Brain and Behavior, 2021.
Abstract: We discuss an important issue that is not directly related to the main theses of the van Doorn et al. (2021) paper, but which frequently comes up when using Bayesian linear mixed models: how to determine sample size in advance of running a study when planning a Bayes factor analysis. We adapt a simulation-based method proposed by Wang and Gelfand (2002) for a Bayes-factor based design analysis, and demonstrate how relatively complex hierarchical models can be used to determine approximate sample sizes for planning experiments.
Code and data: https://osf.io/hjgrm/
pdf: here
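The basic logic of the simulation-based approach is easy to sketch: assume an effect size, repeatedly simulate data for a candidate sample size, compute the Bayes factor for each simulated data set, and look at the resulting distribution of Bayes factors. Below is a minimal illustration in R; this is not the code from the paper (see the OSF repository above for that). It uses a simple one-sample design with a hypothetical 50 ms effect and 150 ms standard deviation, and the BayesFactor package rather than the hierarchical Stan/brms models discussed in the paper:

library(BayesFactor)

## Hypothetical effect size and standard deviation of by-subject difference
## scores; in a real design analysis these would come from prior studies or
## a meta-analysis.
simulate_bf <- function(n, effect = 50, sdev = 150, nsim = 100){
  sapply(seq_len(nsim), function(i){
    d <- rnorm(n, mean = effect, sd = sdev)  ## simulate difference scores
    extractBF(ttestBF(x = d))$bf             ## Bayes factor (BF10) vs. the null
  })
}

set.seed(1)
## Distribution of Bayes factors for two candidate sample sizes:
quantile(simulate_bf(n = 30), probs = c(0.1, 0.5, 0.9))
quantile(simulate_bf(n = 60), probs = c(0.1, 0.5, 0.9))

If, say, the 10th percentile of the Bayes factor distribution exceeds some target threshold at a given n, that sample size can be considered adequate under the assumed effect size.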

Tuesday, December 07, 2021

New paper accepted in MIT Press Journal Open Mind: Individual differences in cue weighting in sentence comprehension: An evaluation using Approximate Bayesian Computation

My PhD student Himanshu Yadav has just had an important paper on modeling individual differences provisionally accepted in the open access journal Open Mind. One reason that this paper is important is that it demonstrates why it is crucial to understand systematic individual-level behavior in the data, and what the observed data imply for computational models of sentence processing. As Blastland and Spiegelhalter put it, "The average is an abstraction. The reality is variation." Our focus should be on understanding and explaining the variation, not just average behavior. More exciting papers on this topic are coming soon from Himanshu!


The reviews from Open Mind were of very high quality, certainly as high as, or higher than, those I have received from many top closed-access journals over the last 20 years. The journal has a top-notch editorial board, led by none other than Ted Gibson. This is our second paper in Open Mind; the first was this one. I plan to publish more of our papers in this journal (along with the other open access journal, Glossa Psycholinguistics, also led by a stellar set of editors, Fernanda Ferreira and Brian Dillon). I hope that these open access journals can become the norm for our field. I wonder what it will take for that to happen.


Himanshu Yadav, Dario Paape, Garrett Smith, Brian W. Dillon, and Shravan Vasishth. Individual differences in cue weighting in sentence comprehension: An evaluation using Approximate Bayesian Computation. Open Mind, 2021. Provisionally accepted.


The pdf is here.

Monday, December 06, 2021

New paper: Similarity-based interference in sentence comprehension in aphasia: A computational evaluation of two models of cue-based retrieval.

My PhD student Paula Lissón has just submitted this important new paper to a journal for review. This paper is important for several reasons, but the most important one is that it is the first to quantitatively compare two competing computational models of retrieval in German sentence processing using data from unimpaired controls and individuals with aphasia. The work is the culmination of four years of hard work involving collecting a relatively large data-set (this amazing feat was achieved by Dorothea Pregla and is documented in a series of papers she has written; for example, see this one in Brain and Language), and then developing computational models in Stan to systematically evaluate competing theoretical claims. This line of work should raise the bar in psycholinguistics when it comes to testing the predictions of different theories. It is pretty common in psycholinguistics to wildly wave one's hands and say things like "sentence processing in individuals with aphasia is just noisy", be satisfied with that statement, and then publish it as a big insight into sentence processing difficulty. An important achievement of Paula's work, which builds on Bruno Nicenboim's research on Bayesian cognitive modeling, is to demonstrate how to nail down such a claim and test it quantitatively. It seems kind of obvious that one should do that, but surprisingly, this kind of quantitative evaluation of models is still relatively rare in the field.


Title: Similarity-based interference in sentence comprehension in aphasia: A computational evaluation of two models of cue-based retrieval.


Abstract: Sentence comprehension requires the listener to link incoming words with short-term memory representations in order to build linguistic dependencies. The cue-based retrieval theory of sentence processing predicts that the retrieval of these memory representations is affected by similarity-based interference. We present the first large-scale computational evaluation of interference effects in two models of sentence processing – the activation-based model, and a modification of the direct-access model – in individuals with aphasia (IWA) and control participants in German. The parameters of the models are linked to prominent theories of processing deficits in aphasia, and the models are tested against two linguistic constructions in German: Pronoun resolution and relative clauses. The data come from a visual-world eye-tracking experiment combined with a sentence-picture matching task. The results show that both control participants and IWA are susceptible to retrieval interference, and that a combination of theoretical explanations (intermittent deficiencies, slow syntax, and resource reduction) can explain IWA’s deficits in sentence processing. Model comparisons reveal that both models have a similar predictive performance in pronoun resolution, but the activation-based model outperforms the direct-access model in relative clauses.


Download: here. Paula also has another paper modeling English data from unimpaired controls and individuals with aphasia, in Cognitive Science.

Monday, November 22, 2021

A confusing tweet on (not) transforming data keeps reappearing on the internet

I keep seeing this misleading comment on the internet over and over again:

Gelman is cited above, but Gelman himself has spoken out on this point and directly contradicts the above tweet: https://statmodeling.stat.columbia.edu/2019/08/21/you-should-usually-log-transform-your-positive-data/
Even the quoted part from the Gelman and Hill 2007 book is highly misleading because it is most definitely not about null hypothesis significance testing.
Non-normality is relatively unimportant in statistical data analysis the same way that a cricket ball is relatively unimportant in a cricket match. The players, the pitch, the bat, are much more important, but everyone would look pretty silly on the cricket field without that ball.
I guess if we really, really need a slogan to be able to do data analysis, it should be what one should call the MAM principle: model assumptions matter.
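To make the principle concrete, here is the kind of quick check it implies, using simulated positive (right-skewed) data rather than any particular data set: fit the model on the raw and on the log scale and compare the residuals.

## Toy example: positive, right-skewed data generated on the log scale.
set.seed(123)
x <- rnorm(200)
y <- exp(1 + 0.5 * x + rnorm(200, sd = 0.5))

m_raw <- lm(y ~ x)        ## model fit on the raw scale
m_log <- lm(log(y) ~ x)   ## model fit on the log scale

## Compare the normality of the residuals under the two models:
par(mfrow = c(1, 2))
qqnorm(residuals(m_raw), main = "raw scale"); qqline(residuals(m_raw))
qqnorm(residuals(m_log), main = "log scale"); qqline(residuals(m_log))

Whether the transformation is needed is then an empirical question about the model assumptions for the data at hand, not something to be settled by a slogan.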

Friday, November 12, 2021

Book: Sentence comprehension as a cognitive process: A computational approach (Vasishth and Engelmann)

 

My book with Felix Engelmann has just been published. It puts together in one place 20 years of research on retrieval models, carried out by my students, colleagues, and myself.



Sunday, October 10, 2021

New paper: When nothing goes right, go left: A large-scale evaluation of bidirectional self-paced reading

 Here's an interesting and important new paper led by the inimitable Dario Paape:

Title: When nothing goes right, go left: A large-scale evaluation of bidirectional self-paced reading

Download from: here.

Abstract

In two web-based experiments, we evaluated the bidirectional self-paced reading (BSPR) paradigm recently proposed by Paape and Vasishth (2021). We used four sentence types: NP/Z garden-path sentences, RRC garden-path sentences, sentences containing inconsistent discourse continuations, and sentences containing reflexive anaphors with feature-matching but grammatically unavailable antecedents. Our results show that regressions in BSPR are associated with a decrease in positive acceptability judgments. Across all sentence types, we observed online reading patterns that are consistent with the existing eye-tracking literature. NP/Z but not RRC garden-path sentences also showed some indication of selective rereading, as predicted by the selective reanalysis hypothesis of Frazier and Rayner (1982). However, selective rereading was associated with decreased rather than increased sentence acceptability, which is not in line with the selective reanalysis hypothesis. We discuss the implications regarding the connection between selective rereading and conscious awareness, and for the use of BSPR in general.


Thursday, September 30, 2021

New paper on the reproducibility of JML articles (2019-21) after the open data policy was introduced

 New paper by Anna Laurinavichyute and me:

The (ir)reproducibility of published analyses: A case study of 57 JML articles published between 2019 and 2021

Download from: https://psyarxiv.com/hf297/





Friday, September 17, 2021

Applications are open: 2022 summer school on stats methods for ling and psych

Applications are now open for the sixth SMLP summer school, to be held in person (hopefully) on the Griebnitzsee campus of the University of Potsdam, Germany, 12-16 Sept 2022.

Apply here: https://vasishth.github.io/smlp2022/

Saturday, August 14, 2021

SAFAL 2: The Second South Asian Forum on the Acquisition and Processing of Language (30-31 August, 2021)



Details: https://sites.google.com/view/safal2021/home

The first South Asian Forum on the Acquisition and Processing of Language (SAFAL) highlighted the need to provide a platform for showcasing and discussing acquisition and processing research in the context of South Asian languages. The second edition aims to build on this endeavour.

Following the first edition, the Second South Asian Forum on the Acquisition and Processing of Language (SAFAL) aims to provide a platform to exchange research on sentence processing, verbal semantics, computational modeling, corpus-based psycholinguistics, neurobiology of language, and child language acquisition, among others, in the context of the subcontinent's linguistic landscape.

Invited speakers:

Sakshi Bhatia is an Assistant Professor of Linguistics at the Central University of Rajasthan. Her research areas include syntax, psycholinguistics, and the syntax-psycholinguistics interface.

Kamal Choudhary is an Assistant Professor of Linguistics at the Indian Institute of Technology Ropar. His research areas include neurobiology of language and syntactic typology.

Friday, August 13, 2021

A common mistake in psychology and psycholinguistics: Part 2

A Common Mistake in Data Analysis (in Psychology/Linguistics): Subsetting data to carry out nested analyses (Part 2 of 2)

tl;dr

Subsetting data to carry out pairwise comparisons within one level of a 2-level or 3-level factor can lead to misleading results. This happens because by subsetting, we inadvertently drop variance components.

Data and code for both posts

The data and code for both posts can be accessed from:

https://github.com/vasishth/SubsettingProblem

Introduction

A very common mistake I see in psycholinguistics and psychology papers is subsetting the data in order to carry out comparisons within one level of a factor.

In this post, I will discuss the second of two examples; the first appeared here.

It is worth repeating that in both examples, there is absolutely no implication that the authors did anything dishonest—they did their analyses in good faith. The broader problem is that in psychology and linguistics, we are rarely taught much about data analysis. We usually learn a canned cookbook style of analysis. I am pretty sure I have made this mistake in the past—I remember doing this kind of subsetting; this was before I had learnt how to do contrast coding.

I also want to point out what a big deal it is that the authors released their data. In my experience, this happens in only about 25% of cases, an estimate that is remarkably close to published estimates of researchers' willingness to release data. So, kudos to these researchers for being among the top 25%.

So here is the second example. I originally analyzed this data to create some exercises for my students, but the theoretical questions addressed in this paper are inherently interesting for psycholinguists.

Example 2: Fedorenko et al 2006, in the Journal of Memory and Language

The paper we consider here is:

Fedorenko, E., Gibson, E., & Rohde, D. (2006). The nature of working memory capacity in sentence comprehension: Evidence against domain-specific working memory resources. Journal of Memory and Language, 54(4), 541-553.

This paper has been cited some 267 times at the time of writing this post, according to Google Scholar. It appears in one of our top journals, Journal of Memory and Language (JML), which apparently has an impact factor of 4.014 according to Google (I like the precision to three decimal places!). I believe this is high. By comparison, Cognition seems to be only at 3.537.

The central claim that the paper makes is summarized nicely in the abstract; I highlight the relevant part of the abstract below:

This paper reports the results of a dual-task experiment which investigates the nature of working memory resources used in sentence comprehension. Participants read sentences of varying syntactic complexity (containing subject- and object-extracted relative clauses) while remembering one or three nouns (similar to or dissimilar from the sentence-nouns). A significant on-line interaction was found between syntactic complexity and similarity between the memory-nouns and the sentence-nouns in the three memory-nouns conditions, such that the similarity between the memory-nouns and the sentence-nouns affected the more complex object-extracted relative clauses to a greater extent than the less complex subject-extracted relative clauses. These results extend Gordon, Hendrick, and Levine’s (2002) report of a trend of such an interaction. The results argue against the domain-specific view of working memory resources in sentence comprehension (Caplan & Waters, 1999).

What I will show below is that there is no evidence for the claimed interaction that is highlighted above. The only solid result in this paper is that object relatives are harder to process than subject relatives. Obviously, this is actually interesting for psycholinguists; but it is not a newsworthy result, and if that had been the main claim, JML would have rejected this paper (although Cognition seems to love papers showing that object relatives are easier or harder to process than subject relatives ;).

I want to stress at the outset that although I show that there is no evidence for the claimed interaction in the abstract above, this does not mean there is no interaction. There might still be. We just don’t know either way from this work.

The original paper and the main claims

The data are from the Fedorenko et al. (2006) paper cited above.

Ev was kind enough to give me the data; many thanks to her for this.

Design

The study is a self-paced reading experiment, with a 2x2x2 design, the examples are as follows:

  1. Factor: One noun in memory set or three nouns

easy: Joel

hard: Joel-Greg-Andy

  2. Factor: Noun type, either proper name or occupation

name: Joel-Greg-Andy

occ: poet-cartoonist-voter

  3. Factor: Relative clause type (two versions, between subjects but within items—we ignore this detail in this blog post):

Subject-extracted, version 1: The physician | who consulted the cardiologist | checked the files | in the office.
Subject-extracted, version 2: The cardiologist | who consulted the physician | checked the files | in the office.
Object-extracted, version 1: The physician | who the cardiologist consulted | checked the files | in the office.
Object-extracted, version 2: The cardiologist | who the physician consulted | checked the files | in the office.

As hinted above, this design is not exactly 2x2x2; there is also a within items manipulation of relative clause version. That detail was ignored in the published analysis, so we ignore it here as well.

The main claims in the paper

The main claim of the paper is laid out in the abstract above, but I repeat it below:

“A significant on-line interaction was found between syntactic complexity and similarity between the memory-nouns and the sentence-nouns in the three memory-nouns conditions, such that the similarity between the memory-nouns and the sentence-nouns affected the more complex object-extracted relative clauses to a greater extent than the less complex subject-extracted relative clauses.”

The General Discussion explains this further. I highlight the important parts.

“The most interesting result presented here is an interaction between syntactic complexity and the memory- noun/sentence-noun similarity during the critical region of the linguistic materials in the hard-load (three memory-nouns) conditions: people processed object-extracted relative clauses more slowly when they had to maintain a set of nouns that were similar to the nouns in the sentence than when they had to maintain a set of nouns that were dissimilar from the nouns in the sentence; in contrast, for the less complex subject-extracted relative clauses, there was no reading time difference between the similar and dissimilar memory load conditions. In the easy-load (one memory-noun) conditions, no interaction between syntactic complexity and memory-noun/sentence-noun similarity was observed. These results provide evidence against the hypothesis whereby there is a pool of domain-specific verbal working memory resources for sentence comprehension, contra Caplan and Waters (1999). Specifically, the results reported here extend the off-line results reported by Gordon et al. (2002) who—using a similar dual-task paradigm—provided evidence of an interaction between syntactic complexity and memory-noun/sentence-noun similarity in response-accuracies to questions about the content of the sentences. Although there was also a suggestion of an on-line reading time interaction in Gordon et al.’s experiment, this effect did not reach significance. The effect of memory-noun type in subject-extracted conditions was 12.7 ms per word (50.8ms over the four-word relative clause region), and the effect of memory-noun type in object-extracted conditions was 40.1 ms (160.4 ms over the four-word relative clause region). (Thanks to Bill Levine for providing these means). In the hard-load conditions of our experiment, the effect of memory-noun type in subject-extracted conditions over the four-word relative clause region is 12 ms in raw RTs (18 ms in residual RTs), and the effect of memory- noun type in object-extracted conditions is 324 ms in raw RTs (388 ms in residual RTs). This difference in effect sizes of memory-noun type between Gordon et al.’s and our experiment is plausibly responsible for the interaction observed in the current experiment, and the lack of such an interaction in Gordon et al.’s study. The current results therefore demonstrate that Gordon et al.’s results extend to on-line processing.”

The statistical evidence in the paper for the claim

Table 3 in the paper (last line) reports the critical interaction claimed for the hard-load conditions (I added the p-values, which were not provided in the table):

“Synt x noun type [hard load]: F1(1,43)=6.58, p=0.014;F2(1,31)=6.79, p=0.014”.

There is also a minF’ provided: \(minF'(1,72)=3.34, p=0.07\). I also checked the minF’ using the standard formula from a classic paper by Herb Clark:

Clark, H. H. (1973). The language-as-fixed-effect fallacy: A critique of language statistics in psychological research. Journal of Verbal Learning and Verbal Behavior, 12(4), 335-359.

[Aside: The journal “Journal of Verbal Learning and Verbal Behavior” is now called Journal of Memory and Language.]
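Written out, the formula that the function below implements is

\[
\mathrm{minF}' = \frac{F_1 F_2}{F_1 + F_2}, \quad df_2 = \frac{(F_1 + F_2)^2}{F_1^2/n_2 + F_2^2/n_1},
\]

where \(n_1\) and \(n_2\) are the denominator degrees of freedom of \(F_1\) and \(F_2\) respectively, and the minF’ statistic is evaluated against an F distribution with \(1\) and \(df_2\) degrees of freedom.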

Here is the minF’ function:

minf <- function(f1, f2, n1, n2){
  ## minF' statistic (Clark, 1973):
  fprime <- (f1 * f2) / (f1 + f2)
  ## denominator degrees of freedom for minF':
  n <- round(((f1 + f2) * (f1 + f2)) / (((f1 * f1) / n2) + ((f2 * f2) / n1)))
  ## p-value and 5% critical value for F(1, n):
  pval <- pf(fprime, 1, n, lower.tail = FALSE)
  return(paste("minF(1,", n, ")=", round(fprime, digits = 2),
               "p-value=", round(pval, 2),
               "crit=", round(qf(.95, 1, n))))
}

And here is the p-value for the minF’ calculation:

minf(6.58,6.79,43,31)
## [1] "minF(1, 72 )= 3.34 p-value= 0.07 crit= 4"

The first thing to notice here is that even the published claim (about an RC type x noun type interaction in the hard load condition) is not statistically significant by the minF’ criterion. I have heard through the grapevine that JML allows authors to treat an effect as statistically significant even if the minF’ is not significant; apparently the reason is that the minF’ calculation is considered to be conservative. But if it is going to be ignored, why JML wants people to report a minF’ is a mystery to me.

Declaring a non-significant result to be significant is a very common practice in the Journal of Memory and Language and other journals. Some other JML papers I know of that report a non-significant minF’ for their main finding, but still act as if they have significance, are:

  • Dillon et al 2013: “For total times this analysis revealed a three-way interaction that was significant by participants and marginal by items (\(F_1(1,39)=8.0, \ldots, p<0.01; F_2(1,47)=4.0, \ldots, p<0.055; minF'(1,81)=2.66, p=0.11\))”.

  • Van Dyke and McElree 2006: “The interaction of LoadxInterference was significant, both by subjects and by items. … minF’(1,90)=2.35, p = 0.13.”

Incidentally, neither of the main claims in these two studies replicated (which is not the same thing as saying that their claims are false—we don’t know whether they are false or true):

  • Replication attempt of Dillon et al 2013: Lena A. Jäger, Daniela Mertzen, Julie A. Van Dyke, and Shravan Vasishth. Interference patterns in subject-verb agreement and reflexives revisited: A large-sample study. Journal of Memory and Language, 111, 2020.
  • Replication attempt of Van Dyke and McElree 2006: Daniela Mertzen, Anna Laurinavichyute, Brian W. Dillon, Ralf Engbert, and Shravan Vasishth. Is there cross-linguistic evidence for proactive cue-based retrieval interference in sentence comprehension? Eye-tracking data from English, German and Russian. Submitted, 2021.

The second paper was rejected by JML, by the way.

Data analysis of the Fedorenko et al 2006 study

## Download and install library from here:
## https://github.com/vasishth/lingpsych
library(lingpsych)
data("df_fedorenko06")
head(df_fedorenko06)
##    rctype nountype load item subj region   RT
## 2     obj     name hard   16    1      2 4454
## 6    subj     name hard    4    1      2 5718
## 10   subj     name easy   10    1      2 2137
## 14   subj     name hard   20    1      2  903
## 18   subj     name hard   12    1      2 1837
## 22   subj      occ hard   19    1      2 1128

As explained above, we have a 2x2x2 design: rctype [obj,subj] x nountype [name, occupation] x load [hard,easy]. Region 2 is the critical region, the entire relative clause, and RT is the reading time for this region. Subject and item columns are self-explanatory.

We notice immediately that something is wrong in this data frame. Compare subject 1 vs the other subjects:

## something is wrong here:
xtabs(~subj+item,df_fedorenko06)
##     item
## subj 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28
##   1  2 2 2 2 2 2 2 2 2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2
##   2  1 1 1 1 1 1 1 1 1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1
##   3  1 1 1 1 1 1 1 1 1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1
##   4  1 1 1 1 1 1 1 1 1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1
##   5  1 1 1 1 1 1 1 1 1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1
##   6  1 1 1 1 1 1 1 1 1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1
##   7  1 1 1 1 1 1 1 1 1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1
##   8  1 1 1 1 1 1 1 1 1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1
##   9  1 1 1 1 1 1 1 1 1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1
##   11 1 1 1 1 1 1 1 1 1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1
##   12 1 1 1 1 1 1 1 1 1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1
##   13 1 1 1 1 1 1 1 1 1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1
##   14 1 1 1 1 1 1 1 1 1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1
##   15 1 1 1 1 1 1 1 1 1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1
##   16 1 1 1 1 1 1 1 1 1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1
##   17 1 1 1 1 1 1 1 1 1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1
##   18 1 1 1 1 1 1 1 1 1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1
##   19 1 1 1 1 1 1 1 1 1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1
##   20 1 1 1 1 1 1 1 1 1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1
##   21 1 1 1 1 1 1 1 1 1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1
##   22 1 1 1 1 1 1 1 1 1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1
##   23 1 1 1 1 1 1 1 1 1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1
##   24 1 1 1 1 1 1 1 1 1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1
##   25 1 1 1 1 1 1 1 1 1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1
##   26 1 1 1 1 1 1 1 1 1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1
##   28 1 1 1 1 1 1 1 1 1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1
##   29 1 1 1 1 1 1 1 1 1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1
##   30 1 1 1 1 1 1 1 1 1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1
##   31 1 1 1 1 1 1 1 1 1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1
##   32 1 1 1 1 1 1 1 1 1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1
##   33 1 1 1 1 1 1 1 1 1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1
##   34 1 1 1 1 1 1 1 1 1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1
##   35 1 1 1 1 1 1 1 1 1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1
##   36 1 1 1 1 1 1 1 1 1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1
##   37 1 1 1 1 1 1 1 1 1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1
##   38 1 1 1 1 1 1 1 1 1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1
##   39 1 1 1 1 1 1 1 1 1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1
##   40 1 1 1 1 1 1 1 1 1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1
##   41 1 1 1 1 1 1 1 1 1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1
##   42 1 1 1 1 1 1 1 1 1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1
##   43 1 1 1 1 1 1 1 1 1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1
##   44 1 1 1 1 1 1 1 1 1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1
##   46 1 1 1 1 1 1 1 1 1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1
##   47 1 1 1 1 1 1 1 1 1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1
##     item
## subj 29 30 31 32
##   1   2  2  2  2
##   2   1  1  1  1
##   3   1  1  1  1
##   4   1  1  1  1
##   5   1  1  1  1
##   6   1  1  1  1
##   7   1  1  1  1
##   8   1  1  1  1
##   9   1  1  1  1
##   11  1  1  1  1
##   12  1  1  1  1
##   13  1  1  1  1
##   14  1  1  1  1
##   15  1  1  1  1
##   16  1  1  1  1
##   17  1  1  1  1
##   18  1  1  1  1
##   19  1  1  1  1
##   20  1  1  1  1
##   21  1  1  1  1
##   22  1  1  1  1
##   23  1  1  1  1
##   24  1  1  1  1
##   25  1  1  1  1
##   26  1  1  1  1
##   28  1  1  1  1
##   29  1  1  1  1
##   30  1  1  1  1
##   31  1  1  1  1
##   32  1  1  1  1
##   33  1  1  1  1
##   34  1  1  1  1
##   35  1  1  1  1
##   36  1  1  1  1
##   37  1  1  1  1
##   38  1  1  1  1
##   39  1  1  1  1
##   40  1  1  1  1
##   41  1  1  1  1
##   42  1  1  1  1
##   43  1  1  1  1
##   44  1  1  1  1
##   46  1  1  1  1
##   47  1  1  1  1

Subject 1 saw all items twice! There was apparently a bug in the code, or maybe two different subjects were mis-labeled as subject 1. Hard to know. This is one more reason why one should not load and analyze data immediately; always check that the design was correctly implemented.

We could remove subject 1 from this data, leaving us with 43 subjects:

df_fedorenko06<-subset(df_fedorenko06,subj!=1)
sort(unique(df_fedorenko06$subj))
length(unique(df_fedorenko06$subj))

But we won’t do this yet.

A full 2x2x2 ANOVA analysis:

First, set up an ANOVA style coding for the data, in lmer:

library(lme4)
## Loading required package: Matrix
df_fedorenko06$noun<-ifelse(df_fedorenko06$nountype=="name",1/2,-1/2)
df_fedorenko06$ld<-ifelse(df_fedorenko06$load=="hard",1/2,-1/2)
df_fedorenko06$rc<-ifelse(df_fedorenko06$rctype=="obj",1/2,-1/2)
df_fedorenko06$nounxld<-df_fedorenko06$noun*df_fedorenko06$ld*2
df_fedorenko06$nounxrc<-df_fedorenko06$noun*df_fedorenko06$rc*2
df_fedorenko06$ldxrc<-df_fedorenko06$ld*df_fedorenko06$rc*2
df_fedorenko06$nounxldxrc<-df_fedorenko06$noun*df_fedorenko06$ld*df_fedorenko06$rc*4

I am using effect coding here (\(\pm 1/2\)), so that every estimate reflects the estimated difference in means for the particular comparison being made. When multiplying to get an interaction, I have to multiply with a factor of 2 or 4 to get the effect coding back; if I don’t do that I will get \(\pm 1/4\) or \(\pm 1/8\).
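A quick sanity check on the scaling, using the columns just defined:

## The product of two ±1/2 predictors is ±1/4; multiplying by 2 restores
## the ±1/2 scale used for the main effects:
with(df_fedorenko06, unique(noun * ld))
with(df_fedorenko06, unique(noun * ld * 2))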

I’m using effect coding because I was interested in getting an estimate of the relative clause effect for our Bayesian textbook’s chapter 6, on prior self-elicitation.

The data frame now has new columns representing the contrast coding:

head(df_fedorenko06)
##    rctype nountype load item subj region   RT noun   ld   rc nounxld nounxrc
## 2     obj     name hard   16    1      2 4454  0.5  0.5  0.5     0.5     0.5
## 6    subj     name hard    4    1      2 5718  0.5  0.5 -0.5     0.5    -0.5
## 10   subj     name easy   10    1      2 2137  0.5 -0.5 -0.5    -0.5    -0.5
## 14   subj     name hard   20    1      2  903  0.5  0.5 -0.5     0.5    -0.5
## 18   subj     name hard   12    1      2 1837  0.5  0.5 -0.5     0.5    -0.5
## 22   subj      occ hard   19    1      2 1128 -0.5  0.5 -0.5    -0.5     0.5
##    ldxrc nounxldxrc
## 2    0.5        0.5
## 6   -0.5       -0.5
## 10   0.5        0.5
## 14  -0.5       -0.5
## 18  -0.5       -0.5
## 22  -0.5        0.5

Display the estimated means in each condition, just to get some idea of what’s coming.

round(with(df_fedorenko06,tapply(RT,nountype,mean)))
## name  occ 
## 1636 1754
round(with(df_fedorenko06,tapply(RT,load,mean)))
## easy hard 
## 1608 1782
round(with(df_fedorenko06,tapply(RT,rctype,mean)))
##  obj subj 
## 1909 1482

The authors report both raw and residual RTs; here, we analyze raw RTs first and then log RTs. There’s actually no point in computing residual RTs (I will discuss this at some point in a different blog post), so we ignore the residuals analysis; in any case, nothing much would change in my main points below even if we were to use residual RTs.

I fit the most complex model I could. If I were a Bayesian (which I am), I would have done the full Monty just because I can, but never mind.

## raw RTs
m1<-lmer(RT~noun+ld+rc+nounxld + nounxrc + ldxrc + nounxldxrc+(1+ld+rc+ nounxrc + nounxldxrc||subj) + (1+ld+rc+ nounxrc + nounxldxrc||item),df_fedorenko06)
#summary(m1)

## log RTs
m2<-lmer(log(RT)~noun+ld+rc+nounxld + nounxrc + ldxrc + nounxldxrc+(1+ld+ nounxrc||subj) + (1+ld+rc||item),df_fedorenko06)
#summary(m2)

Here is the summary output from the two models:

## raw RTs:
summary(m1)$coefficients
##               Estimate Std. Error    t value
## (Intercept) 1675.79757  125.88880 13.3117286
## noun        -116.07053   58.60273 -1.9806337
## ld           155.14812  112.39971  1.3803249
## rc           422.73102   77.84215  5.4306182
## nounxld     -103.39623   58.60364 -1.7643312
## nounxrc     -110.87780   61.73703 -1.7959692
## ldxrc        -23.47607   58.62151 -0.4004685
## nounxldxrc   -95.76489   66.95281 -1.4303340
## log RTs:
summary(m2)$coefficients
##                Estimate Std. Error     t value
## (Intercept)  7.20041271 0.06585075 109.3444271
## noun        -0.03127831 0.02457298  -1.2728741
## ld           0.01532113 0.04530999   0.3381403
## rc           0.21828914 0.02812620   7.7610591
## nounxld     -0.04645901 0.02457339  -1.8906222
## nounxrc     -0.02132791 0.02564043  -0.8318078
## ldxrc       -0.03704690 0.02457244  -1.5076607
## nounxldxrc  -0.01944255 0.02462971  -0.7893945

Both the models only show that object relatives are harder to process than subject relatives. No other main effect or interactions pan out. This (non-)result is actually consistent with what’s reported in the paper.

For the story about the hard load conditions (discussed below) to hold up, a prerequisite is that the nounxldxrc interaction be significant, but it is not.

Subset analyses using repeated measures ANOVA/paired t-test

The Syntax x noun type interaction in the hard-load conditions, reported in the paper as the main result, was computed by first subsetting the data to the hard-load conditions only, which reduces the design to a 2x2. Let’s see what happens when we do this analysis by subsetting the data. Note that we have dropped half the data:

dim(df_fedorenko06)
## [1] 1440   14
fed06hard<-subset(df_fedorenko06,load=="hard")
dim(fed06hard)
## [1] 720  14
(1440-720)/1440
## [1] 0.5

Let’s first do a repeated measures ANOVA (=paired t-tests) on raw RTs by subject and by item, because that is what Fedorenko et al did (well, they used residual RTs, but nothing will change if we use that dependent measure).

## aggregate by subject:
fedsubjRC<-aggregate(RT~subj+rctype,mean,data=fed06hard)
head(fedsubjRC)
##   subj rctype       RT
## 1    1    obj 3387.500
## 2    2    obj  892.125
## 3    3    obj 1373.000
## 4    4    obj 1738.625
## 5    5    obj 7074.000
## 6    6    obj 3119.625
t.test(RT~rctype,paired=TRUE,fedsubjRC)
## 
##  Paired t-test
## 
## data:  RT by rctype
## t = 3.5883, df = 43, p-value = 0.0008469
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  176.7331 630.3151
## sample estimates:
## mean of the differences 
##                403.5241
fedsubjN<-aggregate(RT~subj+nountype,mean,data=fed06hard)
head(fedsubjN)
##   subj nountype       RT
## 1    1     name 2971.500
## 2    2     name  928.875
## 3    3     name 1161.500
## 4    4     name 1414.125
## 5    5     name 4472.250
## 6    6     name 1831.875
t.test(RT~nountype,paired=TRUE,fedsubjN)
## 
##  Paired t-test
## 
## data:  RT by nountype
## t = -2.6432, df = 43, p-value = 0.01141
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -383.91794  -51.61331
## sample estimates:
## mean of the differences 
##               -217.7656
## interaction: diff between obj and subj in name vs in occ conditions:
RCocc<-aggregate(RT~subj+rctype,mean,data=subset(fed06hard,nountype=="occ"))
diff_occ<-subset(RCocc,rctype=="obj")$RT-subset(RCocc,rctype=="subj")$RT

RCname<-aggregate(RT~subj+rctype,mean,data=subset(fed06hard,nountype=="name"))
diff_name<-subset(RCname,rctype=="obj")$RT-subset(RCname,rctype=="subj")$RT

## by-subject interaction:
t.test(diff_occ,diff_name,paired=TRUE)
## 
##  Paired t-test
## 
## data:  diff_occ and diff_name
## t = 2.131, df = 43, p-value = 0.03885
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##   21.71398 787.90534
## sample estimates:
## mean of the differences 
##                404.8097

One can of course do this with an ANOVA function:

fedsubjANOVA<-aggregate(RT~subj+rctype+nountype,mean,data=fed06hard)
head(fedsubjANOVA)
##   subj rctype nountype      RT
## 1    1    obj     name 3030.25
## 2    2    obj     name  962.50
## 3    3    obj     name 1087.50
## 4    4    obj     name 1687.00
## 5    5    obj     name 5006.75
## 6    6    obj     name 1064.50
library(rstatix)
## 
## Attaching package: 'rstatix'
## The following object is masked from 'package:stats':
## 
##     filter
subj_anova_hard<-anova_test(data = fedsubjANOVA, 
           dv = RT, 
           wid = subj,
           within = c(rctype,nountype)
  )
get_anova_table(subj_anova_hard)
## ANOVA Table (type III tests)
## 
##            Effect DFn DFd      F        p p<.05   ges
## 1          rctype   1  43 12.876 0.000847     * 0.031
## 2        nountype   1  43  6.986 0.011000     * 0.009
## 3 rctype:nountype   1  43  4.541 0.039000     * 0.008

We see a significant interaction by subjects, as reported in the paper.

By items:

## aggregate by item:
feditemRC<-aggregate(RT~item+rctype,mean,data=fed06hard)
head(feditemRC)
##   item rctype       RT
## 1    1    obj 1978.545
## 2    2    obj 2076.667
## 3    3    obj 3483.273
## 4    4    obj 2489.000
## 5    5    obj 1322.091
## 6    6    obj 1268.800
t.test(RT~rctype,paired=TRUE,feditemRC)
## 
##  Paired t-test
## 
## data:  RT by rctype
## t = 3.9827, df = 31, p-value = 0.0003833
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  194.5260 602.8663
## sample estimates:
## mean of the differences 
##                398.6962
feditemN<-aggregate(RT~item+nountype,mean,data=fed06hard)
head(feditemN)
##   item nountype       RT
## 1    1     name 1573.727
## 2    2     name 1566.455
## 3    3     name 1580.364
## 4    4     name 2808.667
## 5    5     name 1251.000
## 6    6     name 1261.818
t.test(RT~nountype,paired=TRUE,feditemN)
## 
##  Paired t-test
## 
## data:  RT by nountype
## t = -1.6978, df = 31, p-value = 0.09957
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -477.39928   43.65591
## sample estimates:
## mean of the differences 
##               -216.8717
## interaction: diff between obj and item in name vs in occ conditions:
RCocc<-aggregate(RT~item+rctype,mean,data=subset(fed06hard,nountype=="occ"))
diff_occ<-subset(RCocc,rctype=="obj")$RT-subset(RCocc,rctype=="subj")$RT

RCname<-aggregate(RT~item+rctype,mean,data=subset(fed06hard,nountype=="name"))
diff_name<-subset(RCname,rctype=="obj")$RT-subset(RCname,rctype=="subj")$RT

## by-item interaction:
t.test(diff_occ,diff_name,paired=TRUE)
## 
##  Paired t-test
## 
## data:  diff_occ and diff_name
## t = 2.0274, df = 31, p-value = 0.05129
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##   -2.45258 823.20437
## sample estimates:
## mean of the differences 
##                410.3759

Using ANOVA:

feditemANOVA<-aggregate(RT~item+rctype+nountype,mean,data=fed06hard)
head(feditemANOVA)
##   item rctype nountype       RT
## 1    1    obj     name 2206.000
## 2    2    obj     name 1946.667
## 3    3    obj     name 1966.167
## 4    4    obj     name 2910.600
## 5    5    obj     name 1011.500
## 6    6    obj     name 1496.200
item_anova_hard<-anova_test(data = feditemANOVA, 
           dv = RT, 
           wid = item,
           within = c(rctype,nountype)
  )
get_anova_table(item_anova_hard)
## ANOVA Table (type III tests)
## 
##            Effect DFn DFd      F        p p<.05   ges
## 1          rctype   1  31 14.802 0.000557     * 0.085
## 2        nountype   1  31  2.867 0.100000       0.028
## 3 rctype:nountype   1  31  4.110 0.051000       0.023

Notice that even though the by-subjects interaction is significant (and the by-items interaction marginal), the MinF’ is not. As Clark points out in his paper, the MinF’ is what needs to be computed in order to draw inferences about the hypothesis test.

Since \(F=t^2\), we can compute the MinF’.

F1<-2.13^2
F2<-2.03^2
minf(F1,F2,43,31)
## [1] "minF(1, 71 )= 2.16 p-value= 0.15 crit= 4"

We can do this MinF’ calculation with the ANOVA output as well:

minf(4.541,4.11,43,31)
## [1] "minF(1, 71 )= 2.16 p-value= 0.15 crit= 4"

We get exactly the same MinF’ in both the t-tests and ANOVA because they are doing the same test.

The upshot of the ANOVA/t-test analyses using subsetted data

The critical interaction is not significant once the MinF’ is taken into account. This diverges from the main claim in the paper.

Now, let’s do the same by-subjects and items analyses using logRTs:

fed06hard$logRT<-log(fed06hard$RT)
## aggregate by subject:
fedsubjRC<-aggregate(logRT~subj+rctype,mean,data=fed06hard)
head(fedsubjRC)
##   subj rctype    logRT
## 1    1    obj 7.980754
## 2    2    obj 6.710152
## 3    3    obj 7.125770
## 4    4    obj 7.179976
## 5    5    obj 8.625057
## 6    6    obj 7.569662
t.test(logRT~rctype,paired=TRUE,fedsubjRC)
## 
##  Paired t-test
## 
## data:  logRT by rctype
## t = 4.6431, df = 43, p-value = 3.227e-05
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  0.1024094 0.2596785
## sample estimates:
## mean of the differences 
##               0.1810439
fedsubjN<-aggregate(logRT~subj+nountype,mean,data=fed06hard)
head(fedsubjN)
##   subj nountype    logRT
## 1    1     name 7.835673
## 2    2     name 6.740148
## 3    3     name 6.988723
## 4    4     name 7.167295
## 5    5     name 8.249557
## 6    6     name 7.142238
t.test(logRT~nountype,paired=TRUE,fedsubjN)
## 
##  Paired t-test
## 
## data:  logRT by nountype
## t = -2.2556, df = 43, p-value = 0.02924
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.151468895 -0.008471231
## sample estimates:
## mean of the differences 
##             -0.07997006
## interaction: diff between obj and subj in name vs in occ conditions:
RCocc<-aggregate(logRT~subj+rctype,mean,data=subset(fed06hard,nountype=="occ"))
diff_occ<-subset(RCocc,rctype=="obj")$logRT-subset(RCocc,rctype=="subj")$logRT

RCname<-aggregate(logRT~subj+rctype,mean,data=subset(fed06hard,nountype=="name"))
diff_name<-subset(RCname,rctype=="obj")$logRT-subset(RCname,rctype=="subj")$logRT

## by-subject interaction:
t.test(diff_occ,diff_name,paired=TRUE)
## 
##  Paired t-test
## 
## data:  diff_occ and diff_name
## t = 1.0226, df = 43, p-value = 0.3122
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.07614228  0.23279035
## sample estimates:
## mean of the differences 
##              0.07832404

Now the by-subjects t-test for the interaction is nowhere near significant (cf. the raw RT analysis above). This suggests that some extreme values in the raw reading times were driving the effect in the by-subjects data.
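As a rough check, one can compare the largest raw reading times in the hard-load data to the median:

## The raw RTs have a long right tail; a handful of very large values can
## dominate by-subject means computed on the raw (ms) scale:
sort(fed06hard$RT, decreasing = TRUE)[1:5]
median(fed06hard$RT)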

By items:

## aggregate by item:
feditemRC<-aggregate(logRT~item+rctype,mean,data=fed06hard)
head(feditemRC)
##   item rctype    logRT
## 1    1    obj 7.421371
## 2    2    obj 7.339790
## 3    3    obj 7.562565
## 4    4    obj 7.590270
## 5    5    obj 7.016257
## 6    6    obj 7.031505
t.test(logRT~rctype,paired=TRUE,feditemRC)
## 
##  Paired t-test
## 
## data:  logRT by rctype
## t = 3.6988, df = 31, p-value = 0.0008374
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  0.08105886 0.28033104
## sample estimates:
## mean of the differences 
##                0.180695
feditemN<-aggregate(logRT~item+nountype,mean,data=fed06hard)
head(feditemN)
##   item nountype    logRT
## 1    1     name 7.216759
## 2    2     name 6.991090
## 3    3     name 7.192081
## 4    4     name 7.707697
## 5    5     name 7.018699
## 6    6     name 6.981104
t.test(logRT~nountype,paired=TRUE,feditemN)
## 
##  Paired t-test
## 
## data:  logRT by nountype
## t = -1.4239, df = 31, p-value = 0.1645
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.18854611  0.03351487
## sample estimates:
## mean of the differences 
##             -0.07751562
## interaction: diff between obj and item in name vs in occ conditions:
RCocc<-aggregate(logRT~item+rctype,mean,data=subset(fed06hard,nountype=="occ"))
diff_occ<-subset(RCocc,rctype=="obj")$logRT-subset(RCocc,rctype=="subj")$logRT

RCname<-aggregate(logRT~item+rctype,mean,data=subset(fed06hard,nountype=="name"))
diff_name<-subset(RCname,rctype=="obj")$logRT-subset(RCname,rctype=="subj")$logRT

## by-item interaction:
t.test(diff_occ,diff_name,paired=TRUE)
## 
##  Paired t-test
## 
## data:  diff_occ and diff_name
## t = 0.93122, df = 31, p-value = 0.3589
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.09504043  0.25475215
## sample estimates:
## mean of the differences 
##              0.07985586

Here, too, the interaction by items is not significant.

Conclusion: the MinF’ values show that there is no evidence for an interaction in the hard conditions, regardless of whether we use raw or log RTs.

But there is actually more to this story, as I show below with linear mixed models analyses.

Incorrect and correct lmer analysis

Now, we do the same analyses as above, but in lmer instead of paired t-tests, and by subsetting the data as Fedorenko did.

In order to compare this model (m4) with the correct analysis (model m6 below), we change the coding to be \(\pm 1\) instead of \(\pm 1/2\).

## 2x2 anova by doing subsetting: this is the main result in the paper
fed06hard<-subset(df_fedorenko06,load=="hard")
fed06hard$noun<-fed06hard$noun*2
fed06hard$rc<-fed06hard$rc*2
fed06hard$nounxrc<-fed06hard$nounxrc*2
fed06hard<-fed06hard[,c("subj","item","RT","noun","rc","nounxrc")]
head(fed06hard)
##    subj item   RT noun rc nounxrc
## 2     1   16 4454    1  1       1
## 6     1    4 5718    1 -1      -1
## 14    1   20  903    1 -1      -1
## 18    1   12 1837    1 -1      -1
## 22    1   19 1128   -1 -1       1
## 26    1   31 2742   -1  1      -1

Here is the analysis using subsetting (this is incorrect):

m4<-lmer(RT~noun+rc+nounxrc+(1+rc+nounxrc||subj) + (1+noun+nounxrc||item),fed06hard)
summary(m4)
## Linear mixed model fit by REML ['lmerMod']
## Formula: RT ~ noun + rc + nounxrc + ((1 | subj) + (0 + rc | subj) + (0 +  
##     nounxrc | subj)) + ((1 | item) + (0 + noun | item) + (0 +  
##     nounxrc | item))
##    Data: fed06hard
## 
## REML criterion at convergence: 12401.9
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -3.0317 -0.4256 -0.1341  0.2131 11.6983 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev.
##  subj     (Intercept)  915329   956.73 
##  subj.1   rc            42511   206.18 
##  subj.2   nounxrc        1570    39.62 
##  item     (Intercept)   56296   237.27 
##  item.1   noun           4123    64.21 
##  item.2   nounxrc       11841   108.82 
##  Residual             1540203  1241.05 
## Number of obs: 720, groups:  subj, 44; item, 32
## 
## Fixed effects:
##             Estimate Std. Error t value
## (Intercept)  1752.09     157.25  11.142
## noun         -110.39      47.65  -2.317
## rc            200.80      55.86   3.595
## nounxrc      -103.48      50.56  -2.047
## 
## Correlation of Fixed Effects:
##         (Intr) noun  rc   
## noun    0.000             
## rc      0.000  0.000      
## nounxrc 0.000  0.000 0.000
summary(m4)$coefficients
##              Estimate Std. Error   t value
## (Intercept) 1752.0945  157.24556 11.142410
## noun        -110.3885   47.64559 -2.316866
## rc           200.8043   55.85609  3.595031
## nounxrc     -103.4786   50.56305 -2.046527

The above analysis shows a significant interaction. But this analysis is incorrect because it’s dropping half the data, which contributes to misestimating the variance components.

Next, we will do the above analysis using all the data (no subsetting to the load=="hard" conditions). We do this by defining nested contrasts.

The design is:

## Nested analysis:
## load       e   e   e   e   h   h   h   h
## noun       o   o   n   n   o   o   n   n
## rctype     s   o   s   o   s   o   s   o
##            a   b   c   d   e   f   g   h
## -----------------------------------------
## load      -1  -1  -1  -1   1   1   1   1
## nn_easy   -1  -1   1   1   0   0   0   0
## rc_easy   -1   1  -1   1   0   0   0   0
## nn_hard    0   0   0   0  -1  -1   1   1
## rc_hard    0   0   0   0  -1   1  -1   1

Define the nested coding. This time I use \(\pm 1\) coding; this is just in order to facilitate quick model specification by multiplying main effects in the lmer equation.

I could have used \(\pm 1/2\); the only thing that would buy me is that the estimated coefficients would directly reflect the differences in means, but then I would need to define the interactions as separate columns (as done above). The statistical results will not change as a function of these two coding schemes.
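A quick illustrative check of this point: rescaling a \(\pm 1/2\) predictor to \(\pm 1\) halves the estimate and its standard error but leaves the t-value unchanged.

## noun is coded ±1/2 above; I(2 * noun) recodes it to ±1. The coefficient
## and its SE are halved under the ±1 coding, but the t-value is identical:
coef(summary(lm(RT ~ noun, df_fedorenko06)))
coef(summary(lm(RT ~ I(2 * noun), df_fedorenko06)))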

df_fedorenko06$ld<-ifelse(df_fedorenko06$load=="hard",1,-1)
df_fedorenko06$nn_easy<-ifelse(df_fedorenko06$load=="easy" & df_fedorenko06$nountype=="occ",-1,
                               ifelse(df_fedorenko06$load=="easy" & df_fedorenko06$nountype=="name",1,0))
df_fedorenko06$nn_hard<-ifelse(df_fedorenko06$load=="hard" & df_fedorenko06$nountype=="occ",-1,
                               ifelse(df_fedorenko06$load=="hard" & df_fedorenko06$nountype=="name",1,0))

df_fedorenko06$rc_easy<-ifelse(df_fedorenko06$load=="easy" & df_fedorenko06$rctype=="subj",-1,
                               ifelse(df_fedorenko06$load=="easy" & df_fedorenko06$rctype=="obj",1,0))
df_fedorenko06$rc_hard<-ifelse(df_fedorenko06$load=="hard" & df_fedorenko06$rctype=="subj",-1,
                               ifelse(df_fedorenko06$load=="hard" & df_fedorenko06$rctype=="obj",1,0))
m6<-lmer(RT~ld + nn_hard*rc_hard + nn_easy*rc_easy +
           (1+ld + nn_hard:rc_hard ||subj) + 
           (1+ld + nn_hard:rc_hard + nn_easy||item),df_fedorenko06)
summary(m6)
## Linear mixed model fit by REML ['lmerMod']
## Formula: RT ~ ld + nn_hard * rc_hard + nn_easy * rc_easy + ((1 | subj) +  
##     (0 + ld | subj) + (0 + nn_hard:rc_hard | subj)) + ((1 | item) +  
##     (0 + ld | item) + (0 + nn_easy | item) + (0 + nn_hard:rc_hard |      item))
##    Data: df_fedorenko06
## 
## REML criterion at convergence: 24455.1
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -2.7050 -0.4408 -0.1223  0.2148 13.3538 
## 
## Random effects:
##  Groups   Name            Variance Std.Dev.
##  subj     (Intercept)      611484   781.97 
##  subj.1   ld                79352   281.70 
##  subj.2   nn_hard:rc_hard   20333   142.59 
##  item     (Intercept)       36175   190.20 
##  item.1   ld                16443   128.23 
##  item.2   nn_easy            1915    43.76 
##  item.3   nn_hard:rc_hard   25223   158.82 
##  Residual                 1252751  1119.26 
## Number of obs: 1440, groups:  subj, 44; item, 32
## 
## Fixed effects:
##                 Estimate Std. Error t value
## (Intercept)     1675.796    126.128  13.286
## ld                77.641     56.535   1.373
## nn_hard         -110.046     41.745  -2.636
## rc_hard          200.372     41.749   4.799
## nn_easy           -6.734     42.434  -0.159
## rc_easy          224.046     41.749   5.366
## nn_hard:rc_hard -103.208     54.818  -1.883
## nn_easy:rc_easy   -8.443     41.821  -0.202
## 
## Correlation of Fixed Effects:
##             (Intr) ld    nn_hrd rc_hrd nn_esy rc_esy nn_h:_
## ld          0.000                                          
## nn_hard     0.000  0.000                                   
## rc_hard     0.000  0.000 0.000                             
## nn_easy     0.000  0.000 0.000  0.000                      
## rc_easy     0.000  0.000 0.000  0.000  0.000               
## nn_hrd:rc_h 0.000  0.000 0.000  0.000  0.000  0.000        
## nn_sy:rc_sy 0.000  0.000 0.000  0.000  0.000  0.000  0.000
#qqPlot(residuals(m6))
#summary(m6)$coefficients

## compare with a null model,  without the 
## nn_hard:rc_hard  interaction:
m6null<-lmer(RT~ld + nn_hard+rc_hard + nn_easy*rc_easy +(1+ld + nn_hard:rc_hard ||subj) + (1+ld + nn_hard:rc_hard + nn_easy||item),df_fedorenko06)

anova(m6,m6null)
## refitting model(s) with ML (instead of REML)
## Data: df_fedorenko06
## Models:
## m6null: RT ~ ld + nn_hard + rc_hard + nn_easy * rc_easy + ((1 | subj) + (0 + ld | subj) + (0 + nn_hard:rc_hard | subj)) + ((1 | item) + (0 + ld | item) + (0 + nn_easy | item) + (0 + nn_hard:rc_hard | item))
## m6: RT ~ ld + nn_hard * rc_hard + nn_easy * rc_easy + ((1 | subj) + (0 + ld | subj) + (0 + nn_hard:rc_hard | subj)) + ((1 | item) + (0 + ld | item) + (0 + nn_easy | item) + (0 + nn_hard:rc_hard | item))
##        npar   AIC   BIC logLik deviance  Chisq Df Pr(>Chisq)  
## m6null   15 24566 24645 -12268    24536                       
## m6       16 24565 24649 -12266    24533 3.4482  1    0.06332 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Removing subject 1 weakens the already non-significant interaction:

m6a<-lmer(RT~ld + nn_hard*rc_hard + nn_easy*rc_easy +(1+ld + nn_hard:rc_hard ||subj) + (1+ld + nn_hard:rc_hard + nn_easy||item),subset(df_fedorenko06,subj!=1))
#summary(m6a)
#qqPlot(residuals(m6))
#summary(m6)$coefficients

## compare with a null model,  without the 
## nn_hard:rc_hard  interaction:
m6anull<-lmer(RT~ld + nn_hard + rc_hard + nn_easy*rc_easy +
                (1+ld + nn_hard:rc_hard ||subj) +
                (1+ld + nn_hard:rc_hard + nn_easy||item),
              subset(df_fedorenko06,subj!=1))

anova(m6a,m6anull)
## refitting model(s) with ML (instead of REML)
## Data: subset(df_fedorenko06, subj != 1)
## Models:
## m6anull: RT ~ ld + nn_hard + rc_hard + nn_easy * rc_easy + ((1 | subj) + (0 + ld | subj) + (0 + nn_hard:rc_hard | subj)) + ((1 | item) + (0 + ld | item) + (0 + nn_easy | item) + (0 + nn_hard:rc_hard | item))
## m6a: RT ~ ld + nn_hard * rc_hard + nn_easy * rc_easy + ((1 | subj) + (0 + ld | subj) + (0 + nn_hard:rc_hard | subj)) + ((1 | item) + (0 + ld | item) + (0 + nn_easy | item) + (0 + nn_hard:rc_hard | item))
##         npar   AIC   BIC logLik deviance  Chisq Df Pr(>Chisq)  
## m6anull   15 23385 23463 -11677    23355                       
## m6a       16 23384 23467 -11676    23352 3.0927  1    0.07864 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Notice that the interaction (nn_hard:rc_hard) is no longer significant once we take the full data into account (t = -1.88; the likelihood ratio test above gives p = 0.06).

## subsetted data:
summary(m4)$coefficients[4,]
##    Estimate  Std. Error     t value 
## -103.478646   50.563049   -2.046527
## full data:
summary(m6)$coefficients[7,]
##    Estimate  Std. Error     t value 
## -103.208049   54.817661   -1.882752

What has changed is the standard error of the interaction: in m4 it is 51 ms, in m6 it is 55 ms. The increase in the SE is due to the additional variance components in the full model with the nested contrast coding. The subsetted analyses (m4, and the paired t-tests) hide important sources of variance and give a misleading result.

Here are the variance components in model m4 (the subsetted data):

Random effects:
 Groups   Name        Variance Std.Dev.
 subj     (Intercept)  915329   956.73 
 subj.1   rc            42511   206.18 
 subj.2   nounxrc        1570    39.62 
 item     (Intercept)   56296   237.27 
 item.1   noun           4123    64.21 
 item.2   nounxrc       11841   108.82 
 Residual             1540203  1241.05 

Here are the variance components in the full model:

Random effects:
 Groups   Name            Variance Std.Dev.
 subj     (Intercept)      611472   781.97 
 subj.1   ld               317412   563.39 
 subj.2   nn_hard:rc_hard   20321   142.55 
 item     (Intercept)       36172   190.19 
 item.1   ld                65756   256.43 
 item.2   nn_easy            1912    43.73 
 item.3   nn_hard:rc_hard   25229   158.84 
 Residual                 1252759  1119.27 
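
As an aside, these variance components can also be pulled out directly with lme4's VarCorr() instead of being read off the summary() output. A minimal sketch (m4 is the subsetted-data model fit earlier in this post):

print(VarCorr(m4), comp=c("Variance","Std.Dev."))
print(VarCorr(m6), comp=c("Variance","Std.Dev."))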

Notice that there are still some problems in these models:

library(car)
## Loading required package: carData
qqPlot(residuals(m6))

## 2826  546 
##  707  137
boxplot(RT~rctype,df_fedorenko06)

What if we limit reading times to 10 seconds? What percentage of the data would we lose? It turns out to be very little: about 0.3%:

nfull<-dim(df_fedorenko06)[1]
nfull
## [1] 1440
n10000<-dim(subset(df_fedorenko06,RT<10000))[1]
100*(nfull-n10000)/nfull
## [1] 0.2777778
m7<-lmer(RT~ld + nn_hard*rc_hard + nn_easy*rc_easy +
           (1+ld + nn_hard:rc_hard ||subj) +
           (1+ld + nn_hard:rc_hard + nn_easy||item),
         subset(df_fedorenko06,RT<10000))
## boundary (singular) fit: see ?isSingular
summary(m7)$coefficients
##                    Estimate Std. Error    t value
## (Intercept)     1641.543453  116.01909 14.1489087
## ld                75.945797   47.41424  1.6017508
## nn_hard          -75.395584   33.93436 -2.2218064
## rc_hard          165.834591   33.92088  4.8888646
## nn_easy            6.713027   33.89550  0.1980507
## rc_easy          192.150808   33.91198  5.6661624
## nn_hard:rc_hard  -67.536307   51.79075 -1.3040226
## nn_easy:rc_easy    5.684016   33.98754  0.1672382

Notice that the interaction has shrunk from -103 ms (SE 55) in model m6 to -68 ms (SE 52) in m7. This means that a very small number of influential data points (0.3% of the data) was driving the much larger estimate.
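
To see exactly which observations these are, we can inspect the rows above the 10-second cutoff. A minimal sketch, using the column names that appear elsewhere in this post:

## the handful of rows with RT >= 10000 ms:
extreme<-subset(df_fedorenko06,RT>=10000)
extreme[,c("subj","item","load","nountype","rctype","RT")]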

Visualize these extreme values:

boxplot(RT~rctype+nountype,df_fedorenko06)

Plot the data without 0.3% of the extreme data:

boxplot(RT~rctype+nountype,subset(df_fedorenko06,RT<10000))

On the log ms scale, the extreme values are down-weighted:

boxplot(log(RT)~rctype+nountype,df_fedorenko06)

This means we can analyze the log ms data without deleting any extreme values:

m8<-lmer(log(RT)~ld + nn_hard*rc_hard + nn_easy*rc_easy +
           (1+ld + nn_hard:rc_hard ||subj) +
           (1+ld + nn_hard:rc_hard||item),df_fedorenko06)
summary(m8)$coefficients
qqPlot(residuals(m8))

Basically, there is no hint of an interaction between noun type and RC type in the hard conditions, contrary to the claim in the paper. This was the only finding in the paper, and it is not supported by the data. The key problem was subsetting the data, which inadvertently conceals important sources of variance.

Conclusion

If you subset the data to analyze effects within one level of a two- or three-level factor, you will usually get misleading results. The reason is that, by subsetting the data, you artificially reduce or misestimate the different sources of variance.

The scientific consequence of this subsetting error is that we have drawn a misleading conclusion: we think we have evidence for an effect of working memory load on sentence processing difficulty. In fact, the data could be seen as consistent with exactly the opposite claim, the one made by Caplan and Waters (1999). Does this mean that the claim in the Fedorenko et al. paper was wrong? No! The claim may well be true. We just don't know from these data.

Simulation demonstrating the subsetting problem

Here, we demonstrate what happens to the SE of the critical interaction if we subset the data as described above. The steps are:

  1. Define code for simulating data from a 2x2x2 repeated measures design.
  2. Estimate the parameters of the full (nested-contrast) model from the Fedorenko data.
  3. Fit linear mixed models to the simulated data: one model subsets the data, the other uses the full data.
  4. Compare the SEs of the critical interaction in the two models.

Define simulation code

library(MASS)
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:rstatix':
## 
##     select
## assumes that the no. of subjects and the no. of items are divisible by 8.
gen_fake_lnorm2x2x2<-function(ncond=8,
                              nitem=NULL,
                         nsubj=NULL,
                         beta=NULL,
                         Sigma_u=NULL, # subject vcov matrix
                         Sigma_w=NULL, # item vcov matrix
                         sigma_e=NULL){
grouping<-matrix(rep(NA,ncond*ncond),ncol=ncond)
grouping[1,]<-1:8
grouping[2,]<-c(2:8,1)
grouping[3,]<-c(3:8,1:2)
grouping[4,]<-c(4:8,1:3)
grouping[5,]<-c(5:8,1:4)
grouping[6,]<-c(6:8,1:5)
grouping[7,]<-c(7:8,1:6)
grouping[8,]<-c(8,1:7)

  ## prepare data frame for 8 condition latin square:
  g1<-data.frame(item=1:nitem,
                 cond=rep(grouping[1,],nitem/ncond))
  g2<-data.frame(item=1:nitem,
                 cond=rep(grouping[2,],nitem/ncond))
  g3<-data.frame(item=1:nitem,
                 cond=rep(grouping[3,],nitem/ncond))
  g4<-data.frame(item=1:nitem,
                 cond=rep(grouping[4,],nitem/ncond))
  g5<-data.frame(item=1:nitem,
                 cond=rep(grouping[5,],nitem/ncond))
  g6<-data.frame(item=1:nitem,
                 cond=rep(grouping[6,],nitem/ncond))
  g7<-data.frame(item=1:nitem,
                 cond=rep(grouping[7,],nitem/ncond))
  g8<-data.frame(item=1:nitem,
                 cond=rep(grouping[8,],nitem/ncond))
  
  
  ## assemble data frame:
  gp1<-g1[rep(seq_len(nrow(g1)), 
              nsubj/ncond),]
  gp2<-g2[rep(seq_len(nrow(g2)), 
              nsubj/ncond),]
  gp3<-g3[rep(seq_len(nrow(g3)), 
              nsubj/ncond),]
  gp4<-g4[rep(seq_len(nrow(g4)), 
              nsubj/ncond),]
  gp5<-g5[rep(seq_len(nrow(g5)), 
              nsubj/ncond),]
  gp6<-g6[rep(seq_len(nrow(g6)), 
              nsubj/ncond),]
  gp7<-g7[rep(seq_len(nrow(g7)), 
              nsubj/ncond),]
  gp8<-g8[rep(seq_len(nrow(g8)), 
              nsubj/ncond),]
  
  fakedat<-rbind(gp1,gp2,gp3,gp4,gp5,gp6,gp7,gp8)
  
  ## add subjects:
  fakedat$subj<-rep(1:nsubj,each=nitem)

  ## name conditions:
## Nested analysis:
# Cond:   1   2   3   4   5   6    7  8  
## load   e   e   e   e   h   h    h  h
## noun   o   o   n   n   o   o    n  n
## rctyp  s   o   s   o   s   o   s   o
fakedat$load<-ifelse(fakedat$cond%in%c(1,2,3,4),"easy","hard")
fakedat$nountype<-ifelse(fakedat$cond%in%c(1,2,5,6),"occ","name")
fakedat$rctype<-ifelse(fakedat$cond%in%c(1,3,5,7),"subj","obj")

## nested contrast coding:
fakedat$ld<-ifelse(fakedat$load=="easy",1,-1)

fakedat$nn_easy<-ifelse(fakedat$load=="easy" & fakedat$nountype=="occ",-1,
                               ifelse(fakedat$load=="easy" & fakedat$nountype=="name",1,0))

fakedat$nn_hard<-ifelse(fakedat$load=="hard" & fakedat$nountype=="occ",-1,
                               ifelse(fakedat$load=="hard" & fakedat$nountype=="name",1,0))

fakedat$rc_easy<-ifelse(fakedat$load=="easy" & fakedat$rctype=="subj",-1,
                               ifelse(fakedat$load=="easy" & fakedat$rctype=="obj",1,0))
fakedat$rc_hard<-ifelse(fakedat$load=="hard" & fakedat$rctype=="subj",-1,
                               ifelse(fakedat$load=="hard" & fakedat$rctype=="obj",1,0))
    
## subject random effects:
  u<-mvrnorm(n=length(unique(fakedat$subj)),
             mu=c(rep(0,8)),Sigma=Sigma_u)
  
## item random effects
  w<-mvrnorm(n=length(unique(fakedat$item)),
             mu=c(rep(0,8)),Sigma=Sigma_w)

  ## generate data row by row:  
  N<-dim(fakedat)[1]
  rt<-rep(NA,N)
  for(i in 1:N){
    rt[i] <- rlnorm(1,beta[1] + 
                      u[fakedat[i,]$subj,1] +
                      w[fakedat[i,]$item,1] + 
                      (beta[2]+u[fakedat[i,]$subj,2]+
                         w[fakedat[i,]$item,2])*fakedat$ld[i]+
                      (beta[3]+u[fakedat[i,]$subj,3]+
                         w[fakedat[i,]$item,3])*fakedat$nn_easy[i]+
                      (beta[4]+u[fakedat[i,]$subj,4]+
                         w[fakedat[i,]$item,4])*fakedat$nn_hard[i]+
                      (beta[5]+u[fakedat[i,]$subj,5]+
                         w[fakedat[i,]$item,5])*fakedat$rc_easy[i]+
                      (beta[6]+u[fakedat[i,]$subj,6]+                       
                         w[fakedat[i,]$item,6])*fakedat$rc_hard[i]+
                      ## the interaction predictors are the products of the two
                      ## nested contrasts (note: ':' would create a sequence
                      ## here, not a product, so '*' is needed):
                      (beta[7]+u[fakedat[i,]$subj,7]+
                         w[fakedat[i,]$item,7])*(fakedat$nn_easy[i]*fakedat$rc_easy[i])+
                      (beta[8]+u[fakedat[i,]$subj,8]+
                         w[fakedat[i,]$item,8])*(fakedat$nn_hard[i]*fakedat$rc_hard[i]),
                    
                   sigma_e) 
  }   
  fakedat$rt<-rt
  fakedat$subj<-factor(fakedat$subj)
  fakedat$item<-factor(fakedat$item)
  fakedat
  }
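
Before plugging in the estimated parameters, the generator can be sanity-checked with made-up values. A minimal sketch (the covariance matrices and effect sizes below are arbitrary, chosen only to confirm that the Latin square and the nested contrasts come out balanced):

toy<-gen_fake_lnorm2x2x2(ncond=8,nitem=32,nsubj=48,
                         beta=c(6,rep(0.01,7)), # arbitrary values on the log scale
                         Sigma_u=diag(0.01,8),  # arbitrary subject vcov matrix
                         Sigma_w=diag(0.01,8),  # arbitrary item vcov matrix
                         sigma_e=0.5)
## each of the 8 conditions should appear equally often:
xtabs(~load+nountype+rctype,toy)
head(toy)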

Fit model to full data

In order to estimate all parameters, we will have to fit a Bayesian model.

library(brms)
priors <- c(
  prior(normal(0,10), class = Intercept),
  prior(normal(0,1), class = b),
  prior(normal(0, 1), class = sd),
  prior(normal(0, 1), class = sigma)
)

m_brms<-brm(RT~ 1+ld + nn_easy + rc_easy + nn_hard + rc_hard + 
              nn_easy:rc_easy + nn_hard:rc_hard+
              (1+ld + nn_easy + rc_easy + nn_hard + rc_hard + 
              nn_easy:rc_easy + nn_hard:rc_hard|subj)+
              (1+ld + nn_easy + rc_easy + nn_hard + rc_hard + 
              nn_easy:rc_easy + nn_hard:rc_hard|item),
            family=lognormal(),
        prior=priors,
        chains=4,
        iter = 2000,
        warmup =1000,
        control = list(adapt_delta = 0.99,
                               max_treedepth=15),
        df_fedorenko06)
## save model to avoid recomputation:
#save(m_brms,file="m_brms.rda")

Extract parameter estimate means from model

Next, we extract the posterior mean of each parameter: the 8 fixed effects, 16 random-effect standard deviations (the intercept and 7 slopes, for subjects and for items), 56 (gulp) correlations (28 among the subject random effects and 28 among the item random effects), and the residual standard deviation.

load("m_brms.rda")
post<-brms::posterior_summary(m_brms)[1:81,1]
## fixed effects (the first 8 parameters); these are needed below for the simulation:
beta<-post[1:8]
beta_names<-names(beta)
itemsd<-post[9:16]
itemcor<-post[25:52]
subjsd<-post[17:24]
subjcor<-post[53:80]
sigma_e<-post[81]
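
The index ranges used above (1:8 for the fixed effects, 9:16 and 17:24 for the item and subject standard deviations, 25:52 and 53:80 for the correlations, 81 for sigma) depend on how brms orders its parameters, so it is worth verifying them against the parameter names. A minimal sketch:

## inspect the parameter names to confirm the index ranges:
rownames(brms::posterior_summary(m_brms))[1:81]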

Next, assemble the correlation matrices. This would normally be very tedious, because the off-diagonal entries have to be filled in with the correlations. Luckily, R's upper.tri() function makes this straightforward.

diagvals<-rep(1,8)
subjcorm <- matrix(NA, ncol = length(diagvals), 
                   nrow = length(diagvals))
subjcorm[upper.tri(subjcorm)] <- subjcor

## fill up lower triangular matrix:
t_subjcorm<-t(subjcorm)
t_subjcorm[upper.tri(t_subjcorm)] <- subjcor
## symmetric matrix:
subjcorm<-t_subjcorm
diag(subjcorm)<-diagvals
## symmetric:
subjcorm==t(subjcorm)
##      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
## [1,] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [2,] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [3,] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [4,] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [5,] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [6,] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [7,] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [8,] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE

Do the same for the item correlation matrix:

itemcorm <- matrix(NA, ncol = length(diagvals), 
                   nrow = length(diagvals))
itemcorm[upper.tri(itemcorm)] <- itemcor

## fill up lower triangular matrix:
t_itemcorm<-t(itemcorm)
t_itemcorm[upper.tri(t_itemcorm)] <- itemcor
## symmetric matrix:
itemcorm<-t_itemcorm
diag(itemcorm)<-diagvals
## symmetric:
itemcorm==t(itemcorm)
##      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
## [1,] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [2,] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [3,] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [4,] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [5,] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [6,] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [7,] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [8,] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE

Next, assemble the variance-covariance matrices for the subject and item random effects:

Sigma_u<-SIN::sdcor2cov(stddev=subjsd,
                          corr=subjcorm)
Sigma_w<-SIN::sdcor2cov(stddev=itemsd,
                        corr=itemcorm)
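
Because these covariance matrices were pieced together from posterior means of individual standard deviations and correlations, they are not guaranteed to be positive definite, which mvrnorm() requires. A quick check (a sketch):

## the smallest eigenvalues should be positive; if not, a correction
## such as Matrix::nearPD() would be needed before simulating:
min(eigen(Sigma_u,only.values=TRUE)$values)
min(eigen(Sigma_w,only.values=TRUE)$values)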

Simulate data

In this simulation, we will fit two models, one with the full data, and one with the subsetted data. Then, we will check how often the SE associated with the critical interaction is larger in the full vs the subsetted model.

I computed some other summary statistics too (like power for the interaction), but looking at SEs summarizes my main point.

nsim<-100
betaintsefull<-betaintsesubset<-sigmafull<-sigmasubset<-tvalsfull<-tvalssubset<-rep(NA,nsim)
for(i in 1:nsim){
  print(paste("iteration",i,sep=" "))
fakedat<-gen_fake_lnorm2x2x2(ncond=8,
                    nitem=32,
                    nsubj=48,
                         beta=beta,
                         Sigma_u=Sigma_u, # subject vcov matrix
                         Sigma_w=Sigma_w, # item vcov matrix
                         sigma_e=sigma_e)

mfull<-lmer(log(rt)~ld + nn_easy+rc_easy + nn_hard+rc_hard + nn_easy:rc_easy + nn_hard:rc_hard +
           (1+ld + nn_easy+rc_easy + nn_hard+rc_hard + nn_easy:rc_easy + nn_hard:rc_hard ||subj) + 
           (1+ld + nn_easy+rc_easy + nn_hard+rc_hard + nn_easy:rc_easy + nn_hard:rc_hard||item),fakedat,
           control=lmerControl(optimizer="nloptwrap", calc.derivs = FALSE))
sigmafull[i]<-sigma(mfull)
tvalsfull[i]<-summary(mfull)$coefficients[8,3]
betaintsefull[i]<-summary(mfull)$coefficients[8,2]


## subset the data:
fakehard<-subset(fakedat,load=="hard")
fakehard$noun<-ifelse(fakehard$nountype=="occ",1,-1)
fakehard$rc<-ifelse(fakehard$rctype=="obj",1,-1)
msubset<-lmer(log(rt)~noun+rc+noun:rc+(1+noun+rc+noun:rc||subj) + (1+noun+rc+noun:rc||item),fakehard,
              control=lmerControl(optimizer="nloptwrap", calc.derivs = FALSE))
sigmasubset[i]<-sigma(msubset)
tvalssubset[i]<-summary(msubset)$coefficients[4,3]
betaintsesubset[i]<-summary(msubset)$coefficients[4,2]
}

save(betaintsefull,file="betaintsefull.rda")
save(betaintsesubset,file="betaintsesubset.rda")
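
As mentioned above, power for the critical interaction can also be compared across the two analyses, using the t-values stored in the loop. A minimal sketch, with the conventional |t| > 2 criterion:

## proportion of simulations with a significant interaction:
mean(abs(tvalsfull)>2)
mean(abs(tvalssubset)>2)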

The proportion of times that the SE of the interaction in the full model is larger than in the subsetted model:

load("betaintsefull.rda")
load("betaintsesubset.rda")

mean(betaintsefull>betaintsesubset)
## [1] 0.44

This means that the SE of the critical interaction can be overly enthusiastic (too small) when we subset the data. That's not a good thing.
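
A more direct look at the same comparison (a sketch): the distribution of the ratio of the full-model SE to the subsetted-model SE across the simulation runs.

summary(betaintsefull/betaintsesubset)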

PS If there are any mistakes in the code above, please let me know and I will correct them.