Thursday, April 05, 2018

Type M error and likelihood ratio tests

Type M error also affects likelihood ratio tests

The background for this post is the following paper entitled The statistical significance filter leads to overoptimistic expectations of replicability (Vasishth, Mertzen, Jäger, & Gelman, 2017, under review): https://psyarxiv.com/hbqcw

Abstract:

Treating a result as newsworthy, i.e., publishable, because the p-value is less than 0.05 leads to overoptimistic expectations of replicability. The underlying cause of these overoptimistic expectations is Type M(agnitude) error (Gelman & Carlin, 2014): when underpowered studies yield significant results, the effect size estimates are invariably exaggerated and noisy. These effects get published, leading to an illusion that the reported findings are robust and replicable. For the first time in psycholinguistics, we demonstrate the adverse consequences of this statistical significance filter. We do this by carrying out direct replication attempts of published results from a recent paper. Six experiments (self-paced reading and eyetracking, 168 participants in total) show that the published (statistically significant) claims are so noisy that even non-significant results are fully compatible with them. We also demonstrate the stark contrast between these small-sample studies and a larger-sample study (100 participants); the latter yields much less noisy estimates but also a much smaller magnitude of the effect of interest. The small magnitude looks less compelling but is more realistic. We suggest that psycholinguistics (i) move its focus away from statistical significance, (ii) attend instead to the precision of their estimates, and (iii) carry out direct replications in order to demonstrate the existence of an effect.

Someone suggested to me that the likelihood ratio test takes the alternative hypothesis into account, so it's not just giving evidence against the null hypothesis, as the t-test does. I show below why Type M error, coupled with the statistical significance filter, ensures that it doesn't matter whether we use t-tests or likelihood ratio tests. We are always comparing the null against an alternative hypothesis that can have a highly biased mean.

Note that in this whole discussion, I am only interested in low power situations.

Consider what you mean when you state a null hypothesis. When you write:

\(H_0: \mu=0\)

you are not making a statement only about a point value; you are making a distributional statement. You are saying that the data are generated from a \(Normal(0,\sigma)\) distribution, and hence that the sample mean is distributed as \(Normal(0,SE)\), where the standard error \(SE\) (\(=\sigma/\sqrt{n}\)) is estimated from the data. So why don't we write

\(H_0: X \sim Normal(0,\sigma),\)

where \(X\) is the random variable generating the data? I don't know; someone more knowledgeable than me can hopefully comment on this.

The alternative hypothesis then becomes:

\(H_a: X \sim Normal(\mu,\sigma), \mu\neq 0\)

Note that at this stage, before collecting any data, we have no specific \(\mu\) in mind; it can be any number other than 0.

So what you do is this: given a vector of iid data generated from a random variable \(Y\), you compute the sample mean \(\bar{y}\) and compare the relative likelihood of the data under \(Normal(\bar{y},\sigma)\) versus \(Normal(0,\sigma)\). In other words, the observed sample mean is used to posit an alternative distribution post hoc: you give your alternative distribution a location parameter after the fact, after you have seen the data. Could it be any different? Yes, you could have pre-specified a \(\mu\) before running the experiment. For example, we could have pre-specified the alternative hypothesis mean:

\(H_a: X \sim Normal(0.1,\sigma)\)

As a concrete example, suppose we know that \(\sigma=1\). We could then have:

\(H_0: X \sim Normal(0,1)\)

\(H_a: X \sim Normal(0.1,1)\)
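
In terms of likelihoods, which is what the code below computes, these two hypotheses amount to scoring the observed data \(y_1,\dots,y_n\) under a normal density with the hypothesized mean plugged in (a sketch of the notation, treating \(\sigma=1\) as known, as the code below does): under \(H_0\) the plug-in is 0, under this \(H_a\) it is 0.1, and in standard practice it will be the sample mean \(\bar{y}\).

\(L(\mu_0) = \prod_{i=1}^{n} Normal(y_i \mid \mu_0, 1), \qquad \ell(\mu_0) = \log L(\mu_0)\)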

For a sample size of 10, our power is going to be only about 6%:

power.t.test(d=.1,n=10,sd=1,type="one.sample",alternative="two.sided",strict=TRUE)
## 
##      One-sample t test power calculation 
## 
##               n = 10
##           delta = 0.1
##              sd = 1
##       sig.level = 0.05
##           power = 0.0592903
##     alternative = two.sided
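
As a rough cross-check, here is the same power calculation done by hand under the simplifying assumption used in the likelihood ratio computations below, namely that \(\sigma=1\) is known (so the test is effectively a z-test rather than a t-test):

## power of the two-sided test with known sigma=1, n=10, true mean 0.1:
crit<-qnorm(0.975)/sqrt(10) ## reject when |ybar| exceeds this value
pnorm(crit,mean=0.1,sd=1/sqrt(10),lower.tail=FALSE)+
  pnorm(-crit,mean=0.1,sd=1/sqrt(10))
## roughly 0.06, in line with power.t.test above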

Now suppose I sample 10 data points from the alternative distribution, i.e., the null is in fact false. I compute the sample mean and then do a likelihood ratio test using the observed mean, not the hypothesized alternative mean of 0.1:

set.seed(4321)
y<-rnorm(10,mean=0.1,sd=1)
(ybar<-mean(y))
## [1] 0.3810558
(D<- -2*log(prod(dnorm(y,mean=0,sd=1))/prod(dnorm(y,mean=ybar,sd=1))))
## [1] 1.452036
## alternatively:
(D<-2*(sum(dnorm(y,mean=ybar,sd=1,log=TRUE)) - sum(dnorm(y,mean=0,sd=1,log=TRUE))))
## [1] 1.452036
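
Spelling out what the code above computes, with \(\ell(\cdot)\) the log likelihood defined earlier and \(\sigma=1\) treated as known:

\(D = -2 \log \dfrac{\prod_{i=1}^{n} Normal(y_i \mid 0, 1)}{\prod_{i=1}^{n} Normal(y_i \mid \bar{y}, 1)} = 2[\ell(\bar{y}) - \ell(0)] = n\bar{y}^2\)

The last equality holds only in this known-\(\sigma\) normal case (more generally it is \(n\bar{y}^2/\sigma^2\)); here it gives \(10 \times 0.381^2 \approx 1.45\), matching the value above. It also makes explicit that D grows with the square of the sample mean, so an exaggerated sample mean translates directly into a large test statistic.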

Under the null hypothesis, this D value follows a Chi-squared distribution with degrees of freedom equal to the difference in the number of parameters estimated under the two hypotheses, which here is 1. If D is larger than the critical value below, we reject the null; otherwise we fail to reject. Here, we fail to reject:

## critical D value:
qchisq(0.05,df=1,lower.tail=FALSE)
## [1] 3.841459
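
Equivalently, the observed D can be converted into a p-value; this is just a restatement of the comparison against the critical value:

pchisq(D,df=1,lower.tail=FALSE)
## roughly 0.23, so again we fail to reject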

In practice, we don't do the above likelihood ratio test based on the a priori hypothesized alternative \(\mu = 0.1\), but we could do it:

(D<-2*(sum(dnorm(y,mean=.1,sd=1,log=TRUE)) - sum(dnorm(y,mean=0,sd=1,log=TRUE))))
## [1] 0.6621117

Either way, we cannot reject the null here. But when we sample data from \(Y\), we can end up, just by chance, with large sample means like 1 or -1, even when the null hypothesis is in fact true.

Suppose again that our null and alternative hypotheses have specific location parameters, as above:

\(H_0: X \sim Normal(0,1)\)

\(H_a: X \sim Normal(0.1,1)\)

When we run a single experiment, we could get a biased sample with an overly large mean. This can happen even though we run the experiment only once and the probability of getting an overly large mean is small. It's the same as when you toss a coin once, with probability of heads 0.5, and it comes up heads: it wasn't overwhelmingly likely to happen, but it can happen. As a statistician once said, "Anything that can happen will happen."
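
To put a number on "small" here: under the null, the sample mean of 10 points is distributed as \(Normal(0, 1/\sqrt{10})\), so the probability of a sample mean larger than 1.2 is tiny, but it is not zero:

## P(ybar > 1.2) when the true mean is 0, n=10, sigma=1:
pnorm(1.2,mean=0,sd=1/sqrt(10),lower.tail=FALSE)
## roughly 7e-05, i.e., it should show up a handful of times in 100000 simulations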

So, let’s imagine a situation where we get a biased estimate of the population mean:

nsim<-100000
ybars<-rep(NA,nsim)
ylarge<-rep(NA,10)
for(i in 1:nsim){
  y<-rnorm(10,mean=0,sd=1)
  ybars[i] <- mean(y)
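  ## save one such sample whose mean happens to exceed 1.2 (an overly large estimate of the true mean, 0):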
  if(ybars[i]>1.2){ylarge<-y}
}
## overly large estimate:
mean(ylarge)
## [1] 1.212319

We could get a large sample mean of about 1.2. The t-test would then reject the null, incorrectly, because these data were in fact generated from the null distribution with mean 0:

ylarge
##  [1]  0.08338258  0.56041163 -0.07557929  0.73725109  2.04243316
##  [6]  0.92797687  2.06577006  2.86887155  2.23031663  0.68235265
mean(ylarge)
## [1] 1.212319
t.test(ylarge)
## 
##  One Sample t-test
## 
## data:  ylarge
## t = 3.8035, df = 9, p-value = 0.004195
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  0.4912805 1.9333568
## sample estimates:
## mean of x 
##  1.212319

Our likelihood ratio test would also give us reason to incorrectly reject the null:

(D<- -2*log(prod(dnorm(ylarge,mean=0,sd=1))/prod(dnorm(ylarge,mean=mean(ylarge),sd=1))))
## [1] 14.69717
pchisq(D,df=1,lower.tail=FALSE)
## [1] 0.0001262361

Note that we don’t plug in the a priori hypothesized alternative mean of 0.1. If we had done that, we would have no call to reject the null:

(D<- -2*log(prod(dnorm(ylarge,mean=0,sd=1))/prod(dnorm(ylarge,mean=.1,sd=1))))
## [1] 2.324637
pchisq(D,df=1,lower.tail=FALSE)
## [1] 0.1273399

We never hypothesize an alternative mean a priori. We always use the sample mean as the post-hoc proxy for the true (but unknown) alternative mean. And if the sample mean ends up being biased, the likelihood ratio test is going to give you a biased answer to the question: can we reject the null?

The problem here is again Type M error: even though the sample mean is a maximum likelihood estimate, it is extremely noisy when power is as low as the 6% we have here. We will occasionally get overly large sample means (exaggerated estimates of the true mean), and because of the statistical significance filter, those are the ones that tend to get reported in publications. So, using the likelihood ratio test does not solve the problem of Type M error.
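
The point can be made with a quick simulation along the same lines as above (an illustrative sketch using the same known-\(\sigma\) setup, not an analysis from the paper): sample repeatedly from the true alternative \(Normal(0.1,1)\) with \(n=10\) (about 6% power), keep only the statistically significant likelihood ratio tests, and look at the magnitude of the estimates that survive the filter.

nsim<-10000
ybars<-Ds<-rep(NA,nsim)
for(i in 1:nsim){
  y<-rnorm(10,mean=0.1,sd=1)
  ybars[i]<-mean(y)
  ## likelihood ratio statistic with sigma=1 known, as above:
  Ds[i]<-2*(sum(dnorm(y,mean=ybars[i],sd=1,log=TRUE))-
              sum(dnorm(y,mean=0,sd=1,log=TRUE)))
}
sig<-Ds>qchisq(0.05,df=1,lower.tail=FALSE)
## proportion significant (close to the 6% power) and the
## average magnitude of the significant estimates:
mean(sig)
mean(abs(ybars[sig]))

With \(n=10\) and \(\sigma=1\), significance requires \(|\bar{y}| > \sqrt{3.84/10} \approx 0.62\), so every estimate that passes the filter is at least six times the true magnitude of 0.1. The likelihood ratio test filters estimates in essentially the same way the t-test does.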