Thursday, April 05, 2018

Type M error and likelihood ratio tests

Type M error also affects likelihood ratio tests

The background for this post is the following paper, entitled “The statistical significance filter leads to overoptimistic expectations of replicability” (Vasishth, Mertzen, Jäger, & Gelman, 2017, under review): https://psyarxiv.com/hbqcw

Abstract:

Treating a result as newsworthy, i.e., publishable, because the p-value is less than 0.05 leads to overoptimistic expectations of replicability. The underlying cause of these overoptimistic expectations is Type M(agnitude) error (Gelman & Carlin, 2014): when underpowered studies yield significant results, the effect size estimates are invariably exaggerated and noisy. These effects get published, leading to an illusion that the reported findings are robust and replicable. For the first time in psycholinguistics, we demonstrate the adverse consequences of this statistical significance filter. We do this by carrying out direct replication attempts of published results from a recent paper. Six experiments (self-paced reading and eyetracking, 168 participants in total) show that the published (statistically significant) claims are so noisy that even non-significant results are fully compatible with them. We also demonstrate the stark contrast between these small-sample studies and a larger-sample study (100 participants); the latter yields much less noisy estimates but also a much smaller magnitude of the effect of interest. The small magnitude looks less compelling but is more realistic. We suggest that psycholinguistics (i) move its focus away from statistical significance, (ii) attend instead to the precision of their estimates, and (iii) carry out direct replications in order to demonstrate the existence of an effect.

Someone suggested to me that the likelihood ratio test takes the alternative hypothesis into account, so it’s not just giving evidence against the null hypothesis, like the t-test is. I show below why Type M error, coupled with the statistical significance filter, ensures that it doesn’t matter whether we use t-tests or likelihood ratio tests: we are always comparing the null against an alternative hypothesis whose mean can be highly biased.

Note that in this whole discussion, I am only interested in low power situations.

Consider what you mean when you state a null hypothesis. When you write:

\(H_0: \mu=0\)

you are not making a statement only about a point value. You are making a distributional statement: you are saying that the data are generated from a \(Normal(0,\sigma)\) distribution, and hence that the sample mean is distributed as \(Normal(0,\sigma/\sqrt{n})\), where \(\sigma/\sqrt{n}\) is the standard error. So why don’t we write

\(H_0: X \sim Normal(0,\sigma)\)?

where \(X\) is the random variable generating the data? I don’t know; someone more knowledgeable than me can hopefully comment on this.

The alternative hypothesis then becomes:

\(H_a: X \sim Normal(\mu,\sigma), \mu\neq 0\)

Note that at this stage, before collecting any data, we have no specific \(\mu\) in mind; it can be any number other than 0.

So, given a vector of iid data generated from a random variable \(Y\), what you do is compute the sample mean, \(\bar{y}\), and compare the relative likelihood of the data under \(Normal(\bar{y},\sigma)\) versus \(Normal(0,\sigma)\). In other words, the observed sample mean is used to posit an alternative distribution post hoc: you give your alternative distribution a location parameter after the fact, after you have seen the data. Could it be any different? Yes, you could have pre-specified a \(\mu\) before running the experiment. For example, we could have pre-specified the alternative hypothesis mean:

\(H_a: X \sim Normal(0.1,\sigma)\)

As a concrete example, suppose we know that \(\sigma=1\). We could then have:

\(H_0: X \sim Normal(0,1)\)

\(H_a: X \sim Normal(0.1,1)\)

For a sample size of 10, our power is going to be about 6%:

power.t.test(d=.1,n=10,sd=1,type="one.sample",alternative="two.sided",strict=TRUE)
## 
##      One-sample t test power calculation 
## 
##               n = 10
##           delta = 0.1
##              sd = 1
##       sig.level = 0.05
##           power = 0.0592903
##     alternative = two.sided
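
As a cross-check (a small addition on my part, not in the original post), we can approximate this power by simulation: repeatedly draw samples of size 10 from the alternative \(Normal(0.1,1)\) and count how often the one-sample t-test rejects at \(\alpha=0.05\). The proportion should come out close to the 6% computed above.

## approximate power by simulation (sketch; the result will vary slightly with the seed)
set.seed(1)
nsim<-10000
rejections<-logical(nsim)
for(i in 1:nsim){
  y<-rnorm(10,mean=0.1,sd=1)
  rejections[i]<-t.test(y)$p.value<0.05
}
## proportion of rejections, i.e., estimated power (should be near 0.06):
mean(rejections)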

Now suppose I sample 10 data points from the alternative distribution, i.e., the null is in fact false. I then compute the sample mean and do a likelihood ratio test using the observed mean, not the hypothesized alternative mean of 0.1:

set.seed(4321)
y<-rnorm(10,mean=0.1,sd=1)
(ybar<-mean(y))
## [1] 0.3810558
(D<- -2*log(prod(dnorm(y,mean=0,sd=1))/prod(dnorm(y,mean=ybar,sd=1))))
## [1] 1.452036
## alternatively:
(D<-2*(sum(dnorm(y,mean=ybar,sd=1,log=TRUE)) - sum(dnorm(y,mean=0,sd=1,log=TRUE))))
## [1] 1.452036
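
Incidentally, because \(\sigma=1\) is treated as known here, D has a simple closed form: \(D = n\bar{y}^2 = (\bar{y}/(\sigma/\sqrt{n}))^2\), i.e., the square of the z statistic. A quick check (my addition, not in the original code):

## with sigma known, D reduces to n*ybar^2, the squared z statistic:
10*ybar^2 ## equals the D value above (about 1.45)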

This D value follows a Chi-squared distribution with degrees of freedom equal to the difference in the number of parameters estimated under the two hypotheses, which here is 1. If D is larger than the critical value below, we reject the null; otherwise we fail to reject. Here, we fail to reject:

## critical D value:
qchisq(0.05,df=1,lower.tail=FALSE)
## [1] 3.841459
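
Equivalently (a small addition, not in the original post), we can compute the p-value for the observed D directly. Note also that the critical value 3.84 is just \(1.96^2\), so with \(\sigma\) known this is the same decision rule as a two-sided z test.

## p-value for the D value computed above (about 1.45):
pchisq(D,df=1,lower.tail=FALSE) ## roughly 0.23, well above 0.05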

In practice we don’t base the likelihood ratio test on the a priori hypothesized alternative \(\mu = 0.1\), but we could:

(D<-2*(sum(dnorm(y,mean=.1,sd=1,log=TRUE)) - sum(dnorm(y,mean=0,sd=1,log=TRUE))))
## [1] 0.6621117

Either way, we can’t reject the null here. But when we sample the data \(Y\) and the null hypothesis is in fact true, we can still occasionally end up with large sample means like 1 or -1.

Suppose again that our null and alternative hypotheses have specific location parameters, as above:

\(H_0: X \sim Normal(0,1)\)

\(H_a: X \sim Normal(0.1,1)\)

When we run a single experiment, we could end up with a biased sample that has an overly large mean. This can happen even if we run the experiment only once, and even though the probability of getting an overly large mean is small. It’s like tossing a coin once, with probability of heads 0.5, and getting heads: it wasn’t overwhelmingly likely, but it can happen. As a statistician once said, anything that can happen will happen.
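
To make “small probability” concrete (a quick calculation I am adding, not in the original post): with \(n=10\) and \(\sigma=1\), the standard error is \(1/\sqrt{10}\approx 0.32\), so under the null the chance that the sample mean exceeds 1.2 is tiny, yet with enough attempts it will happen.

## P(sample mean > 1.2) under the null, with n=10 and sigma=1:
pnorm(1.2,mean=0,sd=1/sqrt(10),lower.tail=FALSE) ## roughly 7e-05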

So, let’s imagine a situation where we get a biased estimate of the population mean:

nsim<-100000
ybars<-rep(NA,nsim)
ylarge<-rep(NA,10)
## simulate many experiments in which the null (mu=0) is true,
## and save one sample whose mean happens to exceed 1.2:
for(i in 1:nsim){
  y<-rnorm(10,mean=0,sd=1)
  ybars[i] <- mean(y)
  if(ybars[i]>1.2){ylarge<-y}
}
## overly large estimate:
mean(ylarge)
## [1] 1.212319

We could end up with a large sample mean of about 1.2. The t-test would then (incorrectly) reject the null:

ylarge
##  [1]  0.08338258  0.56041163 -0.07557929  0.73725109  2.04243316
##  [6]  0.92797687  2.06577006  2.86887155  2.23031663  0.68235265
mean(ylarge)
## [1] 1.212319
t.test(ylarge)
## 
##  One Sample t-test
## 
## data:  ylarge
## t = 3.8035, df = 9, p-value = 0.004195
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  0.4912805 1.9333568
## sample estimates:
## mean of x 
##  1.212319

Our likelihood ratio test would also give us reason to incorrectly reject the null:

(D<- -2*log(prod(dnorm(ylarge,mean=0,sd=1))/prod(dnorm(ylarge,mean=mean(ylarge),sd=1))))
## [1] 14.69717
pchisq(D,df=1,lower.tail=FALSE)
## [1] 0.0001262361

Note that we did not plug in the a priori hypothesized alternative mean of 0.1. Had we done that, we would have had no grounds to reject the null:

(D<- -2*log(prod(dnorm(ylarge,mean=0,sd=1))/prod(dnorm(ylarge,mean=.1,sd=1))))
## [1] 2.324637
pchisq(D,df=1,lower.tail=FALSE)
## [1] 0.1273399

In practice, we never hypothesize an alternative mean a priori; we always use the sample mean as a post-hoc proxy for the true (but unknown) alternative mean. And if the sample mean is biased, the likelihood ratio test will give a biased answer to the question: can we reject the null?

The problem here is again Type M error: even though the sample mean is the maximum likelihood estimate, when power is low (here, 6%), the estimates that reach significance are invariably exaggerated. We will occasionally get overly large sample means (biased estimates of the true mean), and because of the statistical significance filter, we tend to report only those in publications. So using the likelihood ratio test does not solve the problem of Type M error.
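
To see the significance filter at work directly, here is a small simulation (my addition, not from the original post) in the spirit of Gelman & Carlin (2014): generate many low-power experiments with a true effect of 0.1, then compare the average estimate across all experiments with the average estimate among only the significant ones. The latter is grossly exaggerated.

## Type M error under the significance filter (sketch; exact numbers depend on the seed)
set.seed(1234)
nsim<-10000
means<-pvals<-rep(NA,nsim)
for(i in 1:nsim){
  y<-rnorm(10,mean=0.1,sd=1) ## true effect is 0.1
  means[i]<-mean(y)
  pvals[i]<-t.test(y)$p.value
}
## average estimate over all experiments (close to the true 0.1):
mean(means)
## average absolute estimate among significant experiments only (much larger than 0.1):
mean(abs(means[pvals<0.05]))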

7 comments:

Unknown said...

Great post. I fear that these conclusions hold for essentially all commonly used statistical measures, whether these are parameter estimates or hypotheses tests.

Scott Glover said...

"Someone suggested to me that the likelihood ratio test takes the alternative hypothesis into account, so it’s not just giving evidence against the null hypothesis, like the t-test is."

Someone was wrong if they were implying the LR is somehow magically resistant to sampling error because it's a symmetrical test.

The LR based on the MLE is directly related to t: LR = (1 + t^2/(n-1))^(n/2), where n is the sample size, and thus inversely related to p, so naturally it's going to be susceptible to the same sampling issues as t and p are. So, whenever you find a misleadingly small p you will also find a misleadingly large LR. Can't be avoided because the two are inversely related.

The advantage of the LR as a statistic is that the evidence is presented in a much more intuitive way with the LR than with p.

Also, I'm not sure what your test of Ha = .1 is meant to illustrate. If you set up your Ha very close to your Ho it's going to be hard to find (strong) evidence for either model.


Scott Glover said...

No likelihoodist I know would argue that the LR is somehow magically resistant to sampling error simply by dint of it being a symmetrical test. In fact, the LR based on MLE is inversely related to p and so will share many of its properties, for good or bad.

The advantage a symmetrical statistic like the LR has over p is that it's simply a more intuitive way to express the strength of the evidence.

Also, I'm not sure what your test of Ha = .1 is meant to illustrate. If you set up your Ha very close to the Ho, it's going to be hard to find (strong) evidence for either one.

Shravan Vasishth said...

Thanks for both comments, Scott. With Ha: mu = .1 I meant to illustrate the low power situation, which often occurs in my field. People will repeatedly find null results and claim they showed that mu = 0. Or they will get a sig effect and claim they found an effect, but then no-one else can find it. I'm focusing on situations where power is low, esp. on situations where you get a significant effect when you have an exaggerated estimate.

Scott Glover said...

Hi Shravan,

Thanks and sorry that was a double post by accident- edited and then posted again lol.

I see what you mean about low power. No easy solution to that unfortunately, except to increase power.

One positive quality about the LR is it describes the evidence more directly than a (misunderstood) p-value. I.e., a p-value of .045 might sound compelling to some but I doubt the corresponding LR of ~ 2.8 would. In these kinds of scenarios, using the LR might lead to people being more reticent to make strong claims based on weak evidence.

Shravan Vasishth said...

What do you think about reducing Type 1 error to 0.001, say?

Scott Glover said...

As a rule it doesn't strike me as practical really because of the cost in n.

More generally, there is an inherent problem with power calcs inasmuch as one never really knows the size of the effect and/or variance they are dealing with (if you did, you wouldn't need to do the experiment).

What I typically do is set up each exp't with a minimum effect size in mind that I would find interesting/important, take a guess at what the variance might be, and then use that to guess at the n I will need to find said effect most (~80%) of the time. And by 'find' I mean a likelihood ratio of around 10:1 or better (about p <= .015).