## Tuesday, November 22, 2016

### Statistics textbooks written by non-statisticians: Generally a Bad Idea

A methodologist from psychology called Russell Warne writes on twitter:

It is of course correct that you can usually increase power by increasing sample size.

But a lot of the other stuff in this paragraph is wrong or misleading. If this is an introductory statistics textbook for psychologists, it will cause a lot of harm: a whole new generation of psychologists will emerge with an incorrect understanding of the frequentist point of view to inference. Here are some comments on his text:
1. "When a study has low statistical power, it raises the possibility that any rejection of the null hypothesis is just a fluke, i.e., a Type I error": A fluke rejection of a null hypothesis, isn't that the definition of a Type I error? So, low power raises the possibility that a rejection is a Type I error? There is so much wrong here. First of all, Type I error is associated with hypothetical replications of the experiment. It is a statement about the long-run repetitions of the procedure, not about the specific experiment you did. You cannot talk of a particular result being a "Type I error" or not. Second, the sentence above says that if power is low, you could end up with an incorrect rejection; the implication is that if power is high, you are unlikely to end up with an incorrect rejection! What the author should have said is that when power is low, by definition the probability of correctly detecting the effect is low. Period. Furthermore, the much more alarming consequence of low power is Type S and M errors (see my next point below). I'm surprised that psychologists haven't picked this up yet.
2. When power is low, "...the study should most likely not have been able to reject the null hypothesis at all. So, when it does reject the null hypothesis, it does not seem like a reliable result": One word that should be banned in psych* is "reliable"; it gives people the illusion that they have found out something that is true. It is never going to be the case that you can say with 100% certainty that you found out the truth. If reliable means "true, reflecting reality correctly", you will *never* know that you have a reliable result. The trouble with words like "reliable" is that people read a sentence like the one above and then construct its meaning by considering the converse situation, when power is high. The implication is that when power is high, the rejection of the null hypothesis is "reliable". I have lost count of how many times I have heard psych* people telling me that a result is "reliable", implying that they found something that is true of nature. Even when power is high, you still have a Type I error rate of whatever your $\alpha$ is. So any individual result you get could be an incorrect rejection; it doesn't matter what you think the power is. A further important point: how do you *know* what power you have? Due to Type S and M errors, you are most likely doing your calculation based on previous, underpowered studies, so you are going to get gross overestimates of power anyway. Power is a function, and you will typically have a lot of uncertainty associated with your estimate of the plausible values of power under different assumptions (after all, you don't *know* what the true effect is, right? If you knew already, why would you be doing the study?). Giving a student the false security of saying "oh, I have high power, so my result is reliable" is pretty irresponsible, and is part of the reason why we keep messing up again and again and again.
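The long-run character of these error rates is easy to check by simulation. Below is a minimal sketch of mine (not from the textbook under discussion), assuming a two-sided one-sample z-test with known standard deviation: under a true null, the rejection rate hovers around $\alpha$ whatever the sample size (and hence the power) is, while under a true effect the rejection rate, i.e., the power, does depend on sample size.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)

def rejection_rate(mu, n, alpha=0.05, nsim=5000):
    """Proportion of nsim simulated two-sided one-sample z-tests
    (known sigma = 1) that reject H0: mu = 0 at level alpha."""
    crit = norm.ppf(1 - alpha / 2)
    samples = rng.normal(mu, 1.0, size=(nsim, n))
    z = samples.mean(axis=1) * np.sqrt(n)
    return float(np.mean(np.abs(z) > crit))

# Under a true null (mu = 0), the rejection rate is ~alpha at every n,
# even though power against any fixed alternative differs wildly across n:
for n in (10, 100, 1000):
    print(n, rejection_rate(0.0, n))

# Under a true effect of 0.3 SD, the rejection rate (the power) grows with n:
for n in (10, 100, 1000):
    print(n, rejection_rate(0.3, n))
```

The first loop is the point of comment 1 above: the probability of a fluke rejection, conditional on the null being true, is fixed at $\alpha$ by construction and has nothing to do with power.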

#### 13 comments:

Moritz Körber said...

> You cannot talk of a particular result being a "Type I error" or not.

You can if you know the truth. Now you confuse Type 1 error & Type 1 error rate :)

> What the author should have said is that when power is low, by definition the
> probability of correctly detecting the effect is low.

I think the author meant that low power leads to a high false discovery rate (= low PPV), which is true. That also applies to your second comment. I think it would be more important to know that any study will reach significance if the sample size is high enough, regardless of the actual effect or its practical significance. Maybe you might want to add this!

Shravan Vasishth said...

Hi Moritz, regarding your first point, can you point me to one experiment where you already knew the truth? ;) Also, why did you do the experiment then? Sure, you're right that one can talk about error rates in this sense, but I am talking about the case where you have done an experiment and are now trying to reason about what it means.

Moritz Körber said...

Yes, you are right, the definition of a Type I error can hardly be applied to a real-life study. I just wanted to point out that the error rate and making a Type I error are not the same conceptually. I hope the article you commented on gets revised.

Ulrich Schimmack said...

You have written many negative comments on twitter without backing them up, so it is great that you finally made some arguments.

On the first argument, I agree that it is confusing to link power (1 - type II error) to type-I errors.

The only way to relate the two is that power and type-II errors depend on the a priori specification of alpha (cond. prob. of making a type-I error, if null-hypothesis is true).

If I lower alpha, I decrease power. If I increase alpha, I increase power.

But for any given alpha, the conditional probability of making a type I error IN A SINGLE STUDY is specified a priori and is not dependent on power, sample size, effect size, etc.
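The alpha-power tradeoff, and the a priori nature of the conditional Type I error probability, can be sketched with the usual normal-approximation power formula (a toy sketch with made-up numbers, assuming a two-sided one-sample z-test with known sigma = 1):

```python
from scipy.stats import norm

def power_z(delta, n, alpha):
    """Power of a two-sided one-sample z-test (known sigma = 1)
    against a true effect of delta standard deviations."""
    z_crit = norm.ppf(1 - alpha / 2)
    shift = delta * n ** 0.5
    # probability of landing in either rejection region
    return norm.sf(z_crit - shift) + norm.cdf(-z_crit - shift)

# Lowering alpha lowers power; raising alpha raises power:
for alpha in (0.01, 0.05, 0.10):
    print(alpha, round(power_z(0.5, 30, alpha), 3))

# But the conditional Type I error probability is alpha itself,
# whatever n (and hence power) is: plug in delta = 0:
for n in (10, 100, 1000):
    print(n, round(power_z(0.0, n, 0.05), 3))  # always 0.05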

=====================================

At the same time, we can ask what researchers do when they test an important hypothesis and get a non-significant result. First, they might conclude that there is no effect. Wrong! A non-significant result might be due to a type-II error, ESPECIALLY when the study has LOW POWER. So, a logical thing to do would be to do another study with more power. But now, the type-I error increased because there is now more than one chance to get a significant result. This problem would not occur if the first study had HIGH POWER and produced a true positive result. No second study is needed. So, HIGH POWER means fewer tests of the same hypothesis will be conducted and as a result the probability (not the conditional probability alpha) of making a type-I error is reduced.
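The arithmetic behind "more than one chance to get a significant result" is the standard familywise error rate calculation. A minimal sketch, assuming k independent tests of a true null at alpha = .05:

```python
alpha = 0.05

# Probability of at least one false rejection in k independent tests
# of a true null hypothesis:
for k in (1, 2, 5):
    print(k, round(1 - (1 - alpha) ** k, 4))
# 1 -> 0.05, 2 -> 0.0975, 5 -> 0.2262
```

So two tries at the same true null already push the chance of at least one false rejection to nearly 10%.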

Shravan Vasishth said...

Hi Ulrich, I agree with everything up to before the point when you say:

"But now, the type-I error increased because there is now more than one chance to get a significant result."

I don't understand what you mean here. When I run the second study, my Type I error changes? How is that? My Type I error in the second study will be whatever I set alpha to be, say 0.05.

Ulrich Schimmack said...

It is all about language and communication.

alpha = conditional probability of making a type I error given the null-hypothesis

p(type-I error) = the actual probability of making a type I error in a set of real data where the null-hypothesis is true for some studies.

So, we need to distinguish the conditional probability given H0 from the unconditional probability.
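One way to read this distinction (a toy sketch with made-up numbers, not from either commenter): in a literature where some tested nulls are true and some are false, the unconditional probability of a Type I error also depends on how often the null is true, while the conditional probability stays at alpha.

```python
# Hypothetical literature: 60% of tested nulls are true; alpha = .05, power = .8
pi0, alpha, power = 0.6, 0.05, 0.8

p_type1 = pi0 * alpha           # P(null true AND rejected): unconditional Type I
p_true_pos = (1 - pi0) * power  # P(effect real AND detected)

# Conditional on H0, the error probability is alpha (.05) by construction;
# unconditionally it is only .03 here, and among significant results the
# fraction of flukes is:
fdr = p_type1 / (p_type1 + p_true_pos)
print(round(p_type1, 3))  # 0.03
print(round(fdr, 3))      # 0.086
```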

Shravan Vasishth said...

I think that you have come up with this idea of an "actual probability of making a Type I error" all by yourself. It isn't part of the logic of frequentist methodology. If you can show me one math stats text that discusses this concept I will be interested in following up.

Also, there is no concept of "making a Type I error" with reference to a given, specific study or set of studies. Type I error is by definition equal to alpha (so your usage of Type I error to define alpha is also very odd). Type I error is the probability of rejecting the null hypothesis given that it is true (I know you know this) under repeated sampling. It's a statement about the long-run properties of the experiment. We cannot look at a particular study and ask, what is the probability of *this* study making a Type I error. This is what you seem to be doing.

It's possible I have misunderstood something of course.

Ulrich Schimmack said...

This may help to understand my point.

http://shinyapps.org/showapp.php?app=http://87.106.45.173:3838/felix/PPV&by=Michael%20Zehetleitner%20and%20Felix%20Sch%C3%B6nbrodt&title=When%20does%20a%20significant%20p-value%20indicate%20a%20true%20effect?&shorttitle=When%20does%20a%20significant%20p-value%20indicate%20a%20true%20effect?

Shravan Vasishth said...

I'll look into the PPV, I admit that this is new to me. But do you agree that a single p-value from one study cannot *ever* tell you that an effect is "true"?

Ulrich Schimmack said...

Of course. A single p-value can only tell you how likely it was to obtain the result or an even more extreme one by chance alone.

The lower this probability, the less likely that it is a chance finding.

If p < 1:1 billion, I think it unlikely that researchers did 1 billion studies or p-hacked the shit out of a data set to report p < .05.

But we can never be sure.

Matt said...

I totally agree with the general sentiment (that statistics textbooks should ideally be written by statisticians). I'll avoid naming names, but I can think of several very widely used statistics texts by non-statisticians that have a lot of dubious material in them.

But in this particular case, I don't think you've identified an error per se. Basically, what the author is saying is that the posterior probability that H1 is true is higher if you have high power + a significant p-value than if you have low power + a significant p-value (ceteris paribus). This is correct, and indeed quite trivially so.

You could perhaps argue here that there's a conceptual confusion in the sense of this representing a combination of Bayesian thinking and frequentist methods, and that trying to reframe this issue in a purely frequentist manner is somewhat difficult. But this definitely isn't a case of a non-statistician falling for some trivial misconception about statistics. The ones to watch out for are people saying just flat-out wrong things like "regression assumes a normally distributed dependent variable" :/

Matt McBee said...

I assumed that Warne was talking about PPV in his intro. And it is true that low power reduces the PPV. But power is a relatively weak determinant of PPV; alpha matters much more. Psych would be in a much better place if the conventional alpha were .01 instead of .05.
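This claim is easy to check with the standard PPV formula (a sketch; the prior of .5 is a made-up number of mine): cutting alpha can buy more PPV than doubling power.

```python
def ppv(power, alpha, prior=0.5):
    """P(effect is real | significant result), given a prior P(H1)."""
    true_pos = prior * power
    false_pos = (1 - prior) * alpha
    return true_pos / (true_pos + false_pos)

print(round(ppv(0.4, 0.05), 3))  # baseline: 0.889
print(round(ppv(0.8, 0.05), 3))  # doubling power: 0.941
print(round(ppv(0.4, 0.01), 3))  # alpha = .01 instead: 0.976
```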

Shravan Vasishth said...

> Of course. A single p-value can only tell you how likely it was to obtain the result or an even more extreme one by chance alone.
>
> The lower this probability, the less likely that it is a chance finding.
>
> If p < 1:1 billion, I think it unlikely that researchers did 1 billion studies or p-hacked the shit out of a data set to report p < .05.
>
> But we can never be sure.

Your sarcasm is noted. But note also that you have misunderstood the definition of the p-value and misunderstood what it means (a fairly typical situation in psych and ling).

A single p-value can only tell you how likely it was to obtain the *statistic* (not the result) or an even more extreme one by chance alone, ASSUMING THE NULL IS TRUE.

I know that people just say "by chance alone" to mean "by chance alone assuming the null is true", but they then proceed to make a linguistic error by concluding that their result is *not* by chance, i.e., reliable. And then the confusion that you are facing right now is where we inevitably end up.

The p-value gives you an answer about the probability of seeing the statistic you got (or something more extreme) assuming the null hypothesis is true. It tells you *nothing* about your specific hypothesis. The mistake you are making is thinking: oh, the p-value is super low, so the probability that my specific hypothesis (what you call the result) occurred by chance is nearly zero. WRONG, WRONG, WRONG, as Trump would say. The p-value gives you no information about the specific hypothesis you are interested in; it only allows you to confidently reject the hypothesis that $\mu=0$ or whatever your null is. It is answering a question, but it is answering the wrong question.

So your sneering comment about 1 in a billion is really barking up the wrong tree. You can get excited that you are really, really sure that you rejected the null; I am with you there. But that leaves me with no information about my specific hypothesis. For that, you have to go look at the sample mean and pretend your CI is telling you something about your uncertainty about the estimate of the true value. You are trying to make the omelette without breaking the eggs.
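The point that the p-value is a statement about the statistic under the null can be checked by simulation (a sketch of mine, assuming one-sample t-tests run on pure-noise data):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)

# 10,000 one-sample t-tests on pure noise: H0 (mu = 0) is true every time.
pvals = np.array([stats.ttest_1samp(rng.normal(0.0, 1.0, 20), 0.0).pvalue
                  for _ in range(10_000)])

# Under a true null, p-values are uniform on [0, 1]: about 5% of these
# tests come out "significant" at .05, and no individual small p tells
# you anything about whether *your* rejection was one of them.
print(np.mean(pvals < 0.05))
```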