Tuesday, November 22, 2016

Statistics textbooks written by non-statisticians: Generally a Bad Idea

A methodologist from psychology called Russell Warne writes on Twitter:

It is of course correct that you can usually increase power by increasing sample size. 

But a lot of the other stuff in this paragraph is wrong or misleading. If this is an introductory statistics textbook for psychologists, it will cause a lot of harm: a whole new generation of psychologists will emerge with an incorrect understanding of the frequentist approach to inference. Here are some comments on his text:
  1. "When a study has low statistical power, it raises the possibility that any rejection of the null hypothesis is just a fluke, i.e., a Type I error": Isn't a fluke rejection of a null hypothesis simply the definition of a Type I error? So low power raises the possibility that a rejection is a Type I error? There is so much wrong here. First of all, Type I error is associated with hypothetical replications of the experiment. It is a statement about long-run repetitions of the procedure, not about the specific experiment you did. You cannot talk of a particular result being a "Type I error" or not. Second, the above sentence says that if power is low, you could end up with an incorrect rejection; the implication is that if power is high, you are unlikely to end up with an incorrect rejection! What the author should have said is that when power is low, by definition the probability of correctly detecting the effect is low. Full stop. Furthermore, the much more alarming consequence of low power is Type S and M errors (see my next point below). I'm surprised that psychologists haven't picked up on this yet.
  2. When power is low, "...the study should most likely not have been able to reject the null hypothesis at all. So, when it does reject the null hypothesis, it does not seem like a reliable result": I think that one word that should be banned in psych* is "reliable"; it gives people the illusion that they have found out something that is true. It is never going to be the case that you can say with 100% certainty that you found out the truth. If reliable means "true, reflecting reality correctly", you will *never* know that you have a reliable result. The trouble with words like reliable arises when people read a sentence like the one above and then try to construct its meaning by considering the converse situation, when power is high. The implication is that when power is high, the rejection of the null hypothesis is "reliable". I have lost count of how many times I have heard psych* people telling me that a result is "reliable", implying that they found something that is true of nature. Even when power is high, you still have a Type I error probability of whatever your $\alpha$ is. So any individual result you get could be an incorrect rejection; it doesn't matter what you think the power is. A further important point is: how do you *know* what power you have? Due to Type S and M errors, you are most likely doing your calculation based on effect estimates from previous, underpowered studies. You are therefore going to be getting gross overestimates of power anyway. Power is a function of the assumed effect size (among other things), and typically you will have a lot of uncertainty about the plausible values of power under different assumptions (after all, you don't *know* what the true effect is, right? If you know already, why are you doing the study?). Giving a student the false security of saying "oh, I have high power, so my result is reliable" is pretty irresponsible and is part of the reason why we keep messing up again and again and again.
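
To make the Type S and M point concrete, here is a minimal simulation sketch in Python. It is not taken from Warne's text or from any particular paper. It assumes that the estimate is normally distributed around a nonzero true effect with a known standard error, that the test is two-sided at $\alpha = 0.05$, and that the effect sizes and the standard error of 0.2 are made-up numbers. The function names `design_analysis` and `analytic_power` are hypothetical, chosen only for illustration.

```python
import random
from statistics import NormalDist, mean


def analytic_power(effect, se, alpha=0.05):
    """Two-sided power of a z-test, treated as a function of the assumed effect."""
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)
    shift = effect / se
    return (1 - z.cdf(z_crit - shift)) + z.cdf(-z_crit - shift)


def design_analysis(true_effect, se, alpha=0.05, n_sims=100_000, seed=1):
    """Simulate hypothetical replications of one design.

    Returns (power, Type S rate, Type M ratio). Assumes a nonzero true
    effect and normally distributed estimates with known standard error.
    """
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    estimates = [rng.gauss(true_effect, se) for _ in range(n_sims)]
    # Keep only the replications that reach significance.
    significant = [e for e in estimates if abs(e) > z_crit * se]
    if not significant:
        raise ValueError("no significant replications; increase n_sims")
    power = len(significant) / n_sims
    # Type S: a significant estimate with the wrong sign.
    type_s = mean(1.0 if e * true_effect < 0 else 0.0 for e in significant)
    # Type M: average exaggeration of the significant estimates.
    type_m = mean(abs(e) for e in significant) / abs(true_effect)
    return power, type_s, type_m


# Hypothetical true effects; the standard error is fixed at 0.2.
for effect in (0.1, 0.3, 0.8):
    power, type_s, type_m = design_analysis(effect, se=0.2)
    # Power you would compute if you plugged in a typical significant estimate.
    naive_power = analytic_power(effect * type_m, se=0.2)
    print(f"effect={effect}: power={power:.2f}, Type S={type_s:.3f}, "
          f"Type M={type_m:.2f}, naive power={naive_power:.2f}")
```

Under these assumptions, the smallest effect gives power below 10%, and the estimates that do reach significance exaggerate the true effect several-fold and sometimes have the wrong sign. Plugging such an exaggerated estimate back into a power calculation makes the same design look far better powered than it is, which is the overestimation problem described in point 2.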