## Tuesday, November 22, 2016

### Statistics textbooks written by non-statisticians: Generally a Bad Idea

A methodologist from psychology called Russell Warne writes on twitter:

It is of course correct that you can usually increase power by increasing sample size.

But a lot of the other stuff in this paragraph is wrong or misleading. If this is an introductory statistics textbook for psychologists, it will cause a lot of harm: a whole new generation of psychologists will emerge with an incorrect understanding of the frequentist point of view to inference. Here are some comments on his text:
1. "When a study has low statistical power, it raises the possibility that any rejection of the null hypothesis is just a fluke, i.e., a Type I error": A fluke rejection of a null hypothesis, isn't that the definition of Type I error? So, low power raises the possibility that a rejection is a Type I error? There is so much wrong here. First of all, Type I error is associated with hypothetical replications of the experiment. It is a statement about the long run repetitions of the procedure, not about the specific experiment you did. You cannot talk of a particular result being a "Type I error" or not. Second, the above sentence says that if power is low, you could end up with an incorrect rejection; the implication is that if power is high, I am unlikely to end up with an incorrect rejection! What the author should have said is that when power is low, by definition the probability of correctly detecting the effect is low. Punkt. Furthermore, the much more alarming consequence of low power is Type S and M errors (see my next point below). I'm surprised that psychologists haven't picked this up yet.
2.  When power is low, "...the study should most likely not have been able to reject the null hypothesis at all. So, when it does reject the null hypothesis, it does not seem like a reliable result": I think that one word that should be banned in psych* is "reliable", it gives people the illusion that they found out something that is true. It is never going to be the case that you can say with 100% certainty that you found out the truth. If reliable means "true, reflecting reality correctly", you will *never* know that you have a reliable result. The trouble with using words like reliable is when people read a sentence like the one above and then try to construct the meaning of the sentence by considering the converse situation, when power is high. The implication is that when power is high, the rejection of the result is "reliable". I have lost count of how many times I have heard psych* people telling me that a result is "reliable", implying that they found something that is true of nature. Even when power is high, you still have a Type I error of whatever your $\alpha$ is. So any individual result you get could be an incorrect rejection; it doesn't matter what you think the power is. A further important point is: how do you *know* what power you have? Due to Type S and M errors, you are most likely doing your calculation based on previous, underpowered studies. You are therefore going to be getting gross overestimates of power anyway. Power is a function, and typically, you will have a lot of uncertainty associated with your estimate of the plausible values of power under different assumptions (after all, you don't *know* what the true effect is, right? If you know already, why are you doing the study?).  Giving a student the false security of saying "oh, I have high power, so my result is reliable" is pretty irresponsible and is part of the reason why we keep messing up again and again and again.

## Tuesday, August 02, 2016

### Two papers, with code: Statistical Methods for Linguistic Research (Parts 1 and 2)

Here are two papers that may be useful for researchers in psychology, linguistics, and cognitive science:

Shravan Vasishth and Bruno Nicenboim. Statistical methods for linguistic research: Foundational Ideas - Part I. Language and Linguistics Compass, 2016. In Press.

Bruno Nicenboim and Shravan Vasishth. Statistical methods for linguistics research: Foundational Ideas - Part II. Language and Linguistics Compass, 2016. In Press.
PDF:

## Wednesday, April 27, 2016

### A simple proof that the p-value distribution is uniform when the null hypothesis is true

[Scroll to graphic below if math doesn't render for you]

Thanks to Mark Andrews for correcting some crucial typos (I hope I got it right this time!).

Thanks also to Andrew Gelman for pointing out that the proof below holds only when the null hypothesis is a point null $H_0: \mu = 0$, and the dependent measure is continuous, such as reading time in milliseconds, or EEG responses.

Someone asked this question in my linear modeling class: why is it that the p-value has a uniform distribution when the null hypothesis is true? The proof is remarkably simple (and is called the probability integral transform).

First, notice that when a random variable Z comes from a $Uniform(0,1)$ distribution, then the probability that $Z$ is less than (or equal to) some value $z$ is exactly $z$: $P(Z\leq z)=z$.

Next, we prove the following proposition:

Proposition:
If a random variable $Z=F(T)$, then $Z \sim Uniform(0,1)$.

Note here that the p-value is a random variable, call it $Z$. The p-value is computed by calculating the probability of seeing a t-statistic or something more extreme under the null hypothesis. The t-statistic comes from a random variable $T$ that is a transformation of the random variable $\bar{X}$: $T=(\bar{X}-\mu)/(\sigma/\sqrt{n})$. This random variable T has a CDF $F$.

So, if we can prove the above proposition, we have shown that the p-value's distribution under the null hypothesis is $Uniform(0,1)$.

Proof:

Let $Z=F(T)$.

$P(Z\leq z) = P(F(T)\leq z) = P(F^{-1} F(T) \leq F^{-1}(z) ) = P(T \leq F^{-1} (z) ) = F(F^{-1}(z))= z$.

Since $P(Z\leq z)=z$, Z is uniformly distributed, that is, Uniform(0,1).

A screengrab in case the above doesn't render:

## Sunday, January 17, 2016

### Automating R exercises and exams using the exams package

It's a pain to design statistics exercises each semester, and because students from previous share old exercises with the new incoming students, it's hard to design simple exercises that students haven't already seen the answers to. On top of that, some students try to cheat during the exam by looking over the shoulder of their neighbors. Homework exercises almost always involve collaboration even if you prohibit it.

It turns out that you can automate the generation of fixed-format exercises (with different numerical answers being required each time). You can also randomly select questions from a question bank you create yourself. And you can even create a unique question paper for each student in an exam, to make cheating between neighbors essentially impossible (even if they copy the correct answer to question 2 from a neighbor, they end up answering the wrong question on their own paper).

All this magic is made possible by the exams package in R. The documentation is of course comprehensive and there is a journal article explaining everything:
Achim Zeileis, Nikolaus Umlauf, Friedrich Leisch (2014). Flexible Generation of E-Learning Exams in R: Moodle Quizzes, OLAT Assessments, and Beyond. Journal of Statistical Software 58(1), 1-36. URL http://www.jstatsoft.org/v58/i01/.
I also use this package to deliver auto-graded exercises to students over datacamp.com. See here for the course I teach, and here for the datacamp exercises.

Here is a quick example to get people started on designing their own customized, automated exams. In my example below, there are several files you need.

1. The template files for your exam (what your exam or homework sheet will look like), and the solutions file. I provide two example files: test.tex and solutiontest.tex

2. The exercises or exam questions themselves: I provide two as examples. The first file is called pnorm1.Rnw. It's an Sweave file, and it contains the code for generating a random problem and for generating its solution. The code should be self-explanatory. The second file is called sesamplesize1multiplechoice.Rnw and has a multiple choice question.

3.  The exam generating R code file: The code is commented and self-explanatory. It will generate the exercises, randomize the order of presentation (if there are two or more exercises), and generate a solutions file. The output will be a single or multiple exam papers (depending on how many versions you wanted generated), and the solutions file(s).  Notice the cool thing that even in my example, with only one question, the two versions of the exams have different numbers, so two people cannot collaborate and consult each other and just write down one answer.  Each student could in principle be given a unique set of exercises, although it would be a lot of work to grade it if you do it manually.

Here is the exam generating code:

Save from the gists given above (a) the test.tex and solutiontest.tex files, (b) the Rnw files containing the exercise (pnorm1.Rnw, and sesamplesize1multiplechoice.Rnw), and (c) the exam generating code (ExampleExamCode.R).  Put all of these into your working directory, say ExampleExam. Then run the R code, and be amazed.

If something is broken in my example, please let me know.

Shuffling questions: If you want to reorder the questions in each run of the R code, just change myexamlist to sample(myexamlist) in the call below that appears in the file ExampleExamCode.R:

sol <- exams(sample(myexamlist), n = num.versions,
dir = odir, template = c("test", "solutiontest"),
nsamp=1,
header = list(ID = getID, Date = Sys.Date()))


## Wednesday, January 06, 2016

### My MSc thesis: A meta-analysis of relative clause processing in Mandarin Chinese using bias modelling

Here is my MSc thesis, which was submitted to the University of Sheffield in September 2015.

The pdf is here.

Title: A Meta-analysis of Relative Clause Processing in Mandarin Chinese using Bias Modelling

Abstract
The reading difficulty associated with Chinese relative clauses presents an important empirical problem for psycholinguistic research on sentence comprehension processes. Some studies show that object relatives are easier to process than subject relatives, while others show the opposite pattern. If Chinese has an object relative advantage, this has important implications for theories of reading comprehension.  In order to clarify the facts about Chinese, we carried out a Bayesian random-effects meta-analysis using 15 published studies; this analysis showed that the posterior probability of a subject relative advantage is approximately $0.77$ (mean $16$, 95% credible intervals $-29$ and $61$ ms). Because the studies had significant biases, it is possible that they may have confounded the results. Bias modelling is a potentially important tool in such situations because it uses expert opinion to incorporate the biases in the model. As a proof of concept, we first identified biases in five of the fifteen studies, and elicited priors on these using the SHELF framework. Then we fitted a random-effects meta-analysis, including priors on biases. This analysis showed a stronger posterior probability ($0.96$) of a subject relative advantage compared to the standard random-effects meta-analysis (mean $33$, credible intervals $-4$ and $71$).

## Saturday, December 19, 2015

### Best statistics-related comment ever from a reviewer

This is the most interesting comment I have ever received from CUNY conference reviewing. It nicely illustrates the state of our understanding of statistical theory in psycholinguistics:

"I had no idea how many subjects each study used. Were just one or two people
used? ... Generally, I wasn't given enough data to determine my confidence in the provided t-values (which depends on the degrees of freedom involved)."

## Thursday, August 27, 2015

### Five thirty-eight provides a brand new definition of the p-value

The Five-Thirty Eight blog provides a brand new definition of the p-value:

"A p-value is simply the probability of getting a result at least as extreme as the one you saw if your hypothesis is false."

I thought this blog was run by Nate Silver, a statistician?

### Observed vs True Statistical Power, and the power inflation index

People (including me) routinely estimate statistical power for future studies using a pilot study's data or a previously published study's data (or perhaps using the predictions from a computational model, such as Engelmann et al 2015).

Indeed, the author of the Replicability Index has been using observed power to determine the replicability of journal articles. His observed power estimates are HUGE (in the range of 0.75) and seem totally implausible to me, given the fact that I can hardly ever replicate my studies.

This got me thinking: Gelman and Carlin have shown that when power is low, Type M error will be high. That is, the observed effects will tend to be highly exaggerated. The issue with Type M error is easy to visualize.

Suppose that a particular study has standard error 46, and sample size 37; this implies that standard deviation is $46\times \sqrt{37}= 279$. These are representative numbers from psycholinguistic studies. Suppose also that we know that the true effect (the absolute value, say on the millisecond scale for a reading study---thanks to Fred Hasselman) is D=15. Then, we can compute Type S and Type M errors for replications of this particular study by repeatedly sampling from the true distribution.

We can visualize the exaggerated effects under low power as follows (see below): On the x-axis you see the effect magnitudes, and on the y-axis is power. The red line is the power line of 0.20, which based on my own attempts at replicating my own studies (and mostly failing), I estimate to be an upper bound of the power of experiments in psycholinguistics (this is an upper bound, I think a more common value will be closer to 0.05). All those dots below the red line are exaggerated estimates in a low power situation, and if you were to use any of those points to estimate observed power, you would get a wildly optimistic power estimate which has no bearing with reality.

What does this fact about Type M error imply for Replicability Index's calculations? It implies that if power is in fact very low, and if journals are publishing larger-than-true effect sizes (and we know that they have an incentive to do so, because editors and reviewers mistakenly think that lower p-values, i.e., bigger absolute t-values, give stronger evidence for the specific alternative hypothesis of interest), then Replicability Index is possibly hugely overestimating power, and therefore hugely overestimating replicability of results.

I came up with the idea of framing this overestimation in terms of Type M error by defining something called a power inflation index. Here is how it works. For different "true" power levels, we repeatedly sample data, and compute observed power each time. Then, for each "true" power level, we can compute the ratio of the observed power to the true power in each case. The mean of this ratio is the power inflation index, and the 95% confidence interval around it gives us an indication (sorry Richard Morey! I know I am abusing the meaning of CI here and treating it like a credible interval!) of how badly we could overestimate power from a small sample study.

Here is the code for simulating and visualizing the power inflation index:

And here is the visualization:

What we see here is that if true power is as low as 0.05 (and we can never know that it is not since we never know the true effect size!), then using observed power can lead to gross overestimates by a factor of approximately 10! So, if Replicability Index reports an observed power of 0.75, what he might actually be looking at is an inflated estimate where true power is 0.08.

In summary, we can never know true power, and if we are estimating it using observed power conditional on true power being extremely low, we are likely to hugely overestimate power.

One way to test my claim is to actually try to replicate the studies that Replicability Index predicts has high replicability. My prediction is that his estimates will be wild overestimates and most studies will not replicate.

Postscript

A further thing that worries me about Replicability Index is his sloppy definitions of statistical terms. Here is how he defines power:

"Power is defined as the long-run probability of obtaining significant results in a series of exact replication studies. For example, 50% power means that a set of 100 studies is expected to produce 50 significant results and 50 non-significant results."

[Thanks to Karthik Durvasula for correcting my statement below!]
By not defining power of a test of a null hypothesis $H_0: \mu=k$, as the probability of rejecting the null hypothesis (as a function of different alternative $\mu$ such that $\mu\neq k$) when it is false, what this definition literally means is that if I sample from any distribution, including one where $\mu=0$, the probability of obtaining a significant result under repeated sampling is the power. Which of course is completely false.

Post-Post Script

Replicability Index points out in a tweet that his post-hoc power estimation corrects for inflation. But post-hoc power corrected for inflation requires knowledge of the true power, which is what we are trying to get at in the first place. How do you deflate "observed" power when you don't know what the true power is? Maybe I am missing something.