<b>Shravan Vasishth's Slog (Statistics blog)</b><br />This blog is a repository of cool things relating to statistical computing, simulation and stochastic modeling.<br /><br /><b>Statistics textbooks written by non-statisticians: Generally a Bad Idea</b> (2016-11-22)<br /><br />A methodologist from psychology called Russell Warne writes on twitter: <br /><br /><blockquote class="twitter-tweet" data-lang="en"><div dir="ltr" lang="en">Paragraph in my upcoming introductory <a href="https://twitter.com/hashtag/statistics?src=hash">#statistics</a> textbook that I wouldn't have written if I hadn't followed <a href="https://twitter.com/R__INDEX">@R__INDEX</a> on Twitter. <a href="https://t.co/Bup9qcFzJg">pic.twitter.com/Bup9qcFzJg</a></div>— Russell Warne (@Russwarne) <a href="https://twitter.com/Russwarne/status/800933552023965697">November 22, 2016</a></blockquote><br /><script async="" charset="utf-8" src="//platform.twitter.com/widgets.js"></script><br /><br />It is of course correct that you can usually increase power by increasing sample size. <br /><br />But much of the rest of this paragraph is wrong or misleading. If this is an introductory statistics textbook for psychologists, it will cause a lot of harm: a whole new generation of psychologists will emerge with an incorrect understanding of frequentist inference. Here are some comments on his text:<br /><ol><li>"<b>When a study has low statistical power, it raises the possibility that any rejection of the null hypothesis is just a fluke, i.e., a Type I error</b>": A fluke rejection of the null hypothesis, isn't that the definition of a Type I error? So, low power raises the possibility that a rejection is a Type I error? There is so much wrong here. 
First of all, a Type I error is associated with hypothetical replications of the experiment. It is a statement about the long-run repetitions of the procedure, not about the specific experiment you did. You cannot talk of a particular result being a "Type I error" or not. Second, the above sentence says that if power is low, you could end up with an incorrect rejection; the implication is that if power is high, you are unlikely to end up with an incorrect rejection! What the author should have said is that when power is low, by definition the probability of correctly detecting the effect is low. Period. Furthermore, the much more alarming consequence of low power is Type S and M errors (see my next point below). I'm surprised that psychologists haven't picked this up yet.</li><li> When power is low, "<b>...the study should most likely not have been able to reject the null hypothesis at all</b>. <b>So, when it does reject the null hypothesis, it does not seem like a reliable result</b>": I think that one word that should be banned in psych* is "reliable"; it gives people the illusion that they found out something that is true. It is never going to be the case that you can say with 100% certainty that you found out the truth. If reliable means "true, reflecting reality correctly", you will *never* know that you have a reliable result. The trouble with words like "reliable" is that people read a sentence like the one above and then construct its meaning by considering the converse situation, when power is high. The implication is that when power is high, the rejection is "reliable". I have lost count of how many times I have heard psych* people telling me that a result is "reliable", implying that they found something that is true of nature. Even when power is high, you still have a Type I error probability of whatever your $\alpha$ is. 
So any individual result you get could be an incorrect rejection; it doesn't matter what you think the power is. A further important point: how do you *know* what power you have? Due to Type S and M errors, you are most likely doing your calculation based on previous, underpowered studies, and you are therefore going to get gross overestimates of power anyway. Power is a function, and typically you will have a lot of uncertainty associated with your estimates of the plausible values of power under different assumptions (after all, you don't *know* what the true effect is, right? If you knew already, why would you be doing the study?). Giving a student the false security of saying "oh, I have high power, so my result is reliable" is pretty irresponsible and is part of the reason why we keep messing up again and again.</li></ol><br /><b>Two papers, with code: Statistical Methods for Linguistic Research (Parts 1 and 2)</b> (2016-08-02)<br /><br />Here are two papers that may be useful for researchers in psychology, linguistics, and cognitive science:<br /><br />Shravan Vasishth and Bruno Nicenboim. <b>Statistical methods for linguistic research: Foundational Ideas - Part I</b>. <i>Language and Linguistics Compass</i>, 2016. In Press.<br />PDF: <a href="http://bit.ly/VasNicPart1">http://bit.ly/VasNicPart1</a><br />Code: <a href="http://bit.ly/VasNicPart1Code">http://bit.ly/VasNicPart1Code</a><br /><br />Bruno Nicenboim and Shravan Vasishth. <b>Statistical methods for linguistic research: Foundational Ideas - Part II</b>. <i>Language and Linguistics Compass</i>, 2016. In Press.<br />PDF: <a href="http://bit.ly/NicVasPart2">http://bit.ly/NicVasPart2</a><br />Code: <a href="http://bit.ly/NicVasPart2Code" target="_blank">http://bit.ly/NicVasPart2Code</a><br /><br /><b>A simple proof that the p-value distribution is uniform when the null hypothesis is true</b> (2016-04-27)<br /><br />[Scroll to graphic below if math doesn't render for you]<br /><br />Thanks to Mark Andrews for correcting some crucial typos (I hope I got it right this time!). <br /><br />Thanks also to Andrew Gelman for pointing out that the proof below holds only when the null hypothesis is a point null $H_0: \mu = 0$, and the dependent measure is continuous, such as reading time in milliseconds, or EEG responses.<br /><br />Someone asked this question in my linear modeling class: why does the p-value have a uniform distribution when the null hypothesis is true? The proof is remarkably simple (and is called the probability integral transform).<br /><br />First, notice that when a random variable Z comes from a $Uniform(0,1)$ distribution, the probability that $Z$ is less than (or equal to) some value $z$ is exactly $z$: $P(Z\leq z)=z$.<br /><br />Next, we prove the following proposition:<br /><br /><b>Proposition</b>: <br />If a random variable $Z=F(T)$, where $F$ is the continuous, strictly increasing CDF of the random variable $T$, then $Z \sim Uniform(0,1)$.<br /><br />Note here that the p-value is a random variable; call it $Z$. The p-value is computed by calculating the probability of seeing a t-statistic or something more extreme under the null hypothesis. The t-statistic comes from a random variable $T$ that is a transformation of the random variable $\bar{X}$: $T=(\bar{X}-\mu)/(\sigma/\sqrt{n})$. This random variable T has a CDF $F$. (Strictly speaking, the p-value is $1-F(T)$ for a one-sided test, but if $F(T)$ is $Uniform(0,1)$, then so is $1-F(T)$.)<br /><br />So, if we can prove the above proposition, we have shown that the p-value's distribution under the null hypothesis is $Uniform(0,1)$. <br /><br /><b>Proof</b>: <br /><br />Let $Z=F(T)$.<br /><br />$P(Z\leq z) = P(F(T)\leq z) = P(F^{-1} F(T) \leq F^{-1}(z) )<br />= P(T \leq F^{-1} (z) )<br />= F(F^{-1}(z))= z$. 
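This chain of equalities is easy to check by simulation. Here is a minimal sketch (in Python, for concreteness; it uses a z-statistic with known $\sigma$ rather than a t-statistic, which keeps the code self-contained): sample repeatedly under the point null, compute two-sided p-values, and check that $P(Z \leq z)$ comes out close to $z$.

```python
import math
import random

random.seed(1)
n, nsim = 20, 50000
sigma = 1.0  # known standard deviation; point null H0: mu = 0

def phi(x):
    """Standard normal CDF."""
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

pvals = []
for _ in range(nsim):
    xbar = sum(random.gauss(0, sigma) for _ in range(n)) / n
    z = xbar / (sigma / math.sqrt(n))    # test statistic under H0
    pvals.append(2 * (1 - phi(abs(z))))  # two-sided p-value

# If the p-value is Uniform(0,1), then P(p <= cut) should be close to cut:
for cut in (0.05, 0.5, 0.95):
    print(cut, sum(p <= cut for p in pvals) / nsim)
```

With 50,000 simulated experiments, the observed proportions land very close to the cutoffs, as the uniform distribution predicts.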
<br /><br />Since $P(Z\leq z)=z$, $Z$ is uniformly distributed, that is, $Uniform(0,1)$.<br /><br />A screengrab in case the above doesn't render: <br /><br /><div class="separator" style="clear: both; text-align: center;"><a href="https://3.bp.blogspot.com/-LvlznLzVFhk/VyImfsyyeOI/AAAAAAAAAdg/C3HUfgs9-IsaUh_-Xsedi2dJoQZ0A5dOgCLcB/s1600/pvals.tiff" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="128" src="https://3.bp.blogspot.com/-LvlznLzVFhk/VyImfsyyeOI/AAAAAAAAAdg/C3HUfgs9-IsaUh_-Xsedi2dJoQZ0A5dOgCLcB/s640/pvals.tiff" width="640" /></a></div><br /><br /><b>Automating R exercises and exams using the exams package</b> (2016-01-17)<br /><br />It's a pain to design statistics exercises each semester, and because students from previous semesters share old exercises with the new incoming students, it's hard to design simple exercises that students haven't already seen the answers to. On top of that, some students try to cheat during the exam by looking over the shoulders of their neighbors. Homework exercises almost always involve collaboration even if you prohibit it.<br /><br />It turns out that you can automate the generation of fixed-format exercises (with different numerical answers being required each time). You can also randomly select questions from a question bank you create yourself. And you can even create a unique question paper for each student in an exam, making cheating between neighbors essentially impossible (even if they copy the correct answer to question 2 from a neighbor, they end up answering the wrong question on their own paper).<br /><br />All this magic is made possible by <a href="https://cran.r-project.org/web/packages/exams/index.html">the exams package in R</a>. 
The documentation is of course comprehensive, and there is a journal article explaining everything:<br /><blockquote>Achim Zeileis, Nikolaus Umlauf, Friedrich Leisch (2014). Flexible Generation of E-Learning Exams in R: Moodle Quizzes, OLAT Assessments, and Beyond. <i>Journal of Statistical Software</i> 58(1), 1-36. URL <a href="http://www.jstatsoft.org/v58/i01/">http://www.jstatsoft.org/v58/i01/</a>. </blockquote>I also use this package to deliver auto-graded exercises to students over datacamp.com. See <a href="http://www.ling.uni-potsdam.de/~vasishth/statistics/ESSLLI2015Vasishth.html" target="_blank">here</a> for the course I teach, and <a href="https://www.datacamp.com/courses/statistical-methods-for-linguistic-research-foundational-ideas" target="_blank">here</a> for the datacamp exercises.<br /><br />Here is a quick example to get people started on designing their own customized, automated exams. In my example below, there are several files you need. <br /><br />1. <b>The template files for your exam</b> (what your exam or homework sheet will look like), and the solutions file. I provide two example files: <a href="https://gist.github.com/vasishth/226e170ed856f51b9990" target="_blank">test.tex</a> and <a href="https://gist.github.com/vasishth/cba616388c4bc62263f0" target="_blank">solutiontest.tex</a><br /><br />2. <b>The exercises or exam questions themselves</b>: I provide two as examples. The first file is called <a href="https://gist.github.com/vasishth/ec4dd4afca10f6e8661f" target="_blank">pnorm1.Rnw</a>. 
It's an Sweave file, and it contains the code for generating a random problem and for generating its solution. The code should be self-explanatory. The second file is called <a href="https://gist.github.com/vasishth/8b082454bc079ed5b47b" target="_blank">sesamplesize1multiplechoice.Rnw</a> and has a multiple choice question.<br /><br />3. <b>The exam generating R code file</b>: The code is commented and self-explanatory. It will generate the exercises, randomize the order of presentation (if there are two or more exercises), and generate a solutions file. The output will be one or more exam papers (depending on how many versions you want generated), plus the solutions file(s). Notice the cool thing that even in my example, with only one question, the two versions of the exam have different numbers, so two students cannot simply consult each other and write down one answer. Each student could in principle be given a unique set of exercises, although it would be a lot of work to grade if done manually.<br /><br /><a href="https://gist.github.com/vasishth/3b767e39dba9fc2df65b" target="_blank">Here</a> is the exam generating code:<br /><br />Save from the gists given above (a) the test.tex and solutiontest.tex files, (b) the Rnw files containing the exercises (pnorm1.Rnw and sesamplesize1multiplechoice.Rnw), and (c) the exam generating code (ExampleExamCode.R). Put all of these into your working directory, say ExampleExam. 
Then run the R code, and be amazed.<br /><br />If something is broken in my example, please let me know.<br /><br /><b>Shuffling questions</b>: If you want to reorder the questions in each run of the R code, just change myexamlist to sample(myexamlist) in the call below, which appears in the file ExampleExamCode.R:<br /><br /><pre>
sol <- exams(sample(myexamlist), n = num.versions,
             dir = odir, template = c("test", "solutiontest"),
             nsamp = 1,
             header = list(ID = getID, Date = Sys.Date()))
</pre><br /><br /><b>My MSc thesis: A meta-analysis of relative clause processing in Mandarin Chinese using bias modelling</b> (2016-01-06)<br /><br />Here is my MSc thesis, which was submitted to the University of Sheffield in September 2015. <br /><br />The pdf is <a href="http://www.ling.uni-potsdam.de/~vasishth/pdfs/VasishthMScStatistics.pdf">here</a>.<br /><br /><b>Title</b>: A Meta-analysis of Relative Clause Processing in Mandarin Chinese using Bias Modelling <br /><br /><b>Abstract</b><br />The reading difficulty associated with Chinese relative clauses presents an important empirical problem for psycholinguistic research on sentence comprehension processes. Some studies show that object relatives are easier to process than subject relatives, while others show the opposite pattern. If Chinese has an object relative advantage, this has important implications for theories of reading comprehension. In order to clarify the facts about Chinese, we carried out a Bayesian random-effects meta-analysis using 15 published studies; this analysis showed that the posterior probability of a subject relative advantage is approximately $0.77$ (mean $16$ ms, 95% credible interval $-29$ to $61$ ms). 
Because the studies had significant biases, it is possible that these biases confounded the results. Bias modelling is a potentially important tool in such situations because it uses expert opinion to incorporate the biases into the model. As a proof of concept, we first identified biases in five of the fifteen studies, and elicited priors on these using the SHELF framework. Then we fitted a random-effects meta-analysis, including priors on the biases. This analysis showed a stronger posterior probability ($0.96$) of a subject relative advantage compared to the standard random-effects meta-analysis (mean $33$ ms, credible interval $-4$ to $71$ ms). <br /><br /><b>Best statistics-related comment ever from a reviewer</b> (2015-12-19)<br /><br />This is the most interesting comment I have ever received from CUNY conference reviewing. It nicely illustrates the state of our understanding of statistical theory in psycholinguistics:<br /><br />"I had no idea how many subjects each study used. Were just one or two people used? ... Generally, I wasn't given enough data to determine my confidence in the provided t-values (which depends on the degrees of freedom involved)."<br /><br /><b>Five thirty-eight provides a brand new definition of the p-value</b> (2015-08-27)<br /><br />The FiveThirtyEight blog provides a brand new definition of the p-value: <br /><a href="http://fivethirtyeight.com/datalab/psychology-is-starting-to-deal-with-its-replication-problem/?ex_cid=538twitter">http://fivethirtyeight.com/datalab/psychology-is-starting-to-deal-with-its-replication-problem/?ex_cid=538twitter</a><br /><br />"A p-value is simply the probability of getting a result at least as extreme as the one you saw if your hypothesis is false."<br /><br />I thought this blog was run by Nate Silver, a statistician?<br /><br /><b>Observed vs True Statistical Power, and the power inflation index</b> (2015-08-27)<br /><br />People (including me) routinely estimate statistical power for future studies using a pilot study's data or a previously published study's data (or perhaps using the predictions from a computational model, such as <a href="http://www.ling.uni-potsdam.de/~engelmann/publications/EngelmannEtAl_JML_subm_150825.doc.pdf">Engelmann et al 2015</a>).<br /><br />Indeed, the author of the <a href="https://replicationindex.wordpress.com/">Replicability Index</a> has been using observed power to determine the replicability of journal articles. 
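"Observed power" here means plugging the observed effect (and its standard error) back into the power formula, as if the observed effect were the true one. A minimal sketch of how such a number is computed (in Python, with hypothetical numbers not taken from any specific study):

```python
import math

def phi(x):
    """Standard normal CDF."""
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

def power(effect, se, zcrit=1.96):
    """Power of a two-sided z-test for a given true effect and standard error."""
    d = effect / se
    return (1 - phi(zcrit - d)) + phi(-zcrit - d)

# "Observed power": plug the *observed* effect back into the power formula.
# Hypothetical numbers: observed effect of 102 ms with standard error 46.
obs_power = power(102, 46)
print(round(obs_power, 2))
```

Note that an effect that is just barely significant ($z = 1.96$) gives an observed power of about 0.5, and any published effect larger than that pushes the number higher still, regardless of what the true power is.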
His observed power estimates are HUGE (in the range of 0.75) and seem totally implausible to me, given the fact that I can hardly ever replicate my studies. <br /><br />This got me thinking: <a href="http://www.stat.columbia.edu/~gelman/research/published/PPS551642_REV2.pdf">Gelman and Carlin</a> have shown that when power is low, Type M error will be high. That is, the observed effects will tend to be highly exaggerated. The issue with Type M error is easy to visualize.<br /><br />Suppose that a particular study has standard error 46, and sample size 37; this implies that standard deviation is $46\times \sqrt{37}= 279$. These are representative numbers from psycholinguistic studies. Suppose also that we know that the true effect (the absolute value, say on the millisecond scale for a reading study---<a href="https://twitter.com/FredHasselman/status/636919443298385920">thanks to Fred Hasselman</a>) is D=15. Then, we can compute Type S and Type M errors for replications of this particular study by repeatedly sampling from the true distribution.<br /><br />We can visualize the exaggerated effects under low power as follows (see below): On the x-axis you see the effect magnitudes, and on the y-axis is power. The red line is the power line of 0.20, which based on my own attempts at replicating my own studies (and mostly failing), I estimate to be an upper bound of the power of experiments in psycholinguistics (this is an upper bound, I think a more common value will be closer to 0.05). 
All those dots below the red line are exaggerated estimates in a low power situation, and if you were to use any of those points to estimate observed power, you would get a wildly optimistic power estimate that has no bearing on reality.<br /><br /><br /><div class="separator" style="clear: both; text-align: center;"><a href="http://4.bp.blogspot.com/-2xynHRLiIf4/Vd8LT9obrlI/AAAAAAAAAbg/MoaD_f_PTa0/s1600/funnelplot.tiff" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="189" src="http://4.bp.blogspot.com/-2xynHRLiIf4/Vd8LT9obrlI/AAAAAAAAAbg/MoaD_f_PTa0/s320/funnelplot.tiff" width="320" /></a></div><br />What does this fact about Type M error imply for Replicability Index's calculations? It implies that if power is in fact very low, and if journals are publishing larger-than-true effect sizes (and we know that they have an incentive to do so, because editors and reviewers mistakenly think that lower p-values, i.e., bigger absolute t-values, give stronger evidence for the specific alternative hypothesis of interest), then Replicability Index is possibly hugely overestimating power, and therefore hugely overestimating the replicability of results. <br /><br />I came up with the idea of framing this overestimation in terms of Type M error by defining something called a <b>power inflation index</b>. Here is how it works. For different "true" power levels, we repeatedly sample data and compute observed power each time. Then, for each "true" power level, we compute the ratio of observed power to true power. The mean of this ratio is the power inflation index, and the 95% confidence interval around it gives us an indication (sorry Richard Morey! I know I am abusing the meaning of CI here and treating it like a credible interval!) 
of how badly we could overestimate power from a small sample study.<br /><br />Here is the code for simulating and visualizing the power inflation index:<br /><br /><script src="https://gist.github.com/vasishth/69020cc596568e11169e.js"></script><br /><br />And here is the visualization:<br /><br /><div class="separator" style="clear: both; text-align: center;"><a href="http://3.bp.blogspot.com/-a2eQpPAA9a8/Vd8NajsNy-I/AAAAAAAAAbs/ClMeNj8z9h0/s1600/pii.tiff" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="241" src="http://3.bp.blogspot.com/-a2eQpPAA9a8/Vd8NajsNy-I/AAAAAAAAAbs/ClMeNj8z9h0/s320/pii.tiff" width="320" /></a></div>What we see here is that if true power is as low as 0.05 (and we can never know that it is not, since we never know the true effect size!), then using observed power can lead to gross overestimates by a factor of approximately 10! So, if Replicability Index reports an observed power of 0.75, what he might actually be looking at is an inflated estimate where true power is 0.08.<br /><br />In summary, we can never know true power, and if we are estimating it using observed power conditional on true power being extremely low, we are likely to hugely overestimate power.<br /><br />One way to test my claim is to actually try to replicate the studies that Replicability Index predicts will have high replicability. My prediction is that his estimates will be wild overestimates and most studies will not replicate. <br /><br /><b>Postscript</b><br /><br />A further thing that worries me about Replicability Index is his sloppy definitions of statistical terms.
<a href="https://replicationindex.wordpress.com/tag/observed-power/">Here</a> is how he defines power:<br /><span style="-webkit-text-stroke-width: 0px; background-color: white; color: #141412; display: inline !important; float: none; font-family: 'Source Sans Pro', Helvetica, sans-serif; font-size: 16px; font-style: normal; font-variant: normal; font-weight: normal; letter-spacing: normal; line-height: 24px; orphans: auto; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; widows: 1; word-spacing: 0px;">"Power is defined as the long-run probability of obtaining significant results in a series of exact replication studies. For example, 50% power means that a set of 100 studies is expected to produce 50 significant results and 50 non-significant results."</span><br /><br />[Thanks to <a href="https://www.msu.edu/~durvasul/Hello!.html">Karthik Durvasula</a> for correcting my statement below!]<br />By not defining the power of a test of a null hypothesis $H_0: \mu=k$ as the probability of rejecting the null hypothesis <i>when it is false</i> (as a function of the alternatives $\mu$ such that $\mu\neq k$), this definition literally implies that if I sample from any distribution at all, including one where the null is true, the probability of obtaining a significant result under repeated sampling is the power. That is of course completely false.
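<br /><br />The missing clause matters. Under the definition as quoted, sampling from the null distribution itself would count as "power"; but the long-run probability of a significant result when $\mu=k$ is just the Type I error rate $\alpha$. A minimal sketch for a two-sided z-test with known standard error (my own illustration in Python, not Replicability Index's code; the SE of 46 echoes the example earlier in this post):

```python
import math

def Phi(x):
    """Standard normal CDF."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def rejection_prob(mu, k, se):
    """P(reject H0: mu = k | true mean mu), two-sided z-test, alpha = 0.05."""
    m = (mu - k) / se
    return Phi(-1.96 - m) + 1.0 - Phi(1.96 - m)

# Sampling from the null itself: the long-run probability of a
# significant result is just alpha, not "power".
print(round(rejection_prob(mu=0, k=0, se=46), 3))   # 0.05
# Power is the same probability evaluated at an alternative mu != k:
print(round(rejection_prob(mu=15, k=0, se=46), 3))  # low power for a small effect
```

With SE = 46, an estimate has to exceed about 90 in absolute value to reach significance, which is why the power at a true effect of 15 is barely above $\alpha$.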
<br /><br /><b>Post-Post Script</b><br /><br />Replicability Index points out <a href="https://twitter.com/R__INDEX/status/636907324154728448">in a tweet</a> that his post-hoc power estimation corrects for inflation. But post-hoc power corrected for inflation requires knowledge of the true power, which is what we are trying to get at in the first place. How do you deflate "observed" power when you don't know what the true power is? Maybe I am missing something.<br /><br />Shravan Vasishth<br /><br /><b>Some reflections on teaching frequentist statistics at ESSLLI 2015</b> (2015-08-17)<br /><br />I spent the last two weeks teaching frequentist and Bayesian statistics at the <a href="http://parles.upf.edu/llocs/esslli/welcome-esslli-2015">European Summer School in Logic, Language, and Information (ESSLLI)</a> in Barcelona, at the beautiful and centrally located Pompeu Fabra University. The course web page for the first week is <a href="http://parles.upf.edu/llocs/esslli/content/statistical-methods-linguistic-research-foundational-ideas">here</a>, and the web page for the second course is <a href="http://parles.upf.edu/llocs/esslli/content/statistical-methods-linguistic-research-advanced-tools">here</a>.<br /><br />All materials for the first week are available on github, see <a href="https://github.com/vasishth/ESSLLI2015Vasishth_Week1">here</a>. <br /><br />The frequentist course went well, but the Bayesian course was a bit unsatisfactory; perhaps my greater experience in teaching the frequentist stuff played a role (I have only taught Bayes for three years).
I've been writing and rewriting my slides and notes for frequentist methods since 2002, and it is only now that I can present the basic ideas in five 90-minute lectures; with Bayes, the presentation is more involved and I need to plan more carefully, interspersing on-the-spot exercises to solidify ideas. I will comment on the Bayesian Data Analysis course in a subsequent post.<br /><br />The first week (five 90-minute lectures) covered the basic concepts in frequentist methods. The audience was amazing; I wish I always had students like these in my classes. They were attentive, and anticipated each subsequent development. This was the typical ESSLLI crowd, and this is why teaching at ESSLLI is so satisfying. There were also several senior scientists in the class, so hopefully they will go back and correct the misunderstandings among their students about what all this Null Hypothesis Significance Testing stuff gives you (short answer: it answers *a* question very well, but it's the wrong question, one that is not relevant to your research question).<br /><br />I won't try to summarize my course, because the web page is online and you can also do exercises on datacamp to check your understanding of statistics (see <a href="http://www.ling.uni-potsdam.de/~vasishth/statistics/ESSLLI2015Vasishth.html">here</a>). You get immediate feedback on your attempts.
<br /><br />Stepping away from the technical details, I tried to make three broad points: <br /><br />First, I spent a lot of time trying to clarify what a p-value is and isn't, focusing particularly on the issue of Type S and Type M errors, which Gelman and Carlin have discussed in <a href="http://www.stat.columbia.edu/~gelman/research/published/retropower_final.pdf">their excellent paper</a>.<br /><br /> Here is the way that I visualized the problems of Type S and Type M errors:<br /><br /><div class="separator" style="clear: both; text-align: center;"><a href="http://2.bp.blogspot.com/-iHiZIMXegDQ/VdGYcEMgdUI/AAAAAAAAAa4/LSC6x9Hs8DA/s1600/smerrors.tiff" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="170" src="https://2.bp.blogspot.com/-iHiZIMXegDQ/VdGYcEMgdUI/AAAAAAAAAa4/LSC6x9Hs8DA/s320/smerrors.tiff" width="320" /></a></div>What we see here is repeated samples from a Normal distribution with true mean 15 and a typical standard deviation seen in psycholinguistic studies (see slide 42 of my slides for lecture2). The horizontal red line marks the 20% power line; most psycholinguistic studies fall below that line in terms of power. The dramatic consequence of this low power is the hugely exaggerated effects (which tend to get published in major journals because they also have low p-values) and the remarkable proportion of cases where the sample mean is on the wrong side of the true value 15. So, you are roughly equally likely to get a significant effect with a sample mean anywhere from smaller to much smaller than the true mean, or from larger to much larger than it. Regardless of whether you get a significant result or not, if power is low, and it is in most studies I see in journals, you are just farting in a puddle.<br /><br />It is worth repeating this: once one considers Type S and Type M errors, even statistically significant results become irrelevant if power is low.
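<br /><br />The pattern in the figure can be reproduced with a small simulation: sample repeated estimates of a true 15 ms effect at different standard errors (i.e., at different power levels) and inspect the statistically significant estimates. This is a sketch in Python rather than R, and the SE values are made up for illustration; Gelman and Carlin's paper gives the general treatment:

```python
import random

random.seed(1)

D = 15.0  # true effect in ms
for se in (40.0, 20.0, 5.0):  # illustrative SEs, from low to high power
    sig = []
    for _ in range(50_000):
        est = random.gauss(D, se)      # one study's estimate
        if abs(est) > 1.96 * se:       # "significant" at alpha = 0.05
            sig.append(est)
    power = len(sig) / 50_000
    type_s = sum(e < 0 for e in sig) / len(sig)       # wrong-sign rejections
    type_m = sum(abs(e) for e in sig) / len(sig) / D  # exaggeration ratio
    print(f"SE {se:>4}: power {power:.2f}, Type S {type_s:.3f}, Type M {type_m:.1f}")
```

At the low-power end the significant estimates overstate the true 15 ms effect several-fold and a noticeable fraction of them have the wrong sign; as the SE shrinks, the sign errors vanish and the exaggeration ratio approaches 1.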
It seems like these ideas are forever going to be beyond the comprehension of researchers in linguistics and psychology, who are trained to make binary decisions based on p-values, weirdly accepting the null if p is greater than 0.05 and, just as weirdly, accepting their <i>favored</i> alternative if p is less than 0.05. The p-value is a truly interesting animal; it seems that <a href="http://www.psicothema.com/psicothema.asp?id=4266">a recent survey</a> of some 400 Spanish psychologists found that, despite their being active in the field for quite a few years on average, they had close to zero understanding of what a p-value gives you. Editors of top journals in psychology routinely favor lower p-values, because they mistakenly think this makes "the result" more convincing; "the result" is the favored alternative. So even seasoned psychologists (and I won't even get started with linguists, because we are much, much worse), with decades of experience behind them, often have no idea what the p-value actually tells you.<br /><br />A remarkable misunderstanding regarding p-values is the claim that the p-value tells you whether the effect was "by chance".
Here is an example from <a href="https://replicationindex.wordpress.com/tag/retraction/">Replication Index's blog</a>:<br /><br /><i><span style="background-color: white; color: #141412; display: inline; float: none; font-family: "source sans pro" , "helvetica" , sans-serif; font-size: 16px; font-style: normal; font-variant: normal; font-weight: normal; letter-spacing: normal; line-height: 24px; text-indent: 0px; text-transform: none; white-space: normal; word-spacing: 0px;">"The Test of Insufficient Variance (TIVA) shows that the variance in z-scores is less than 1, but the probability of this event to occur by chance is 10%, Var(z) = .63, Chi-square (df = 11) = 17.43, p = .096."</span></i><br /><br /><span style="background-color: white; color: #141412; display: inline; float: none; font-family: "source sans pro" , "helvetica" , sans-serif; font-size: 16px; font-style: normal; font-variant: normal; font-weight: normal; letter-spacing: normal; line-height: 24px; text-indent: 0px; text-transform: none; white-space: normal; word-spacing: 0px;">Here is another example from a self-published manuscript by Daniel Ezra Johnson: </span><br /><span style="background-color: white; color: #141412; display: inline; float: none; font-family: "source sans pro" , "helvetica" , sans-serif; font-size: 16px; font-style: normal; font-variant: normal; font-weight: normal; letter-spacing: normal; line-height: 24px; text-indent: 0px; text-transform: none; white-space: normal; word-spacing: 0px;"><br /></span><i>"If we perform a likelihood-ratio test, comparing the model with gender to a null model with no predictors, we get a p-value of 0.0035. This implies that it is very unlikely that the observed gender difference is due to chance." 
</i><br /><br /><span style="background-color: white; color: #141412; display: inline; float: none; font-family: "source sans pro" , "helvetica" , sans-serif; font-size: 16px; font-style: normal; font-variant: normal; font-weight: normal; letter-spacing: normal; line-height: 24px; text-indent: 0px; text-transform: none; white-space: normal; word-spacing: 0px;">One might think that the above examples are not peer-reviewed, and that peer review would catch such mistakes. </span>But even people explaining p-values in publications are unable to understand that this is completely false. An example is Keith Johnson's textbook, Quantitative Methods in Linguistics, which repeatedly talks about "reliable effects" and effects which are and are not due to chance. It is no wonder that the poor psychologist/linguist thinks, ok, if the p-value is telling me the probability that the effect is due to chance, and if the p-value is low, then the effect is not due to chance and the effect must be true. The mistake here is that the p-value is telling you the probability of the result being "due to chance" conditional on the null hypothesis being true. It's better to explain the p-value as the probability of getting the statistic (e.g., t-value) or something more extreme, <i>under the assumption that the null hypothesis is true</i>. People seem to drop the italicized part and this starts to propagate the misunderstanding for future generations. To repeat, the p-value is a conditional probability, but most people interpret it as an unconditional probability because they drop the phrase "under the null hypothesis" and truncate the statement to be about effects being due to chance.<br /><br />Another bizarre thing I have repeatedly seen is misinterpreting the p-value as Type I error. Type I error is fixed at a particular value (0.05) before you run the experiment, and is the probability of your incorrectly rejecting the null when it's true, under repeated sampling. 
The p-value is what you get from your single experiment and is the conditional probability of your getting the statistic you got or something more extreme, <i>assuming that the null is true</i>. Even this point is beyond comprehension for psychologists (and of course linguists). <a href="http://journal.frontiersin.org/article/10.3389/fpsyg.2015.01100/full">Here is a bunch of psychologists</a> explaining in an article why a p=0.000 should not be reported as an exact value:<br /><br /><div class="page" title="Page 6"><div class="layoutArea"><div class="column"><span style="font-family: "minionpro"; font-size: 10.000000pt; font-weight: 700;"> </span><i><span style="font-family: "minionpro"; font-size: 10pt; font-weight: 700;">"p </span><span style="font-family: "mtsyn"; font-size: 10.000000pt;">= </span><span style="font-family: "minionpro"; font-size: 10.000000pt; font-weight: 700;">0.000</span><span style="font-family: "minionpro"; font-size: 10.000000pt;">. Even though this statistical expression, used in over 97,000 manuscripts according to </span><span style="font-family: "minionpro"; font-size: 10pt;">Google Scholar</span><span style="font-family: "minionpro"; font-size: 10.000000pt;">, makes regular cameo appearances in our computer printouts, we should assiduously avoid inserting it in our </span><span style="font-family: "minionpro"; font-size: 10pt;">Results </span><span style="font-family: "minionpro"; font-size: 10.000000pt;">sections. 
<b>This expression implies erroneously that there is a </b></span><b><span style="font-family: "minionpro"; font-size: 10pt;">zero </span></b><span style="font-family: "minionpro"; font-size: 10.000000pt;"><b>probability that the investigators have committed a Type I error</b>, that is, a false rejection of a true null hypothesis (</span><span style="color: rgb(30.000000% , 30.000000% , 30.000000%); font-family: "minionpro"; font-size: 10.000000pt;">Streiner, 2007</span><span style="font-family: "minionpro"; font-size: 10.000000pt;">). That conclusion is logically absurd, because unless one has examined essentially the entire population, there is always some chance of a Type I error, no matter how meager. Needless to say, the expression “</span><span style="font-family: "minionpro"; font-size: 10pt;">p </span><span style="font-family: "rblmi"; font-size: 10.000000pt;">< </span><span style="font-family: "minionpro"; font-size: 10.000000pt;">0.000” is even worse, as the probability of committing a Type I error cannot be less than zero. 
Authors whose computer printouts yield significance levels of </span><span style="font-family: "minionpro"; font-size: 10pt;">p </span><span style="font-family: "mtsyn"; font-size: 10.000000pt;">= </span><span style="font-family: "minionpro"; font-size: 10.000000pt;">0.000 should instead express these levels out to a large number of decimal places, or at least indicate that the probability level is below a given value, such as </span><span style="font-family: "minionpro"; font-size: 10pt;">p </span><span style="font-family: "rblmi"; font-size: 10.000000pt;">< </span><span style="font-family: "minionpro"; font-size: 10.000000pt;">0.01 or </span><span style="font-family: "minionpro"; font-size: 10pt;">p </span><span style="font-family: "rblmi"; font-size: 10.000000pt;">< </span><span style="font-family: "minionpro"; font-size: 10.000000pt;">0.001."</span></i></div></div></div><br />The p-value is the probability of committing a Type I error, eh? It is truly embarrassing that people who are teaching this stuff have distorted the meaning of the p-value so drastically and just keep propagating the error. I should mention though that this paper I am citing appeared in Frontiers, which I am beginning to question as a worthwhile publication venue. Who did the peer review on this paper and why did they not catch this basic mistake? <br /><br />Even Fisher (<span style="color: #0647ab; display: inline; float: none; font-family: "arial" , "helvetica" , sans-serif; font-size: 13px; font-style: normal; font-variant: normal; font-weight: normal; letter-spacing: normal; line-height: normal; text-align: -webkit-left; text-indent: 0px; text-transform: none; white-space: nowrap; word-spacing: 0px;">p. 
16 of The Design of Experiments, Second Edition, 1937</span>) did not regard a single significant result as decisive; he advocated replicability as the real test:<br /><br /><i>"It is usual and convenient for experimenters to take 5 per cent. as a standard level of significance, in the sense that they are prepared to ignore all results which fail to reach this standard, and, by this means, to eliminate from further discussion the greater part of the fluctuations which chance causes have introduced into their experimental results. No such selection can eliminate the whole of the possible effects of chance coincidence, and if we accept this convenient convention, and agree that an event which would occur by chance only once in 70 trials is decidedly "significant," in the statistical sense, <b>we thereby admit that no isolated experiment, however significant in itself, can suffice for the experimental demonstration of any natural phenomenon; for the "one chance in a million" will undoubtedly occur, with no less and no more than its appropriate frequency, however surprised we may be that it should occur to us.</b> In order to assert that a natural phenomenon is experimentally demonstrable we need, not an isolated record, but a reliable method of procedure. In relation to the test of significance, <b>we may say that a phenomenon is experimentally demonstrable when we know how to conduct an experiment which will rarely fail to give us a statistically significant result</b>."</i><br /><br />Second, I tried to clarify what a 95% confidence interval is and isn't. At least a couple of students had a hard time accepting that the 95% CI refers to the procedure and not that the true $\mu$ lies within one specific interval with probability 0.95, until I pointed out that $\mu$ is just a point value and doesn't have a probability distribution associated with it. Morey, Wagenmakers, Rouder, and colleagues have been shouting themselves hoarse about confidence intervals and <a href="http://link.springer.com/article/10.3758/s13423-013-0572-3#page-1">how many people don't understand them</a>; also see <a href="https://learnbayes.org/papers/confidenceIntervalsFallacy/">this paper</a>. Ironically, psychologists have responded to these complaints through various media, but even this response only showcases how psychologists have only a partial and misconstrued understanding of confidence intervals. I feel that part of the problem is that scientists hate to back off from a position they have taken, and so they tend to hunker down and defend defend defend their position. From the perspective of a statistician who understands both the Bayesian and frequentist positions, the conclusion would have to be that Morey et al are right, but for large sample sizes, the difference between a credible interval and a confidence interval (I mean the actual values that you get for the lower and upper bound) is negligible.
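<br /><br />The point about the procedure can be demonstrated directly: any single interval either contains the true $\mu$ or it does not, and the 95% is the long-run proportion of intervals, constructed the same way from repeated samples, that contain $\mu$. A minimal sketch using z-intervals with known $\sigma$ (all the numbers here are my own illustration):

```python
import random

random.seed(1)

mu, sigma, n = 15.0, 100.0, 25   # illustrative true values
se = sigma / n ** 0.5
n_sim = 10_000

covered = 0
for _ in range(n_sim):
    xbar = sum(random.gauss(mu, sigma) for _ in range(n)) / n
    lo, hi = xbar - 1.96 * se, xbar + 1.96 * se
    # For this particular interval, "mu lies in [lo, hi]" is simply
    # true or false; no probability attaches to it.
    covered += lo <= mu <= hi

print(covered / n_sim)  # long-run coverage of the procedure, about 0.95
```

About 95% of the 10,000 intervals cover $\mu$; for any single one of them, coverage is either 0 or 1.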
You can see examples in our <a href="http://arxiv.org/abs/1506.04967">recently ArXiv'd paper</a>.<br /><br />Third, I tried to explain that there is a cultural difference between statisticians on the one hand and most psychologists, linguists, and other end-users of statistics on the other. For the latter group (with the obvious exception of people using Bayesian methods for data analysis), the whole point of fitting a statistical model is to do a hypothesis test, i.e., to get a p-value out of it. They simply do not care what the assumptions and internal moving parts of a t-test or a linear mixed model are. I know lots of users of lmer who are focused on one and only one thing: is my effect significant? I have repeatedly seen experienced experimenters in linguistics simply ignoring the independence assumption of data points when doing a paired t-test; people often do paired t-tests on unaggregated data, with multiple rows of data points for each subject (for example). This leads to spuriously significant effects, which they happily and unquestioningly accept because that was the whole goal of the exercise. I show some examples in my lecture2 slides (slide 70).<br /><br />It's not just linguists: you can see the consequences of ignoring the independence assumption in <a href="http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0132145">this reanalysis</a> of the infamous study on how future tense marking in language supposedly influences economic decisions. Once the dependencies between languages are taken into account, the conclusion that Chen originally drew doesn't really hold up: <br /><br />" <i>When applying the strictest tests for relatedness, and when data is not aggregated across individuals, the correlation is not significant</i>."
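<br /><br />The t-test-on-unaggregated-data problem is easy to demonstrate by simulation: generate null data whose only structure is a by-subject shift (so rows within a subject are correlated), then test the raw rows as if they were independent versus testing the by-subject means. This is a sketch under assumed variance components, not a reanalysis of any actual data set; the critical values are approximate t quantiles:

```python
import math
import random

random.seed(1)

def t_stat(xs):
    """One-sample t statistic against mu = 0."""
    n = len(xs)
    m = sum(xs) / n
    var = sum((x - m) ** 2 for x in xs) / (n - 1)
    return m / math.sqrt(var / n)

n_subj, n_trials, n_sim = 20, 10, 2000
fp_raw = fp_agg = 0

for _ in range(n_sim):
    rows = []
    for _ in range(n_subj):
        shift = random.gauss(0, 50)  # by-subject shift; null is true overall
        rows.append([shift + random.gauss(0, 50) for _ in range(n_trials)])
    raw = [x for r in rows for x in r]        # 200 correlated data points
    agg = [sum(r) / n_trials for r in rows]   # 20 by-subject means
    fp_raw += abs(t_stat(raw)) > 1.97   # approx. critical t, df = 199
    fp_agg += abs(t_stat(agg)) > 2.09   # approx. critical t, df = 19

print("unaggregated false positive rate:", fp_raw / n_sim)  # far above 0.05
print("aggregated false positive rate:  ", fp_agg / n_sim)  # near 0.05
```

With these (made-up) variance components the unaggregated test rejects a true null far more often than the nominal 5%, while the by-subject analysis stays close to it.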
<br /><br />Similarly, <a href="https://www0.gsb.columbia.edu/mygsb/faculty/research/pubfiles/4679/power.poses_.PS_.2010.pdf">Amy Cuddy et al's study</a> on how power posing increases testosterone levels also got published only because the p-value just scraped in below 0.05 at 0.045 or so. You can see in their figure 3 reporting the testosterone increase<br /><br /><div class="separator" style="clear: both; text-align: center;"><a href="http://3.bp.blogspot.com/-yTXM6P4iDSs/VdGfqRvEgjI/AAAAAAAAAbI/L6EgVbmYOAI/s1600/powerposing.tiff" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="240" src="https://3.bp.blogspot.com/-yTXM6P4iDSs/VdGfqRvEgjI/AAAAAAAAAbI/L6EgVbmYOAI/s320/powerposing.tiff" width="320" /></a></div><br /><br />that their confidence intervals are huge (this is probably why they report standard errors; it wouldn't look so impressive if they had reported CIs). All they needed to show to make their point was to get the p-value below 0.05. The practical relevance of a 12 picogram/ml increase in testosterone is left unaddressed. Another recent example from Psychological Science, which seems to publish studies that might attract attention in the popular press, is this study on <a href="http://ubc-emotionlab.ca/wp-content/uploads/2013/01/replication-by-others-in-psych-science.pdf">how ovulating women wear red</a>. This study is a follow-up on the <a href="http://www.slate.com/articles/health_and_science/science/2013/07/statistics_and_psychology_multiple_comparisons_give_spurious_results.html">notorious</a> Psychological Science study by <a href="http://www.bryanburnham.net/wp-content/uploads/2014/01/Beall-Tracy-2013.pdf">Beall and Tracy</a>. In my opinion, the Beall and Tracy study reports a bogus result because they claim that women wear red or pink when ovulating, but when I reanalyzed their data I found that the effect was driven by pink alone. Here is my GLM fit for red or pink, red only, and pink only.
You can see that the "statistically significant" effect is driven entirely by pink, making the title of their paper (<i>Women Are More Likely to Wear Red or Pink at Peak Fertility</i>) true only if you allow the exclusive-or reading of the disjunction:<br /><br /><div class="separator" style="clear: both; text-align: center;"><a href="http://2.bp.blogspot.com/-sgWpOtfIy7A/VdGgdAA0GTI/AAAAAAAAAbQ/DkCStLDAuNs/s1600/redpink.tiff" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="116" src="https://2.bp.blogspot.com/-sgWpOtfIy7A/VdGgdAA0GTI/AAAAAAAAAbQ/DkCStLDAuNs/s320/redpink.tiff" width="320" /></a></div><br />The new study by <a href="http://ubc-emotionlab.ca/wp-content/uploads/2013/01/replication-by-others-in-psych-science.pdf">Eisenbruch et al</a> reports a statistically significant effect on this red-pink issue, but now it's only about red:<br /><br /><i>"A mixed regression model confirmed that, within subjects, the odds of wearing red were higher during the estimated fertile window than on other cycle days, b = 0.93, p = .040, odds ratio (OR) = 2.53, 95% confidence interval (CI) = [1.04, 6.14]. The 2.53 odds ratio indicates that the odds of wearing a red top were about 2.5 times higher inside the fertile window, but there was a wide confidence interval."</i><br /><br />To their credit, they note that their confidence interval is huge, and essentially includes 1. But since the p-value is below 0.05 this result is considered evidence for the "red hypothesis". It may well be that women who are ovulating wear red; I have no idea and have no stake in the issue. Certainly, I am not about to start looking at women wearing red as potential sexual partners (quite independent from the fact that my wife would probably kill me if I did). But it would be nice if people would try to do high powered studies, and report a replication in the same study they publish. 
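<br /><br />The reported numbers can at least be checked for internal consistency: the coefficient b = 0.93 is on the log-odds scale, so the odds ratio is exp(0.93), and the standard error can be backed out of the reported confidence interval. This reverse-engineering is my own arithmetic, not anything from the paper:

```python
import math

def Phi(x):
    """Standard normal CDF."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

b = 0.93                    # reported coefficient (log-odds scale)
ci_lo, ci_hi = 1.04, 6.14   # reported 95% CI for the odds ratio

odds_ratio = math.exp(b)
# Back out the standard error from the CI width on the log-odds scale:
se = (math.log(ci_hi) - math.log(ci_lo)) / (2 * 1.96)
z = b / se
p = 2 * (1 - Phi(z))

print(f"OR = {odds_ratio:.2f}, implied SE = {se:.2f}, implied p = {p:.3f}")
```

The numbers hang together: exp(0.93) is about 2.53 and the implied p-value is about .040, matching the report; the same arithmetic also makes plain how close the lower CI limit (1.04) is to 1.<br /><br />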
Luckily nobody will die if these studies report mistaken results, but the same mistakes are happening in medicine, where people will die as a result of incorrect conclusions being drawn.<br /><br />All these examples show why the focus on p-values is so damaging for answering research questions.<br /><br />Not surprisingly, for the statistician, the main point of fitting a model (even in a confirmatory factorial analysis) is not to derive a p-value from it; in fact, for many statisticians the p-value may not even rise to consciousness. The main point of fitting a model is to define a process which describes, in the most economical way possible, how the data were generated. If the data don't allow you to estimate some of the parameters, then, for a statistician it is completely reasonable to back off to defining a simpler generative process. <br /><br />This is what Gelman and Hill also explain in their 2007 book (italics mine). Note that they are talking about fitting Bayesian linear mixed models (in which parameters like correlations can be backed off to 0 by using appropriate priors; see the Stan code using lkj priors <a href="http://arxiv.org/abs/1506.06201">here</a>), not frequentist models like lmer. Also, Gelman would never, ever compute a p-value. <br /><br />Gelman and Hill 2007, p. 549:<br /><br /><div class="page" title="Page 567"><div class="layoutArea"><div class="column"><span style="font-family: "cmr10"; font-size: 10.000000pt;">"Don’t get hung up on whether a coe</span><span style="font-family: "cmr10"; font-size: 10.000000pt;">ffi</span><span style="font-family: "cmr10"; font-size: 10.000000pt;">cient “should” vary by group. Just allow it to vary in the model, and then, if the estimated scale of variation is small (as with the varying slopes for the radon model in Section 13.1), <i>maybe you can ignore it if that would be more convenient</i>. 
</span><br /><span style="font-family: "cmr10"; font-size: 10.000000pt;">Practical concerns sometimes limit the feasible complexity of a model—for example, we might fit a varying-intercept model first, then allow slopes to vary, then add group-level predictors, and so forth. <i>Generally, however, it is only the difficulties of fitting and, especially, understanding the models that keeps us from adding even more complexity, more varying coefficients, and more interactions</i>." </span></div></div></div><br />For the statistician, simplicity of expression and understandability of the model (in the Gelman and Hill sense of being able to derive sensible posterior (predictive) distributions) are of paramount importance. For the psychologist and linguist (and researchers in other areas), what matters is whether the result is statistically significant. The more vigorously you can reject the null, the more excited you get, and the language provided for this ("highly significant") also gives the illusion that we have found out something important (=significant).<br /><br />This seems to be a fundamental disconnect between statisticians and end-users who just want their p-value. A further source of the disconnect is that linguists, psychologists, etc. look for cookbook methods, what a statistician I know once derisively called a "one and done" approach. This leads to blind data fitting: load data, run a single line of code, publish result. No question ever arises about whether the model even makes sense. In a way this is understandable; it would be great if there were a one-shot solution to fitting, e.g., linear mixed models.
It would simplify life so much, and one wouldn't need to spend years studying statistics before one can do science. However, the same scientists who balk at studying statistics will willingly spend time studying their field of expertise. No mainstream (by which I mean Chomskyan) syntactician is ever going to use commercial software to print out his syntactic derivation without knowing anything about the syntactic theory. Yet this is exactly what these same people expect from statistical software: to get an answer without having any understanding of the underlying statistical machinery.<br /><br /> The bottom line that I tried to convey in my course was: forget about the p-value (except to soothe the reviewer and editor and to build your career), focus on doing high-powered studies, check model assumptions, fit appropriate models, replicate your findings, and publish against your own pet theories. Understanding what all these words mean requires some study, and we should not shy away from making that effort.<br /><br />PS I am open to being corrected---like everyone else, I am prone to making mistakes. Please post corrections, but with evidence, in the comments section. I moderate the comments because some people post spam there, but I will allow all non-spam comments.<br /><br />PPS The teaching evaluation for this course just came in from ESSLLI; here it is. 
I believe 5.0 is a perfect score.<br /><br /><div style="-webkit-text-stroke-width: 0px; background-color: white; color: #222222; font-family: arial, sans-serif; font-size: 19.2px; font-style: normal; font-variant: normal; font-weight: normal; letter-spacing: normal; line-height: normal; orphans: auto; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; widows: 1; word-spacing: 0px;"><em>Statistical methods for linguistic research: Foundational Ideas (Vasishth)</em></div><table style="-webkit-text-stroke-width: 0px; background-color: white; color: #222222; font-family: arial, sans-serif; font-size: 19.2px; font-style: normal; font-variant: normal; font-weight: normal; letter-spacing: normal; line-height: normal; orphans: auto; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; widows: 1; word-spacing: 0px;"><tbody><tr><td style="font-family: arial, sans-serif; margin: 0px;">Lecturer1</td><td style="font-family: arial, sans-serif; margin: 0px;">4.9</td></tr><tr><td style="font-family: arial, sans-serif; margin: 0px;"><br /></td><td style="font-family: arial, sans-serif; margin: 0px;"><br /></td></tr><tr><td style="font-family: arial, sans-serif; margin: 0px;">Did the course content correspond to what was proposed?</td><td style="font-family: arial, sans-serif; margin: 0px;">4.9</td></tr><tr><td style="font-family: arial, sans-serif; margin: 0px;">Course notes</td><td style="font-family: arial, sans-serif; margin: 0px;">4.6</td></tr><tr><td style="font-family: arial, sans-serif; margin: 0px;">Session attendance</td><td style="font-family: arial, sans-serif; margin: 0px;">4.4</td></tr></tbody></table><div style="-webkit-text-stroke-width: 0px; background-color: white; color: #222222; font-family: arial, sans-serif; font-size: 19.2px; font-style: normal; font-variant: normal; font-weight: normal; letter-spacing: normal; line-height: normal; orphans: auto; text-align: start; text-indent: 0px; text-transform: none; 
white-space: normal; widows: 1; word-spacing: 0px;"><small>(19 respondents)</small></div><ul style="-webkit-text-stroke-width: 0px; background-color: white; color: #222222; font-family: arial, sans-serif; font-size: 19.2px; font-style: normal; font-variant: normal; font-weight: normal; letter-spacing: normal; line-height: normal; orphans: auto; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; widows: 1; word-spacing: 0px;"><li style="margin-left: 15px;">Very good course.</li><li style="margin-left: 15px;">The lecturer was simply great. He has made hard concepts really easy to understand. He also has been able to keep the class interested. A real pity to miss the last lecture !!</li><li style="margin-left: 15px;">The only reason that this wasn't the best statistics course is that I had a great lecturer at my university on this. Very entertaining, informative, and correct lecture, I can't think of anything the lecturer could do better.</li><li style="margin-left: 15px;">Informative, deep and witty. Simply awesome.</li><li style="margin-left: 15px;">Professor Shravan Vasishth was hands down the best lecturer at ESSLLI 2015. I envy the people who actually get to learn from him for a whole semester instead of just a week or two. The course was challenging for someone with not much background in statistics, but Professor Vasishth provided a bunch of additional material. He's the best!</li><li style="margin-left: 15px;">Great course, very detailed explanations and many visual examples of some statistical phenomena. However, it would be better to include more information on regression models, especially with effects (model quality evaluation, etc) and more examples of researches from linguistic field.</li><li style="margin-left: 15px;">It was an extremely useful course presented by one of the best lecturers I've ever met. Thank you!</li><li style="margin-left: 15px;">Amazing course. 
Who would have thought that statistics could be so interesting and engaging? Kudos to lecturer Shravan Vasishth who managed to condense so much information into only 5 sessions, who managed to filter out only the most relevant things that will be applicable and indeed used by everyone who attendet the course and who managed to show the usefulness of the material. A great lecturer who never went on until everything was cleared up and made even the most daunting of statistical concepts seem surmountable. The only thing I'm sorry for is not having the opportunity to take his regular, semester-long statistics course so I can enjoy a more in depth look at the material and let everything settle properly. Five stars, would take again.</li><li style="margin-left: 15px;">Absolutely great!!</li></ul><br /><br />Shravan Vasishthhttp://www.blogger.com/profile/13453158922142934436noreply@blogger.com3tag:blogger.com,1999:blog-21621108.post-6082311848998408692015-02-14T08:55:00.001+01:002016-02-03T10:52:36.677+01:00Getting a statistics education: Review of the MSc in Statistics (Sheffield)<br /><a href="http://3.bp.blogspot.com/-bGXTfLwRuEM/VKLEshWKR7I/AAAAAAAAAaE/zipIfdVPECo/s1600/DSC_6056.jpg" imageanchor="1" style="clear: right; float: right; margin-bottom: 1em; margin-left: 1em;"><img border="0" height="400" src="http://3.bp.blogspot.com/-bGXTfLwRuEM/VKLEshWKR7I/AAAAAAAAAaE/zipIfdVPECo/s1600/DSC_6056.jpg" width="266" /></a>[This post was written between Sept 2012 and Feb 2015. I will post an update in Sept. 2015]<br /><br /><b>Last edit: June 27, 2015</b><br /><br /><b>Final edit: Nov 3, 2015 (added MSc thesis grade)</b><br /><br /><b>Some background</b>:<br /><br />I started using statistics for my research sometime in 1999 or 2000. I was a student at Ohio State, Linguistics, and I had just gotten interested in psycholinguistics. 
I knew almost nothing about statistics at that time. I did one Intro to Stats course in my department with Mike Broe (4 weeks), and that was it. In 1999 I developed repetitive strain injury, partly from using Excel and SPSS, and started googling for better statistical software. Someone pointed me to <a href="http://hcibib.org/perlman/stat/">|stat</a>, but eventually I found R. That was a transformative moment.<br /><br />The next stage in my education came in 2000, when I decided to go to the Statistical Consulting department at OSU and showed them my repeated measures ANOVA analyses. The response I got was: why are you fitting ANOVAs? You need linear mixed models. The statisticians showed me what I had to do code-wise, and I went ahead and finished my dissertation work using the nlme package. The <a href="http://www.amazon.de/Mixed-Effects-Models-S-PLUS-Statistics-Computing/dp/1441903178">Pinheiro and Bates </a>book had just come out then and I got myself a copy, understanding almost nothing in the book beyond the first few chapters.<br /><br />After that, I published a few more papers on sentence processing using nlme and then lmer, and in 2011 I co-wrote a book with Mike Broe (the basic template of the book was based on his lecture notes at OSU; he had used Mathematica or something like that, but I used R and expanded on his excellent simulation-based approach). This book revealed the incompleteness of my understanding, as spelled out in the <a href="http://onlinelibrary.wiley.com/doi/10.1111/j.1751-5823.2011.00159_7.x/abstract">scathing (and well-deserved) critique by Christian Robert</a>. Even before this review came out, I had already realized in early 2011 that I didn't really understand what I was doing. My sabbatical was coming up in winter 2011, and I enrolled for the graduate certificate in statistics at Sheffield to get a better understanding of statistical theory. 
Here is my review of the <a href="http://vasishth-statistics.blogspot.de/2011/12/part-1-of-2-review-of-graduate.html">distance-based graduate certificate in statistics taught at Sheffield</a>. <br /><br />At the end of that graduate certificate, I felt that I still didn't really understand much that was of practical relevance to my life as a researcher. That led me to do the MSc in Statistics at Sheffield, which I have been doing over three years (2012-15). This is a review of the MSc program. <strike>I haven't actually finished the program yet, but I think I know enough to write the review.</strike> My hope is that this overview will provide others with a road map of one possible route to a better understanding of data analysis, and of what to expect if one takes this route. <br /><br /><b>Short version of this review</b>: The three-year distance MSc program at Sheffield is outstanding. I highly recommend it to anyone wanting to acquire a good, basic understanding of statistical theory and inference. You can alternatively do the course over two years (probably impossible or very hard if you are also working full time, like me), or over one year full time (I don't know how people can do the degree in one year and still enjoy it). Be prepared to work hard and to find your own answers.<br /><br /><b>Long version</b>:<br /><br /><b>Cost</b>: For EU citizens, the three-year part-time program costs about 2000 British pounds a year, not including the travel costs to get to Sheffield for the annual exams and presentations. For non-EU citizens, it's about 5000 pounds a year, still cheaper than most US programs.<br /><br /><b>Summary notes</b> <b>of the MSc program</b>: I made summary notes for the exams during the three years. 
These are still very much in progress and are available from:<br /><br /><a href="https://github.com/vasishth/MScStatisticsNotes">https://github.com/vasishth/MScStatisticsNotes</a><br /><br />The courses I found most interesting and practically useful for my own research were Linear Modelling, Inference (Bayesian Statistics and Computational Inference), Medical Statistics, and Dependent Data (Multivariate Analysis).<br /><br /><b>Course structure</b>: Over three years, one does two courses each year, plus a dissertation. One has to commit about 15-20 hours a week in the 3-year program, although I think I did not do that much work, more like 12 hours a week on average (I had a lot of other work to do and just didn't have enough time to devote to statistics). There are four 3-hour, sort-of open-book exams that one has to go to Sheffield for, plus a group oral presentation, a simulated consultation, and project submissions. Every course has regular assignments/projects; all are graded, but only a subset count towards the final grade (15% of it). The minimum you have to get to pass is 50%. <br /><br />The MSc program is taught to residential students and to distance students in parallel: the residentials are there in Sheffield, attending lectures etc. The distance students follow the course over a mailing list. So someone like me, who is doing the course over three years, is going to overlap with three batches of residential MSc students. This has the effect that one has no classmates one knows, except maybe others who are doing the same three-year sequence. <br /><br />The exams, which are the most stressful part of the program, are open book in that one can bring the lecture notes and one's own notes, but no textbooks. However, the exams are designed in such a way that if you don't already know the material inside out, there is almost no point in taking lecture notes in with you---there won't be enough time to look up the notes. 
I did take the official lecture notes with me for the first three exams, but I never once opened them. Instead, I relied only on my own summary sheets. Also, the exams are designed so that most people can't finish the required questions (any 5 out of 6) in the three hours. At least I never managed to finish all the questions to my satisfaction in any exam.<br /><br /><u><b>The first year (2012-13)</b></u><br /><br />The first year courses were <a href="https://www-online.shef.ac.uk/pls/live/web_cal.cal_unit_detail?unit_code=MAS6002&ctype=ACAD+YR&start_date=27-SEP-10&mand=Optional">6002 (Stats Lab)</a> and <a href="http://maths.dept.shef.ac.uk/maths/module_info_1139.html">6003 (Linear Modelling)</a>. There was a project-based assessment for the first, and a 3-hour exam for the second.<br /><br /><b>6002 (Stats Lab)</b>: most of the course was about learning R, which anyone who had done the grad certificate did not need. It was only in the last weeks that things got interesting, with optimization. I didn't like the notes on optimization and MLE much, though. There wasn't enough detail, and I had to go searching in books and on the internet to find comprehensive discussions. Here I would recommend <a href="http://ms.mcmaster.ca/~bolker/emdbook/">Ben Bolker's chapters 6-8</a>, which are on his web page, complete with .Rnw files. Also, I just found a neat-looking book (not read yet) which I wish I had had in 2012: <a href="http://www.springer.com/mathematics/book/978-3-319-08262-2?otherVersion=978-3-319-08263-9">Modern Optimization with R</a>.<br /><br />Overall the Stats Lab course had the feel of an intro to R, which is what it should have been called. It should have been possible to test out of such a course---I did not need to read the first 12 of 13 chapters over 9 months; I could have done it in a week or less, and I'm sure that's true for those of my classmates who did the graduate certificate. However, I do see the point of the course for non-R users. 
I guess this is the perennial problem of teaching: students come in at different levels, and you have to cater to the lowest common denominator. Also, the introduction to R is pretty dated and needs a major overhaul. Much has happened since Hadley Wickham arrived on the scene, and it's a shame not to use his packages. Finally, the absence of literate programming tools was surprising to me. I expected it to be standard operating procedure in statistics to use Sweave or the like.<br /><br /><b>6003 (Linear Modelling)</b>: this course was absolutely amazing. The lecture notes were very well-written and very detailed (with some exceptions, noted below). Linear mixed models didn't get a particularly detailed treatment; I would have preferred a matrix presentation of LMM theory, and would have liked to learn how to implement these models myself. <br /><br /><b>Some problems I faced</b> <b>in year 1</b>: <br />One issue in the course was the slow return of corrected assignments. By the time an assignment comes back graded (well, we just get general feedback and a grade), you've forgotten the details. Another strange aspect was that the grades for assignments were sometimes sent by regular air-mail. This was surprising in an online course.<br /><br />One frustrating aspect of the courses was that a number of statements were made without any justification, proof, or further explanation. Example: "In R the default choice is the corner-point constraints given above, but in SPlus the default is the Helmert form, which is more convenient computationally, though more difficult to interpret." Wow, I want to know more! But this point is never discussed again. One consequence is a feeling that one must simply take certain facts as given (or work them out yourself). I think it would have been helpful to point the interested student to a reference.<br /><br />The responses to questions on the mailing list were sometimes slow to come. 
Answers to questions asked online sometimes didn't really address the question, and one was left in the same state of uncertainty as earlier (a familiar feeling when you talk to a statistician!). <br /><br />Where the graduate certificate shone was in the excruciatingly detailed feedback; this was where I learnt the most in that course. By contrast, the feedback to some of the assignments was pretty sketchy. I never really knew what a perfect solution would have looked like.<br /><br />Of course, I can see why all this happens: professors are busy, and not always able to respond quickly to questions. I myself am sometimes just as slow to respond as a teacher; I guess I need to work on that aspect of my own teaching.<br /><br />My final marks in these first-year courses were 63 per cent in each course.<br /><br /><u><b>The second year (2013-14)</b></u><br /><br />The second year courses were <a href="http://maths.dept.shef.ac.uk/maths/module_info_1133.html">6001 (Data Analysis)</a> and <a href="http://maths.dept.shef.ac.uk/maths/module_info_1005.html">6004 (Inference: Bayesian Statistics and Computational Inference)</a>. There was a project-based assessment for the first, and a 3 hour exam for the second.<br /><br /><b>In Data Analysis</b> we did several projects which simulated real-life consulting, or involved doing actual experiments (e.g., building aeroplanes). There was one project where one had to choose a news media article about a piece of scientific work, and then compare it with the actual scientific work. The consulting project didn't work so well for me, because we were teamed up in fives and we didn't know each other. 
It was very hard to coordinate a project when all your colleagues were unknown to you and email was the only way to communicate.<br /><br />For the news media article, I chose the article <a href="http://andrewgelman.com/2013/07/24/too-good-to-be-true-the-scientific-mass-production-of-spurious-statistical-significance/">Gelman attacked on his blog</a>, about women wearing red to signal sexual availability. It was interesting because the claims in the Psych Science paper didn't really pan out. I reanalyzed the original data, and found that the effect was driven by pink, not red; the authors had recoded red and pink as red-or-pink, presumably in order to make the claim that women wear reddish hues. It's hard to believe that this was not a post-hoc step taken after seeing the data (although I think the authors claim it was not---I suppose it's possible that it wasn't); after all, if they had originally intended to treat red and pink as a single color type, then why did they have two columns, one for red and one for pink? <br /><br />The Data Analysis course was definitely not challenging; it was rather below the level of data analysis I have to do in my own research. However, I was thankful not to be overloaded in this course, because the Bayesian analysis course took up all my energy in my second year.<br /><br /><b>The course on Bayesian statistics</b> was a whole other animal. I read a lot of books that were not assigned as required readings (mostly <a href="http://andrewgelman.com/2013/08/21/bda3-table-of-contents-also-a-new-paper-on-visualization/">Gelman et al's BDA3</a> and <a href="http://www.crcpress.com/product/isbn/9781584888499">Lunn et al</a>, but also <a href="http://www.amazon.com/Introduction-Statistics-Estimation-Scientists-Behavioral/dp/1441924345">Lynch's excellent textbook</a>). I did all three exercises that were assigned (these are graded but do not count for the final grade). My scores were 20/20, 22/30, 23/30. 
I never really understood what exactly led to those points being lost; not much detailed explanation was provided. One doesn't know how many marks one loses for making a figure too small, for example (I was following Gelman's example of showing lots of figures, which requires making them smaller, but evidently this was frowned upon). As is typical for this degree program, the grading is pretty harsh and tight-lipped (the harsh grading is not a bad thing, but the lack of information on what to improve in the answer was frustrating). <br /><br />The Bayesian lecture notes could be improved. They have a disjointed feel; perhaps they were written by different people. They were also very different from, say, the linear modelling notes, which really drilled the student on the practical details of model fitting. In the Bayesian course, there were sudden transitions to topics that fizzled out quickly and were never resurrected. An example is decision theory: one section starts out defining some basic concepts, and then quickly ends. Inference and decision theory were never discussed. There were sections that were in the notes but not needed for the exams; for an MSc-level program I would have wanted to read that material (and did). I had some questions on these non-examinable sections, but never could get an answer, which was pretty frustrating.<br /><br />The biggest thing that could be improved in these lecture notes is to provide more contact with code. Unfortunately, WinBUGS was introduced very late in the course, and then a fairly major project (which counts for the final grade) was assigned that was based entirely on modeling in WinBUGS. Apart from the fact that WinBUGS is just not well-designed software (JAGS or Stan is much better), not much practice was given in fitting models, certainly not as much as was given for linear modelling. 
Model fitting should be an integral part of the course from the outset, and WinBUGS should be abandoned in favor of JAGS.<br /><br />If I had not done a lot of reading on my own, and not learnt JAGS and Stan, I would have really suffered in this course. Maybe that's what the lecture notes are intended to do: it's a graduate-level course, and maybe the expectation is that one looks up the details on one's own.<br /><br />As it was, I enjoyed doing the Bayesian exercises, which were very neat problems---just hard enough to make you think, but not so hard that you can't solve them if you think hard and do your own research.<br /><br />One thing that was never discussed in the Bayesian data analysis course was how to do statistical inference, for example in factorial $2\times 2$ repeated measures designs. Textbooks on Bayesian methods don't discuss this either; perhaps they consider it enough that you get the posterior, and you can draw your own conclusions from that.<br /><br />I got scores in the mid 60s for each course. I think I had 63 in Data Analysis and 67 in Inference.<br /><br /><b>The third year </b><br /><br />The third year courses were <a href="http://maths.dept.shef.ac.uk/maths/module_info_1009.html">MAS6011 (Dependent Data)</a> and <a href="http://maths.dept.shef.ac.uk/maths/module_info_1150.html">MAS6012 (Sampling, Design, Medical Statistics)</a>. There is a 3-hour exam for each course.<br /><br /><b>The dependent data course</b> was truly amazing. In the first semester, I got to grips with multivariate analysis, and with some interesting data-mining-type tools such as PCA and linear discriminant analysis. The lecture notes could have been a lot more detailed for a graduate program; the lack of detail was probably due to the fact that undergrads and grad students were mixed in the same class. The second semester was about time-series analysis, and was the best-taught and most exciting course I took in this MSc. 
For the first time, video lectures are being provided every week, and these are proving to be extremely helpful. <br /><br />What really resonated with me in this course was state space modeling. I wish the whole course had been about that topic; the ARIMA modeling framework of Box and Jenkins is really amazing but pales into insignificance when you see what SSMs can do. Maybe it would have been better to teach a two-semester sequence instead of compressing a Data Mining type of course into the first semester, and TS into the second. I would happily have done another course instead of doing the Stats Lab and similar "soft" courses, as I mention elsewhere.<br /><br /><b>The Medical Statistics course</b> was fascinating because it was here that one finally saw issues being dealt with where people's lives would be at stake depending on the answer we obtain. One amazing fact I discovered is that Pocock (1983) considers power below 70% in an experiment to be <i>unethical</i>. Psycholinguists and psychologists routinely run low-power studies and publish their null results in prestigious journals. Luckily nobody will die as a result of these studies! Another amazing fact is that frequentist statistics is standard practice in medicine. I would have expected Bayesian stats to dominate in such a vitally important application of statistics. I am willing to use p-values to make a binary decision to help a journal editor feel good about a paper, but not if I am deciding whether drug X will help stave off death for a patient. I am really glad that I do not need to enter the job market as a statistician. If I were starting out my career after finishing this degree, I would probably have gone into a pharma company, and it is horrifying to think that I would be forced to deliver p-values as a decision-making tool.<br /><br />For the first semester, the medstats lecture notes were not that well written: not much detail, full of typos, and bullet-point-style presentations. 
The slides had no page numbers. These lecture notes and slides need a major overhaul in my opinion. I didn't get any detailed feedback on the first two exercises I submitted, and the feedback I did get I could not read, as it was handwritten with one of those ball-point pens that don't deliver ink steadily. The feedback, such as it was, came in unusually late as well. By contrast, the survival analysis lecture notes were much better, and I learnt a lot. <br /><br />The second semester lecture notes and slides were on the design of experiments and sampling theory (stratified sampling, cluster sampling, capture-recapture sampling, etc.). The DoE part was outstanding; for the first time, I learnt how optimal experimental design is set out, and learnt to determine the optimality of a design using the General Equivalence Theorem. I think I would have liked to have this course right after Linear Modelling (in the three-year distance program, this course and LM are separated by a year of coursework on computational statistics and Bayesian data analysis), although the gap did have one advantage: linear modelling theory had some time to sink in before I studied experiment design. I was less excited by the sampling part, but I think that this is because I am probably never going to be doing sample surveys. I just couldn't whip up enough enthusiasm for that topic, but I did hunker down and learn everything anyway. 
The second semester also came with weekly video recordings, so for the first time I was able to watch the same lecture that the residential students were attending.<br /><br /><b>Update</b>:<br /><br />I got 67% in Medical Statistics and 70% in Dependent Data (a distinction, my first in this MSc program!).<br /><br />This was much better than I expected; I write slowly (I enjoy writing with my high-quality gold-plated, lacquered Namiki Pilot fountain pen, and the sheer pleasure of having the pen glide over paper, leaving mesmerizing, exquisite strokes of black ink, slows me down a lot), and so I knew I would not be able to finish the papers, and I didn't. But I guess I must have done reasonably well on the questions that I did answer. For these two exams, I also practiced a lot more with the hand calculator, and I noticed that practice makes me... well, not perfect, but better. I did stop making stupid mistakes like forgetting that log on a calculator is by default to the base 10 and that I have to explicitly ask for log_e, and mistyping multiplication when I meant division. (In the BDA exam I actually managed to get a probability greater than 1 in one answer due to this kind of idiocy. Since I didn't have time to go back to fix my mistake, I just wrote "doesn't make sense, there must be a calculation error somewhere", hoping that the grader would realize that I understood the method but couldn't type on a hand-held.) <br /><br /><b>Final Update on the MSc Dissertation</b>:<br /><br />There's also a thesis to be written as part of the MSc; that counts for 60 credits in the 180-credit MSc program. I would have preferred to do more coursework than a thesis, but I can see why a thesis is required (all our programs in Potsdam require them too). More on that in September or October 2015.<br /><br />The thesis work went quite well overall. Initially I ran into a very difficult situation that turned out to be due to my having coded up my model incorrectly. 
My advisor (Jeremy Oakley) was extremely responsive and deftly asked me the right questions, which led me to find the bugs in my code; after that it was smooth sailing. That experience really showed (if that's not obvious) that it helps to work with an expert in the field. My final grade on the thesis was 73%.<br /><br /><b>General comments/suggestions for improvement</b>:<br /><br />1. The MSc currently has three specializations: Statistics, Medical Statistics, and Financial Statistics. Each has slightly different requirements (e.g., for Financial, you need to demonstrate specific math ability). I would add a fourth specialization, to reflect the needs of statisticians today. This could be called Computational Statistics or something like that.<br /><br />In this specialization, one could require a background in R programming, just as Financial Stats requires advanced math. One could replace Stats Lab and Data Analysis with a course on Statistical Computing (following some subset of the contents of textbooks like <a href="http://www.amazon.com/Statistical-Computing-Chapman-Hall-Series/dp/1420066501">Eubank et al</a>, <a href="http://www.amazon.com/Seamless-Integration-Rcpp-Dirk-Eddelbuettel/dp/1461468671/ref=pd_bxgy_b_text_y">Eddelbuettel</a>, <a href="http://www.amazon.com/Modern-Optimization-R-Use/dp/3319082620/ref=pd_sim_b_6?ie=UTF8&refRID=0EKXRXS1KSAXGSTT1TSY">Cortez</a>, <a href="http://www.amazon.com/Advanced-Chapman-Hall-CRC-Series/dp/1466586966/ref=pd_bxgy_b_text_z">Hadley</a>), and Statistical Learning (aka Data Mining), following a textbook like <a href="http://www.amazon.com/Introduction-Statistical-Learning-Applications-Statistics/dp/1461471370/ref=pd_sim_b_4?ie=UTF8&refRID=1J5WFEXX26ZJ6C6H8ZD2">James et al</a>. 
I am sure that such a specialization is badly needed; see, for example, the puzzled question asked by a statistician not so long ago in AMSTAT News: <a href="http://magazine.amstat.org/blog/2013/07/01/datascience/">Aren't we data science?</a> One can't prepare statisticians as "data scientists" if they don't have serious computing ability.<br /><br />Some of the data-mining-related material turns up in Dependent Data in year 3, and that's fine; there is much more that one needs exposure to today. For me, the Stats Lab and Data Analysis courses did not have enough bang for the buck. I can see that such courses could be useful to newcomers to R and data analysis (but at the grad level, I find it hard to believe that a student would never have seen R; I guess it's possible).<br /><br />But these courses didn't really challenge me to deal with the real-life problems one is likely to encounter as a future statistician (writing one's own packages, solving large-scale data mining problems). If there had been a more computationally oriented stream which assumed R, I would have taken that route. <br /><br />Some MS(c) programs with the kind of focus I am suggesting:<br />a. St Andrews: http://www.creem.st-and.ac.uk/datamining/structure.html <br />b. Another one in Sweden: http://www.liu.se/utbildning/pabyggnad/F7MSM/courses?l=en<br />c. Stanford: https://statistics.stanford.edu/academics/ms-statistics-data-science<br /><br />2. The lectures could easily have been recorded; this would have greatly enhanced the quality of the MSc. All you need is slides and screen-capture software with audio-recording capability.<br /><br />[<b>Update: SOMAS now records the lectures in real time, and posts them on YouTube. This has significantly improved course quality in my opinion, because it allows you to watch an expert do the derivations on the board, and to learn by copying/modeling that expert's approach to problem solving</b>.] <br /><br />3. 
The real value added in the MSc is the exercises, and the feedback after the exercises have been submitted. This is the only way that one learns new things in this course (apart from reading the lecture notes). The written exams are of course a crucial part of the program, but the solutions and one's own attempt are never released, so one has only a limited opportunity to learn from one's mistakes in the exam. For about 2000 pounds a year, this is quite a bargain. Basically this is equivalent to hiring a statistician for 33 hours at 60 pounds an hour each year, with the big difference that you leave the table knowing much more than when you arrived. <br /><br />4. Some ideas that were difficult for me:<br />- Expectation of a function of random variables was taught in the grad cert in 2011, but I needed it for the first time in 2014, when studying the EM algorithm. It would have been helpful to see a practical application early.<br />- The exponential distribution is a key distribution and needs much more study, especially in connection with modeling survival. Perhaps more time should be spent studying distributions and their interrelationships.<br />- The derivation of full conditional distributions could have been tightly linked to DAGs, as is done in the Lunn et al book. It was only after I read the Lunn et al book that I really understood how to work out the full conditional distribution in any reasonable Bayesian model.<br />- I learnt how to compute eigenvalues and eigenvectors in the graduate certificate, but didn't use this knowledge until 2014, when I did Multivariate Analysis. I didn't even understand the relevance of eigenvalues etc. until I saw the discussion on Principal Components Analysis. 
A tighter linkage between mathematical concepts and their application in statistics would be useful.<br />- Similarly, Lagrange multipliers became extremely useful when we started looking at PCA and Linear Discriminant Analysis; I saw them in 2011 and forgot all about them. There must be some way to show the applications of mathematical ideas in statistics. After much searching, I found <a href="http://www.amazon.com/Advanced-Calculus-Applications-Statistics-Khuri/dp/0471391042">this useful book</a> that does part of the job. <br /><br />5. The entire MSc program basically provides the technical background needed to understand major topics in statistics; there is not enough time to go into much detail. Each chapter in each course could have been a full course (e.g., the EM algorithm). I think that the real learning will not begin until I start to apply these ideas to new problems (as opposed to, say, using already known routines like linear mixed models). So, what I can say is that after four years of hard work, I know enough to actually <i>start</i> learning statistics. I don't feel like I really know anything; I just know the lay of the land.<br /><br />6. The MSc is heavily dependent on R. Not having a Python component in the course limits the student greatly, especially if they are going to go out into the world as a "data scientist". The Enthought on-demand courses are a fantastic supplement to the MSc coursework. It would be a good idea to have a Python course of that type in the MSc coursework as well.<br /><br />7. One mistake I made from the perspective of exam-taking was not to spend enough time during the year using the hand-calculator (actually, I spent no time on this). In the exam, the difference between a distinction and an upper second can be the speed with which you can compute (correctly!) on a calculator. 
I am terrible at this, rarely even able to do simple calculations correctly on a hand-held (I'm talking about really basic operations), simply because I don't use calculators in real life; who does? I would have much preferred exams that test analytical ability rather than the ability to do calculations quickly on a calculator. In the real world one uses computers to do calculations anyway. I was also hindered by the fact that I am half-blind (a side effect of kidney failure when I was 20) and can't even see the hand-calculator's screen properly. <br /><br />8. One peculiar aspect, and this permeated the MSc program, was the fairly antiquated set of instructions given to students for using LaTeX etc. I think that statisticians should lead the way and use tools like Sweave and knitr.<br /><br />9. The textbook recommendations are out of date and should be regularly revised. The best textbooks I found for each course that had exams associated with it:<br /><br /><b>Linear modelling</b>: <a href="http://www.amazon.com/Introduction-Generalized-Chapman-Statistical-Science/dp/1584889500/ref=sr_1_1?ie=UTF8&qid=1420028531&sr=8-1&keywords=Introduction+to+generalized+linear+models">An Introduction to Generalized Linear Models, Dobson et al </a><br /><br />Dobson et al is the best textbook I have ever read on generalized linear models, bar maybe <a href="http://www.amazon.com/Generalized-Chapman-Monographs-Statistics-Probability/dp/0412317605/ref=sr_1_1?ie=UTF8&qid=1420028563&sr=8-1&keywords=McCullagh+Nelder">McCullagh and Nelder</a>. Dobson et al was a recommended book in the linear modeling course, a very good choice.<br /><br /><b>Bayesian Statistics</b>: Lynch, Lunn et al, BDA3, <a href="http://www.amazon.com/Bayesian-Inference-Statistical-Analysis-Paperback/dp/B0050X7WFS/ref=sr_1_1?ie=UTF8&qid=1420028588&sr=8-1&keywords=Box+Tiao">Box and Tiao</a><br /><br />Lynch is the best first book to read for Bayes (if you know calculus), and Lunn et al is very useful indeed, and beautifully written. 
It prepares you well for doing practical data analysis. Unfortunately, it's oriented towards WinBUGS, but one can translate the code easily to JAGS. In my opinion, WinBUGS was a great first attempt, but it should be retired now, because it is just so painful to use. People should go straight to JAGS (thanks to <a href="https://martynplummer.wordpress.com/">Martyn Plummer</a> for doing just a fantastic job with JAGS) and then (or alternatively) Stan (thanks to <a href="http://www.cs.princeton.edu/~mdhoffma/">Matt Hoffman</a>, <a href="http://lingpipe-blog.com/">Bob Carpenter</a>, <a href="http://andrewgelman.com/">Andrew Gelman</a> and the <a href="http://mc-stan.org/">Stan</a> <a href="http://mc-stan.org/team.html">team</a> for making it possible to use Bayes for really complex problems). You really need both JAGS and Stan in order to read and understand the current books, especially if you are just starting out.<br /><br />I recommend reading Box and Tiao at the very end, to get a taste of (a) outstanding writing quality, and (b) what it was like to do Bayes in the pre-historic era (i.e., the 1970s).<br /><br /><b>Computational Inference</b>: <a href="http://www.amazon.com/Statistical-Computing-Chapman-Hall-CRC/dp/1584885459/ref=sr_1_1?ie=UTF8&qid=1420028644&sr=8-1&keywords=Statistical+Computing+with+R%2C+Rizzo">Statistical Computing with R, Rizzo</a><br />This book covers pretty much all of computational inference in a very user-friendly way.<br /><br /><b>Multivariate Analysis</b>: <a href="http://www.amazon.com/Mathematical-Tools-Applied-Multivariate-Analysis/dp/0121609553/ref=sr_1_1?ie=UTF8&qid=1420028676&sr=8-1&keywords=Mathematical+Tools+for+Applied+Multivariate+Analysis">Mathematical Tools for Applied Multivariate Analysis, by Carroll et al.</a><br /><br />This book is very heavy going and not an after-five kind of book; it needs serious and slow study. I used it mostly as a reference book. 
<br /><br /><b>Medical Statistics (Survival Analysis)</b>: <a href="http://www.amazon.com/Regression-Modeling-Strategies-Applications-Statistics/dp/1441929185/ref=sr_1_1?ie=UTF8&qid=1420028704&sr=8-1&keywords=Regression+Modeling+Strategies">Regression Modeling Strategies by Harrell</a>, and Dobson et al. I found the presentation of Survival Analysis in Harrell's book particularly helpful.<br /><br /><b>Concluding remarks</b><br /><br />This MSc program is very valuable for someone willing to work hard on their own, with rather variable amounts of guidance from the instructors. It provides a lot of good-quality structure, and it allows you to check your understanding objectively by way of exams.<br /><br /><b>Summary of grades</b>: You can see that I was starting to improve with experience!<br /><br />YEAR 1<br />63% in Stats Lab<br />64% in Linear Modelling<br />YEAR 2<br />63% in Data Analysis<br />67% in Inference<br />YEAR 3<br />67% in Medical Statistics<br />70% in Dependent Data (Distinction)<br /><br />73% in MSc dissertation (Distinction)<br /><br /><b>Overall grade: Pass with Merit</b>.<br /><br />Doing this MSc changed a lot of things for me professionally: <br /><br /><b>Teaching</b>:<br /><br />- I rewrote my <a href="http://www.ling.uni-potsdam.de/~vasishth/statistics/lecturenotes.html">lecture notes</a>, abandoning the statistics textbook I had written in 2011. The Sheffield coursework played a huge role in helping me clean up my notes. I think these notes still need a lot of work, and I plan to work on them during my coming sabbatical.<br /><br />- I started teaching undergrad Math as a prerequisite to my more technically oriented stats courses. <br /><br />- I started teaching Bayesian statistics as a standard part of the graduate linguistics coursework. 
There doesn't seem to be much interest among most linguistics students in this stuff, but I do attract a very special type of student in these classes, and that makes teaching more fun.<br /><br />- I started teaching linear (mixed) modeling in a way that aligns much more with standard presentations in the Sheffield MSc program.<br /><br />- At least <a href="http://www.ling.uni-potsdam.de/~nicenboim/">one of my students</a> has taken advantage of Bayesian methods in their research, so it's starting to have an impact. <br /><b><br /></b><b>Research</b>:<br /><br />- One thing that became clear (if it wasn't obvious already) is that becoming a professional statistician, or at least acquiring professional training in statistics, is a necessary condition for doing analyses correctly, but it isn't a sufficient condition. Statisticians are usually unable to address concerns from people in specific areas of research because they have no domain knowledge. It seems that without domain knowledge, statistical knowledge is basically useless. One should not go to statisticians seeking "recommendations" on what to do in particular situations. Depending on which statistician you talk to, you can get a very variable answer. You have to combine knowledge of your research area with knowledge of statistical theory (which of course you have to acquire, just as you acquired your domain knowledge) and work out the answer to your particular problem yourself.<br /><br />- I have essentially abandoned null hypothesis significance testing and just use Bayesian methods. The linear modeling and Bayesian statistics plus computational inference courses were instrumental in making this transition possible. I still report p-values, but only because reviewers and editors of journals insist on them.<br /><br />- I run high-powered studies whenever possible (e.g., it's not possible to run high-power studies with aphasic populations, at least not at Potsdam). 
Everything else is a waste of time and money.<br /><br />- I started posting all data and code online as soon as the associated paper is published. <br /><br />- I spend a lot of time visualizing the data and checking model assumptions before settling on a model.<br /><br />- I use bootstrapping a lot more to check whether my results hold up compared to more conventional methods. <br /><br />- I try to replicate my results, and try to publish replications both of my own work and of others' (much more difficult than I anticipated---people think replication is irrelevant and uninformative once someone has published a result with p less than 0.05).<br /><br />- I can understand books like <a href="http://www.amazon.com/gp/product/1439840954/ref=as_li_ss_tl?ie=UTF8&camp=1789&creative=390957&creativeASIN=1439840954&linkCode=as2&tag=chrprobboo-20">BDA3</a>. This was not true in 2011. That was the biggest gain of putting myself through this thing; it made me literate enough to read technical introductions.<br /><br />- I have started working on statistical problems and trying to publish methods papers. 
Two recent examples:<br /><br />http://arxiv.org/abs/1506.06201<br />http://arxiv.org/abs/1506.04967<br /><br /><br /><div class="separator" style="clear: both; text-align: center;"><a href="http://3.bp.blogspot.com/-Xwh9Apb8KKk/VrHN2fHQFqI/AAAAAAAAAcY/5ku2yhiW_OQ/s1600/MScDegreeVasishth.tiff" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="320" src="http://3.bp.blogspot.com/-Xwh9Apb8KKk/VrHN2fHQFqI/AAAAAAAAAcY/5ku2yhiW_OQ/s320/MScDegreeVasishth.tiff" width="224" /></a></div><br /><!--0--><!--0-->Shravan Vasishthhttp://www.blogger.com/profile/13453158922142934436noreply@blogger.com3tag:blogger.com,1999:blog-21621108.post-77216211034305925402015-02-09T10:01:00.001+01:002015-02-09T10:01:07.775+01:00Another comment on Hornstein's comments on HagoortOn his <a href="http://facultyoflanguage.blogspot.de/2015/02/the-future-of-linguistics-two-views.html?showComment=1423449175368">blog</a>, Norbert Hornstein had the following exchange. The original Hagoort post is <a href="http://www.mpi.nl/departments/neurobiology-of-language/news/linguistics-quo-vadis-an-outsider-perspective">here</a>.<br /><br />############## <br /><span style="-webkit-text-stroke-width: 0px; background-color: white; color: #333333; display: inline !important; float: none; font-family: Arial, Tahoma, Helvetica, FreeSans, sans-serif; font-size: 13.5px; font-style: normal; font-variant: normal; font-weight: normal; letter-spacing: normal; line-height: 12.5999994277954px; orphans: auto; text-align: justify; text-indent: 0px; text-transform: none; white-space: normal; widows: auto; word-spacing: 0px;">NH: " If Sprouse and Alemeida are right (which I assure you they are; read the papers) then there is nothing wrong with the data that GGers use."</span><br style="-webkit-text-stroke-width: 0px; background-color: white; color: #333333; font-family: Arial, Tahoma, Helvetica, FreeSans, sans-serif; font-size: 13.5px; font-style: normal; font-variant: normal; font-weight: 
normal; letter-spacing: normal; line-height: 12.5999994277954px; orphans: auto; text-align: justify; text-indent: 0px; text-transform: none; white-space: normal; widows: auto; word-spacing: 0px;" /><br style="-webkit-text-stroke-width: 0px; background-color: white; color: #333333; font-family: Arial, Tahoma, Helvetica, FreeSans, sans-serif; font-size: 13.5px; font-style: normal; font-variant: normal; font-weight: normal; letter-spacing: normal; line-height: 12.5999994277954px; orphans: auto; text-align: justify; text-indent: 0px; text-transform: none; white-space: normal; widows: auto; word-spacing: 0px;" /><span style="-webkit-text-stroke-width: 0px; background-color: white; color: #333333; display: inline !important; float: none; font-family: Arial, Tahoma, Helvetica, FreeSans, sans-serif; font-size: 13.5px; font-style: normal; font-variant: normal; font-weight: normal; letter-spacing: normal; line-height: 12.5999994277954px; orphans: auto; text-align: justify; text-indent: 0px; text-transform: none; white-space: normal; widows: auto; word-spacing: 0px;">SV: One should never be 100% sure of anything. There is always uncertainty and we should openly discuss the range of possibilities whenever we present a conclusion, not just argue for one position. That has been a problem in psychology, with overly strong conclusions, and that is a problem in linguistics, experimentally driven or not. But this is specially relevant for statistical inference. 
We can never be sure of anything.</span><br /><br /><i><span style="-webkit-text-stroke-width: 0px; background-color: white; color: #333333; display: inline !important; float: none; font-family: Arial, Tahoma, Helvetica, FreeSans, sans-serif; font-size: 13.5px; font-style: normal; font-variant: normal; font-weight: normal; letter-spacing: normal; line-height: 12.5999994277954px; orphans: auto; text-align: justify; text-indent: 0px; text-transform: none; white-space: normal; widows: auto; word-spacing: 0px;">NH: But I think that I disagree with your second point about being sure. One way of taking your point is that one should always be ready to admit that one is wrong. As a theoretical option, this is correct. BUT, I doubt very much anyone actually works in this way. Do you really leave open the option that, for example, thinking takes place in the kidneys and not the brain? Is it a live option for you that you see through the ears and hear through the eyes? Is it a live option for you that gravitational attraction is stronger than electromagnetic forces over distances of 2 inches? We may be wrong about everything we have learned, but this is a theoretical, not what in the 17th century was called a moral, possibility. Moreover, there is a real downside to keeping too open a mind, which is what genuflecting to this theoretical option can engender. I find refuting flat earthers and climate science denialists a waste of intellectual time and effort. Is it logically possible that they are right? Sure. Is it morally possible? No. Need we open our minds to their possibilities? No. Should we? No. Same IMO with what GGers have found out about language. There are many details I am willing to discuss, but I believe that it is time to stop acting as if the last 60 years of results might one day go up in smoke. 
That's not being open minded, or if this is what being open minded requires, then so much the worse for being open minded.</span><br style="-webkit-text-stroke-width: 0px; background-color: white; color: #333333; font-family: Arial, Tahoma, Helvetica, FreeSans, sans-serif; font-size: 13.5px; font-style: normal; font-variant: normal; font-weight: normal; letter-spacing: normal; line-height: 12.5999994277954px; orphans: auto; text-align: justify; text-indent: 0px; text-transform: none; white-space: normal; widows: auto; word-spacing: 0px;" /><br style="-webkit-text-stroke-width: 0px; background-color: white; color: #333333; font-family: Arial, Tahoma, Helvetica, FreeSans, sans-serif; font-size: 13.5px; font-style: normal; font-variant: normal; font-weight: normal; letter-spacing: normal; line-height: 12.5999994277954px; orphans: auto; text-align: justify; text-indent: 0px; text-transform: none; white-space: normal; widows: auto; word-spacing: 0px;" /><span style="-webkit-text-stroke-width: 0px; background-color: white; color: #333333; display: inline !important; float: none; font-family: Arial, Tahoma, Helvetica, FreeSans, sans-serif; font-size: 13.5px; font-style: normal; font-variant: normal; font-weight: normal; letter-spacing: normal; line-height: 12.5999994277954px; orphans: auto; text-align: justify; text-indent: 0px; text-transform: none; white-space: normal; widows: auto; word-spacing: 0px;">Let me say this another way: there are lots of things I expect to change over the course of the next 25 years of work in linguistics. However, there are many findings that I believe are settled effects. We will not wake up tomorrow and discover that reflexives resist binding or that all unbounded dependencies are created equal. These are now established facts, though there may be some discussion of the limits of their relevance. But they won't all go away. But this is precisely what Hagoort thinks we should do, and on one reading you are suggesting as well. 
Maybe we are completely wrong! Nope, we aren't. Being open minded to this kind of global skepticism about the state of play is both wrong and debilitating.<span class="Apple-converted-space"> </span></span><br style="-webkit-text-stroke-width: 0px; background-color: white; color: #333333; font-family: Arial, Tahoma, Helvetica, FreeSans, sans-serif; font-size: 13.5px; font-style: normal; font-variant: normal; font-weight: normal; letter-spacing: normal; line-height: 12.5999994277954px; orphans: auto; text-align: justify; text-indent: 0px; text-transform: none; white-space: normal; widows: auto; word-spacing: 0px;" /><br style="-webkit-text-stroke-width: 0px; background-color: white; color: #333333; font-family: Arial, Tahoma, Helvetica, FreeSans, sans-serif; font-size: 13.5px; font-style: normal; font-variant: normal; font-weight: normal; letter-spacing: normal; line-height: 12.5999994277954px; orphans: auto; text-align: justify; text-indent: 0px; text-transform: none; white-space: normal; widows: auto; word-spacing: 0px;" /><span style="-webkit-text-stroke-width: 0px; background-color: white; color: #333333; display: inline !important; float: none; font-family: Arial, Tahoma, Helvetica, FreeSans, sans-serif; font-size: 13.5px; font-style: normal; font-variant: normal; font-weight: normal; letter-spacing: normal; line-height: 12.5999994277954px; orphans: auto; text-align: justify; text-indent: 0px; text-transform: none; white-space: normal; widows: auto; word-spacing: 0px;">Last point: you are of course aware that your last sentence is a kind of paradox. Is the only thing we can be sure of that we can never be sure of anything? Hmm. As you know better than I do, this is NOT what actually happens in statistical practice. There are all sorts of things that are held to be impossible. In any given model the hypothesis space defines the limits of the probable. What's outside has 0 probability. The real fight, always, is what is possible and what not. 
Only then does probability mean anything.</span> </i><br /> ###############<br /><br /> Since Norbert's blog doesn't allow comments beyond a particular length, I post my response here:<br /><br />Norbert, I agree that my statement, taken literally, is obviously absurd. When I said that we can't be sure of anything, I didn't mean that we can't be sure that we don't think with our kidneys etc. I fully agree (and I would have to be really, really stupid not to agree! ;) that there are many things we can easily rule out as impossible; no experiments needed there (also not in syntactic investigations). I was talking specifically about results using rating studies. Take Sprouse et al's work, which is excellent in my opinion. More work like that should be done, and I'm fully for it, whatever the outcome. My comment was directed at your statement that we can be sure of Sprouse et al's results. I agree that syntacticians have a finely honed ability to sift through data by just using intuition. So I find the Sprouse et al conclusions plausible. My skepticism is of the following nature: it's entirely possible that the things syntacticians have studied so far were, relatively speaking, low-hanging fruit. The Sprouse et al results may be convincing for the items studied so far, but they may have limited validity for future work, where judgements could be a lot more variable and unstable. Or they may not replicate (replication is the acid test). Maybe we can take some of the work on negative polarity; we might find that the judgements diverge from those of expert NPI researchers (where judgements get pretty unstable---Van der Wouden once told me that we shouldn't even consult "ordinary" speakers of a language for NPI, since they won't even have reliable judgements; one has to consult a syntactician). Once we had an NPI specialist over at Ohio State when I was a grad student, and he presented his expert judgements as the basis for his theory; it was easy to find counterexamples in corpora. 
Or, if we move to a language like Hindi, which has inherently unstable and variable judgements, the judgements of linguists vs a sample from the population of native speakers may differ quite a bit. For example, I was really surprised by the key example in Mahajan's dissertation; it is very hard to "get" the judgement that Mahajan got. Initially I thought I just didn't get it because I wasn't a refined enough individual syntactically, but that was not the case. Similarly, we have done several rating studies on word order variation in Hindi, with completely unclear and unstable results. But syntacticians working on Hindi are pretty sure about what's OK and what's not OK in these cases (just take monoclausal word order with and without negation; here's a syntactician holding forth on this topic: http://www.ling.uni-potsdam.de/~vasishth/pdfs/VasishthRLC04.pdf. The situation is much less clear than this guy suggests in the paper, if you do a rating study). <br /><br />What I was commenting on was the certainty expressed in the statement "If Sprouse and Alemeida are right (which I assure you they are; read the papers)". Neither you nor I can know whether they are right. They have some evidence for their position, which may or may not replicate or generalize when we go beyond the language and phenomena covered there. <br /><br />PS You said that "One way of taking your point is that one should always be ready to admit that one is wrong. As a theoretical option, this is correct. BUT, I doubt very much anyone actually works in this way." I know at least one person. 
Take a look at some of my papers:<br /><br /><a href="http://www.ling.uni-potsdam.de/~vasishth/pdfs/FrankTrompenaarsVasishthCogSci.pdf">http://www.ling.uni-potsdam.de/~vasishth/pdfs/FrankTrompenaarsVasishthCogSci.pdf</a><br /><br /><a href="http://www.ling.uni-potsdam.de/~jaeger/publications/JaegerChenLiLinVasishth2015.pdf">http://www.ling.uni-potsdam.de/~jaeger/publications/JaegerChenLiLinVasishth2015.pdf</a><br /><br /><a href="http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0100986">http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0100986</a><br /><br /><a href="http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0077006">http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0077006</a><br /><br />We have more stuff in the works in which we try to break our own favorite story. Ted Gibson has also published against his favored positions. I think more people need to push against their own positions. People don't do that. I am highly suspicious of people who *only* find (or only publish) results favoring their own position. <br />Shravan Vasishthhttp://www.blogger.com/profile/13453158922142934436noreply@blogger.com0tag:blogger.com,1999:blog-21621108.post-90006427379476436502015-02-05T10:51:00.000+01:002015-02-05T10:51:56.747+01:00Quantitative methods in linguistics: The danger aheadPeter Hagoort has written a nice piece on his take on the future of linguistics:<br /><br />http://www.mpi.nl/departments/neurobiology-of-language/news/linguistics-quo-vadis-an-outsider-perspective<br /><br />He's very gentle on linguists in this piece. One of his suggestions is to do proper experimental research instead of relying on intuition. Indeed, the field of linguistics is already moving in that direction. 
I want to point out a potentially dangerous consequence of the move towards quantitative methods in linguistics.<br /><br />My expectation is that with the arrival of more and more quantitative work in linguistics, we are going to see (actually, we are already there) a different kind of degradation in the quality of work done. This degradation will be different from the kind linguistics has already experienced thanks to the tyranny of intuition in theory-building.<br /><br />Here are some things that I have personally seen linguists do (and psycholinguists do this too, even though they should know better!): <br /><br />1. Run an experiment until you hit significance. ("Is the result non-significant? Just run more subjects; it's going in the right direction.")<br />2. Alternatively, if you are looking to prove the null hypothesis, stop early or just run a low power study, where the probability of finding an effect is nice and low.<br />3. Run dozens (in ERP, even more than dozens) of tests and declare significance at 0.05.<br />4. Vary the region of interest post-hoc to get significance. <br />5. Never check model assumptions. <br />6. Never replicate results.<br />7. Don't release data and code with your publication.<br />8. Remove data as needed to get below the 0.05 threshold.<br />9. Only look for evidence in favor of your theory; never publish against your own theoretical position.<br />10. Argue from null results that you actually found that there is no effect.<br />11. Reverse-engineer your predictions post-hoc after the results show something unexpected. <br /><br />I could go on. The central problem is that doing experiments requires a strong grounding in statistical theory. But linguists (and psycholinguists) are pretty cavalier about acquiring the relevant background: have button, will click. No linguist would think of running his sentences through some software to print out his formal analyses; you need to have expert knowledge to do linguistics. 
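Item 1 on the list above is easy to demonstrate with a short simulation in R (my own toy sketch, not from any published study): the null hypothesis is true in every simulated experiment, but we test after every batch of subjects and stop as soon as p &lt; 0.05.

```r
## Optional stopping under a TRUE null hypothesis (toy sketch):
## start with 10 subjects, run a t-test, and "just run more subjects"
## in batches of 10 until p < 0.05 or we hit 100 subjects.
set.seed(1)
stop_when_significant <- function(max_n = 100, batch = 10, alpha = 0.05) {
  x <- rnorm(batch)                    # the true effect is exactly zero
  while (TRUE) {
    if (t.test(x)$p.value < alpha) return(TRUE)  # "significant": stop
    if (length(x) >= max_n) return(FALSE)        # give up
    x <- c(x, rnorm(batch))                      # run another batch
  }
}
## Proportion of null experiments that end up "significant":
mean(replicate(2000, stop_when_significant()))
## This comes out well above the nominal 0.05 Type I error rate.
```

With up to ten looks at the data, the false positive rate is inflated several-fold relative to the nominal 0.05; principled sequential-testing corrections exist, but running subjects until significance is not one of them.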
But the same linguist will happily gather rating data and run some scripts or press some buttons to get an illusion of quantitative rigor. I wonder why people think that statistical analysis is exempt from the deep background so necessary for doing linguistics. Many people tell me that they don't have the time to study statistics. But the statistics <i>is</i> the science. If you're not willing to put in the time, don't use statistics!<br /><br />I suppose I should be giving specific examples here; but that would just insult a bunch of people and would distract us from the main point, which is that the move to doing quantitative work in linguistics has a good chance of backfiring and leading to a false sense of security that we've found something "real" about language.<br /><br />I can offer one real example of a person I don't mind insulting: myself. I have made many, possibly all, of the mistakes I list above. I started out with formal syntax and semantics, and transitioned to doing experiments in 2000. Everything I knew about statistical analysis I learnt from a four-week course I did at Ohio State. I discovered R by googling for alternatives to SPSS and Excel, which had by then given me RSI. I had the opportunity to go over to the Statistics department to take courses there, but I missed that chance because I didn't understand how deep my ignorance was. The only reason I didn't make a complete fool of myself in my PhD was that I had the good sense to go to the Statistical Consulting section of OSU's Stats department, where they introduced me to linear mixed models ("why are you fitting repeated measures ANOVAs? Use nlme."). 
It was after I did a one-year course in Sheffield's Statistics department that I finally started to see what I had missed (I reviewed this course <a href="http://vasishth-statistics.blogspot.de/2011/12/part-1-of-2-review-of-graduate.html">here</a>).<br /> <br />For linguistics, becoming a quantitative discipline is not going to give us the payoff that people expect, unless we systematically work at making a formal statistical education a core part of the curriculum. Currently, what's happening is that we have advanced fast in using experimental methods, but have made little progress in developing a solid understanding of statistical inference. <br /><br />Obviously, not everyone who uses experimental methods in linguistics falls into this category. But the problems are serious, both in linguistics (and psycholinguistics), and it's better to recognize this now rather than let thousands of badly done experiments and analyses lead us down some other garden-path.<br /><br /><br />Shravan Vasishthhttp://www.blogger.com/profile/13453158922142934436noreply@blogger.com9tag:blogger.com,1999:blog-21621108.post-83745631119690059802015-01-02T10:48:00.003+01:002015-01-02T10:52:32.268+01:00A weird and unintended consequence of Barr et al's Keep It Maximal paperBarr et al's well-intentioned paper is starting to lead to some seriously weird behavior in psycholinguistics! As a reviewer, I'm seeing submissions where people take the following approach:<br /><br />1. Try to fit a "maximal" linear mixed model. If you get a convergence failure (this happens a lot since we routinely run low power studies!), move to step 2.<br /><br />[Aside: <br />By the way, the word maximal is ambiguous here, because you can have a "maximal" model with no correlation parameters estimated, or have one with correlations estimated. 
For a 2x2 design, the difference would look like:<br /><br />correlations estimated: (1+factor1+factor2+interaction|subject) etc.<br /><br />no correlations estimated: (factor1+factor2+interaction || subject) etc.<br /><br />Both options can be considered maximal.]<br /><br />2. Fit a repeated measures ANOVA. This means that you average over items to get F1 scores in the by-subject ANOVA. But this is cheating and amounts to p-value hacking. This effectively changes the between items variance to 0 because we aggregated over items for each subject in each condition. That is the whole reason why linear mixed models are so important; we can take both between item and between subject variance into account simultaneously. People mistakenly think that the linear mixed model and rmANOVA are exactly identical. If your experiment design calls for crossed varying intercepts and varying slopes (and it always does in psycholinguistics), an rmANOVA is not identical to the LMM, for the reason I give above. In the old days we used to compute minF. In 2014, I mean, 2015, it makes no sense to do that if you have a tool like lmer.<br /><br />As always, I'm happy to get comments on this.<br /><br />Shravan Vasishthhttp://www.blogger.com/profile/13453158922142934436noreply@blogger.com5tag:blogger.com,1999:blog-21621108.post-61748137611651266162014-11-30T21:50:00.000+01:002014-11-30T21:50:22.533+01:00Misunderstanding p-valuesThese researchers did a small between-patient study with low power to compare people on 24 hours of dialysis vs 12 hours of dialysis a week. They found that patients in the 24 hour arm had improved blood pressure (reduced intake of BP meds in the 24 hour arm), improved potassium and phosphate levels, and found no significant differences in a quality of life questionnaire given to the two arms. 
From this, the main conclusion they present is that (italics mine) "extending weekly dialysis hours for 12 months <i>did not improve quality of life</i>, but was associated with improvement of some laboratory parameters and reduced blood pressure requirement."<br /><br />If medical researchers can't even figure out what they can conclude from a null result from a low-powered study, they should not be allowed to do such studies. I also looked at <a href="http://www.euroqol.org/fileadmin/user_upload/Documenten/PDF/Products/Sample_UK__English__EQ-5D-3L.pdf">the quality of life questionnaire</a> they used. This questionnaire doesn't even begin to address important indicators of the quality of life of a patient on hemodialysis. A lot depends on the type of life the patient on dialysis was leading before he/she got into the study; what he/she does for a living (if anything), what other health problems he/she has,... These are the things that the questionnaire should be measuring; it doesn't even tackle relevant quality of life variables associated with increased dialysis. <br /><br />So, not only did they draw the wrong conclusion from their null result, the instrument they are using is not even the appropriate one. It would still have been just fine if they had not written "extending weekly dialysis hours for 12 months <i>did not improve quality of life."</i><br /><br />What a waste of money and time this is. It is really disappointing that such poor research passes the rigorous peer review of the <i>Journal of the American Society of Nephrology</i>. 
Here is what they say in their abstracts book:<br /><br /><b>"Abstract submissions were rigorously reviewed and graded by multiple experts."</b><br /><br />What the journal needs is statisticians reading and vetting these abstracts.<br /><br /><div class="separator" style="clear: both; text-align: center;"><a href="http://3.bp.blogspot.com/-L9UQRfx8MQU/VHt9QnfY6aI/AAAAAAAAAZ0/1k-6W1p8knQ/s1600/abstract.tiff" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" src="http://3.bp.blogspot.com/-L9UQRfx8MQU/VHt9QnfY6aI/AAAAAAAAAZ0/1k-6W1p8knQ/s1600/abstract.tiff" height="640" width="552" /></a></div><br /><br /><br /> Shravan Vasishthhttp://www.blogger.com/profile/13453158922142934436noreply@blogger.com0tag:blogger.com,1999:blog-21621108.post-77135223775034730912014-11-30T09:35:00.002+01:002014-11-30T20:10:46.676+01:00Response to John KruschkeI wanted to post this reply to <a href="http://doingbayesiandataanalysis.blogspot.de/2014/11/how-can-i-learn-bayesian-modeling.html">John Kruschke's blog post</a>, but the blog comment box does not allow such a long response, so I posted it on my own blog and will link it in the comment box: <br /><br />Hi John,<br /><br />thanks for the detailed responses, and for the friendly tone of your response, I appreciate it. <br /><br />I will try to write a more detailed review of the book to give some suggestions for the next edition, but I just wanted to respond to your comments:<br /><br />1. Price: I agree that it's relative. But your argument assumes a US audience; people are often willing to pay outrageous amounts for things that are priced much more reasonably (and realistically) in Europe. Is the book primarily targeted to the US population? If not, the price is unreasonable. I cannot ask my students to buy this book when much cheaper ones exist. 
Even Gelman et al release slides that cover all or a substantial part of the BDA book. The analogy with a calculus book is not valid either; Gilbert Strang's calculus book is available free on the internet, and there are many other free textbooks of very high quality. For statistics, there's Kerns, Michael Lavine's book, and for probability there are several great books available for free. <br /><br />This book is more accessible than BDA and could become the standard text in psycholinguistics/psychology/linguistics. Why not halve the price and make it easier to get hold of? Even better, release a free version on the web. I could then even set it as a textbook in my courses, and I would.<br /><br />2. Regarding the frequentist discussion, you wrote: "The vast majority of users of traditional frequentist statistics don't know why they should bother with taking the effort to learn Bayesian methods." <br /><br />and <br /><br />"Again, I think it's important for beginners to see the contrast with frequentist methods, so that they know why to bother with Bayesian methods."<br /><br />My objection is that the criticism of frequentist methods is not the primary motivation for using Bayesian methods. I agree that people don't understand p-values and CIs. But the solution to that is to educate them so that they understand them; the motivation for using Bayes cannot be that people don't understand frequentist methods and/or abuse them. The next step would be to not use Bayesian methods because people who use them don't understand them and/or abuse them.<br /><br />The primary motivation for me for using Bayes is the astonishing flexibility of Bayesian tools. It's not the only motivation, but this one thing outweighs everything else for me. <br /><br />Also, even if the user of frequentist statistics realizes the problems inherent in the abuse of frequentist tools, this alone won't be sufficient to motivate them to move to Bayesian statistics. 
A more inclusive philosophy would be more effective: for some things a frequentist method is just fine (used properly). For other things you really need Bayes. You don't always need a laser gun; there are times when a hammer would do just fine (my last sentence does not do justice to frequentist tools, which are often really sophisticated).<br /><br />3. "If anything, I find that adherence to frequentist methods require more blind faith than Bayesian methods, which to me just make rational sense. To the extent there is any tone of zealotry in my writing, it's only because the criticisms of p values and confidence intervals can come as a bit of a revelation after years of using p values without really understanding them."<br /><br />I understand where you are coming from; I have also taken the same path of slowly coming to understand what the methodology was really saying, and initially I also fell into the trap of getting annoyed with frequentist methods and rejecting them outright. <br /><br />But I have reconsidered my position and I think Bayes should be presented on its own merits. I can see that relating Bayes and freq. methods is necessary to clarify the differences, but this shouldn't run out of control. In my future courses that is the line I am going to take.<br /><br />When I read material attacking frequentist methods *as a way to get to Bayes*, I am strongly reminded of the gurus in India who use a similar strategy to make their new converts believe in them and drive out any loyalty to the old guru. That is where my analogy to religion is coming from. It's an old method, and I have seen religious zealots espousing "the one right way" using it.<br /><br />4. "Well, yes, that is a major problem. But I don't think it's the only major problem. I think most users of frequentist methods don't understand what a p value and confidence interval really are. "<br /><br />Often, these are the same thing. 
They are abused by many people because they don't understand them. An example is psycholinguistics, where we routinely publish null results in low power experiments as positive findings. The people who do that are not abusing statistics deliberately, they just don't know that a null result is not informative in their particular settings. Journal editors (top journals) think that a lower p-value gives you more evidence in favor of the specific alternative. They just don't understand it, but they are not involved in deception. <br /><br />The set of people who understand the method and deliberately abuse it is probably nearly the empty set. I don't know anyone in psycholinguistics who understands p-values and CIs and still abuses the method.<br /><br />I'll write more later (and I have many positive comments!) once I've finished reading your 700+ page book! :)Shravan Vasishthhttp://www.blogger.com/profile/13453158922142934436noreply@blogger.com2tag:blogger.com,1999:blog-21621108.post-44274758564492403372014-11-25T09:34:00.004+01:002014-11-25T09:34:56.427+01:00Should we fit maximal linear mixed models?Recently, Barr et al published <a href="http://idiom.ucsd.edu/~rlevy/papers/barr-etal-2013-jml.pdf">a paper in the Journal of Memory and Language</a>, arguing that we should fit maximal linear mixed models, i.e., fit models that have a full variance-covariance matrix specification for subject and for items. I suggest here that the recommendation should not be to fit maximal models, the recommendation should be to run high power studies.<br /><br />I released <a href="http://vasishth-statistics.blogspot.de/2014/08/an-adverse-consequence-of-fitting.html">a simulation on this blog</a> some time ago arguing that the correlation parameters are pretty meaningless. Dale Barr and Jake Westfall replied to my post, raising some interesting points. 
I have to agree with Dale's point that we should reflect the design of the experiment in the analysis; after all, our goal is to specify how we think the data were generated. But my main point is that given the fact that the culture in psycholinguistics is to run low power studies (we routinely publish null results with low power studies and present them as positive findings), fitting maximal models without asking oneself whether the various parameters are reasonably estimable will lead us to miss effects. <br /><br /><b>For me, the only useful recommendation to psycholinguists should be to run high power studies</b>.<br /><br />Consider two cases:<br /><br />1. <b>Run a low power study (the norm in psycholinguistics) where the null hypothesis is false.</b><br /><br />If you blindly fit a maximal model, you are going to miss detecting the effect more often compared to when you fit a minimal model (varying intercepts only). For my specific example below, the proportions of false negatives is 38% (maximal) vs 9% (minimal).<br /><br /><br /><div class="separator" style="clear: both; text-align: center;"><a href="http://1.bp.blogspot.com/-Bten9_WCb9c/VHQ7WKCUXNI/AAAAAAAAAZM/TbTP1fRT4_I/s1600/lowpower_correlations.tiff" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://1.bp.blogspot.com/-Bten9_WCb9c/VHQ7WKCUXNI/AAAAAAAAAZM/TbTP1fRT4_I/s1600/lowpower_correlations.tiff" height="208" width="320" /></a></div><br /><div class="separator" style="clear: both; text-align: center;"><a href="http://1.bp.blogspot.com/-4dzNQ_nsUyw/VHQ7Zt72j2I/AAAAAAAAAZU/pRspNLeLqvc/s1600/lowpower_effects.tiff" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://1.bp.blogspot.com/-4dzNQ_nsUyw/VHQ7Zt72j2I/AAAAAAAAAZU/pRspNLeLqvc/s1600/lowpower_effects.tiff" height="298" width="320" /></a></div><br /><br /><br /><br />In the top figure, we see that under repeated sampling, lmer is failing to estimate the true correlations for 
items (it's doing a better job for subjects because there is more data for subjects). Even though these are nuisance parameters, trying to estimate them for items in this dataset is a meaningless exercise (and the fact that the parameterization is going to influence the correlations is not the key issue here---that decision is made based on the hypotheses to be tested).<br /><br />The lower figure shows that under repeated sampling, the effect ($\mu$ is positive here, see <a href="http://vasishth-statistics.blogspot.de/2014/08/an-adverse-consequence-of-fitting.html">my earlier post for details</a>) is being missed much more often with a maximal model (black lines, 95% CIs) than with a varying intercepts model (red lines). The difference in the miss probability is 38% (maximal) vs 9% (minimal).<br /><br /><br /><br />2. <b>Run a high power study.</b><br /><br /><div class="separator" style="clear: both; text-align: center;"><a href="http://4.bp.blogspot.com/-9unS9Grusx0/VHQ7_MJYZ6I/AAAAAAAAAZc/SJyhtDOr9a0/s1600/highpower_correlations.tiff" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://4.bp.blogspot.com/-9unS9Grusx0/VHQ7_MJYZ6I/AAAAAAAAAZc/SJyhtDOr9a0/s1600/highpower_correlations.tiff" height="214" width="320" /></a></div><br /><div class="separator" style="clear: both; text-align: center;"><a href="http://1.bp.blogspot.com/-JyYaxAts_fs/VHQ8BVAN8mI/AAAAAAAAAZk/_hRuRZ2y4_I/s1600/highpower_effects.tiff" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://1.bp.blogspot.com/-JyYaxAts_fs/VHQ8BVAN8mI/AAAAAAAAAZk/_hRuRZ2y4_I/s1600/highpower_effects.tiff" height="284" width="320" /></a></div><br /><br />Now, it doesn't really matter whether you fit a maximal model or not. You're going to detect the effect either way. The upper plot shows that under repeated sampling, lmer will tend to detect the true correlations correctly. 
The lower plot shows that in almost 100% of the cases, the effect is detected regardless of whether we fit a maximal model (black lines) or not (red lines).<br /><br />My conclusion is that if we want to send a message regarding best practice to psycholinguistics, it should not be to fit maximal models. It should be to run high power studies. To borrow a phrase from Andrew Gelman's blog (or from Rob Weiss's), if you are running low power studies, <a href="http://andrewgelman.com/2014/11/21/youre-using-proper-informative-prior-youre-leaving-money-table/">you are leaving money on the table</a>.<br /><br />Here's my code to back up what I'm saying here. I'm happy to be corrected!<br /><br />https://gist.github.com/vasishth/42e3254c9a97cbacd490<br /><br />Shravan Vasishthhttp://www.blogger.com/profile/13453158922142934436noreply@blogger.com2tag:blogger.com,1999:blog-21621108.post-85541202456318790352014-11-22T12:17:00.001+01:002014-11-22T12:17:17.152+01:00Simulating scientists doing experimentsFollowing a discussion on Gelman's blog, I was playing around with simulating scientists looking for significant effects. Suppose each of 1000 scientists runs 200 experiments in their lifetime, and suppose that 20% of the experiments are such that the null is true. Assume a low power experiment (standard in psycholinguistics; eyetracking studies even in journals like JML can easily have something like 20 subjects). E.g., with a sample size of 1000, delta of 2, and sd of 50, we have power around 15%. We will add the stringent condition that the scientist has to get one replication of a significant effect before they publish it. <br /><br />What is the proportion of scientists that will publish at least one false positive in their lifetime? That was the question. Here's my simulation. 
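The power figure quoted above is easy to check; as a quick sketch, power.t.test from R's base stats package (which here assumes a two-sided, two-sample t-test with n per group) gives roughly 15%:

```r
## Power check for the scenario above: n = 1000 per group,
## true difference delta = 2, standard deviation sd = 50.
pw <- power.t.test(n = 1000, delta = 2, sd = 50, sig.level = 0.05)
round(pw$power, 2)  # about 0.15, i.e., power around 15%
```

Raising delta to 10 in the same call pushes the power to nearly 1, which is the high power situation.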
You can increase the effect_size to 10 from 2 to see what happens in high power situations.<br /><br /><script src="https://gist.github.com/vasishth/14192ae70a4ab4d7c56a.js"></script><br /><br />Comments and/or corrections are welcome.<br /><br /><br />Shravan Vasishthhttp://www.blogger.com/profile/13453158922142934436noreply@blogger.com4tag:blogger.com,1999:blog-21621108.post-37118488263510573272014-08-23T15:46:00.001+02:002014-08-23T15:46:20.993+02:00An adverse consequence of fitting "maximal" linear mixed models<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody><tr><td style="text-align: center;"><a href="http://1.bp.blogspot.com/-fzCYr70RTZM/U_iX7hhHpTI/AAAAAAAAAYs/wwgDgzpMjb0/s1600/Rplot01.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" src="http://1.bp.blogspot.com/-fzCYr70RTZM/U_iX7hhHpTI/AAAAAAAAAYs/wwgDgzpMjb0/s1600/Rplot01.png" height="211" width="320" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">Distribution of intercept-slope correlation estimates with 37 subjects, 15 items</td></tr></tbody></table><br /><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody><tr><td style="text-align: center;"><a href="http://4.bp.blogspot.com/-I3RA1E3X1mg/U_iX72L1oEI/AAAAAAAAAYw/HrFoTmTQjzo/s1600/Rplot02.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" src="http://4.bp.blogspot.com/-I3RA1E3X1mg/U_iX72L1oEI/AAAAAAAAAYw/HrFoTmTQjzo/s1600/Rplot02.png" height="211" width="320" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">Distribution of intercept-slope correlation estimates with 50 subjects, 30 items</td></tr></tbody></table>Should one always fit a full variance covariance matrix (a "maximal" model) when one analyzes repeated measures data-sets 
using linear mixed models? Here, I present one reason why blindly fitting ''maximal'' models does not make much sense.<br /><br />Let's create a repeated measures data-set that has two conditions (we want to keep this example simple), and the following underlying generative distribution, which is estimated from the Gibson and Wu 2012 (Language and Cognitive Processes) data-set. The dependent variable is reading time (rt).<br /><br />\begin{equation}\label{eq:ranslp2}<br />rt_{i} = \beta_0 + u_{0j} + w_{0k} + (\beta_1 + u_{1j} + w_{1k}) \hbox{x}_i + \epsilon_i<br />\end{equation}<br /><br />\begin{equation}<br />\begin{pmatrix}<br /> u_{0j} \\<br /> u_{1j}<br />\end{pmatrix}<br />\sim<br />N\left(<br />\begin{pmatrix}<br /> 0 \\<br /> 0<br />\end{pmatrix},<br />\Sigma_{u}<br />\right) <br />\quad<br />\begin{pmatrix}<br /> w_{0k} \\<br /> w_{1k} \\<br />\end{pmatrix}<br />\sim<br />N \left(<br />\begin{pmatrix}<br /> 0 \\<br /> 0<br />\end{pmatrix},<br />\Sigma_{w}<br />\right) <br />\end{equation}<br /><br /><br />\begin{equation}\label{eq:sigmau}<br />\Sigma_u =<br />\left[ \begin{array}{cc}<br />\sigma_{\mathrm{u0}}^2 & \rho_u \, \sigma_{u0} \sigma_{u1} \\<br />\rho_u \, \sigma_{u0} \sigma_{u1} & \sigma_{u1}^2\end{array} \right]<br />\end{equation}<br /><br />\begin{equation}\label{eq:sigmaw}<br />\Sigma_w =<br />\left[ \begin{array}{cc}<br />\sigma_{\mathrm{w0}}^2 & \rho_w \, \sigma_{w0} \sigma_{w1} \\<br />\rho_w \, \sigma_{w0} \sigma_{w1} & \sigma_{w1}^2\end{array} \right]<br />\end{equation}<br /><br />\begin{equation}<br />\epsilon_i \sim N(0,\sigma^2)<br />\end{equation}<br /><br />One difference from the Gibson and Wu data-set is that each subject is assumed to see each instance of each item (like in the old days of ERP research), but nothing hinges on this simplification; the results presented will hold regardless of whether we do a Latin square or not (I tested this).<br /><br />The parameters and sample sizes are assumed to have the following values:<br 
/><br /><br />* $\beta_0$=487<br />* $\beta_1$= 61.5<br /><br />* $\sigma$=544<br />* $\sigma_{u0}$=160<br />* $\sigma_{u1}$=195<br />* $\sigma_{w0}$=154<br />* $\sigma_{w1}$=142<br />* $\rho_u=\rho_w$=0.6<br />* 37 subjects<br />* 15 items<br /><br />Next, we generate data 100 times using the above parameter and model specification, and estimate (from lmer) the parameters each time. With the kind of sample size we have above, a maximal model does a terrible job of estimating the correlation parameters $\rho_u=\rho_w$=0.6.<br /><br />However, if we generate data 100 times using 50 subjects instead of 37, and 30 items instead of 15, lmer is able to estimate the correlations reasonably well.<br /><br />In both cases we fit ''maximal'' models; in the first case, it makes no sense to fit a "maximal" model because the correlation estimates tend to be over-estimated. The classical method (the generalized likelihood ratio test (the anova function in lme4) to find the ''best'' model) for determining which model is appropriate is discussed in the Pinheiro and Bates book, and would lead us to adopt a simpler model in the first case.<br /><br /> Douglas Bates himself has something to say on this topic:<br /><br />https://stat.ethz.ch/pipermail/r-sig-mixed-models/2014q3/022509.html<br /><br />As Bates puts it:<br /><br />"Estimation of variance and covariance components requires a large number of groups. It is important to realize this. It is also important to realize that in most cases you are not terribly interested in precise estimates of variance components. Sometimes you are but a substantial portion of the time you are using random effects to model subject-to-subject variability, etc. and if the data don't provide sufficient subject-to-subject variability to support the model then drop down to a simpler model. 
"<br /><br />Here is the code I used:<br /><br /><script src="https://gist.github.com/vasishth/f112e80e2d00147b3476.js"></script><br />Shravan Vasishthhttp://www.blogger.com/profile/13453158922142934436noreply@blogger.com4tag:blogger.com,1999:blog-21621108.post-18346565102340291942013-12-17T21:40:00.001+01:002013-12-17T21:40:22.778+01:00lmer vs Stan for a somewhat involved dataset.Here is a comparison of lmer vs Stan output on a mildly complicated dataset from a psychology expt. (Kliegl et al 2011). The data are here: https://www.dropbox.com/s/pwuz1g7rtwy17p1/KWDYZ_test.rda.<br /><br />The data and paper available from: http://openscience.uni-leipzig.de/index.php/mr2<br /><br />I should say that datasets from psychology and psycholinguistic can be much more complicated than this. So this was only a modest test of Stan.<br /><br />The basic result is that I was able to recover in Stan the parameter estimates (fixed effects) that were primarily of interest, compared to the lmer output. The sds of the variance components all come out pretty much the same in Stan vs lmer. 
The correlations estimated in Stan are much smaller than lmer, but this is normal: the bayesian models seem to be more conservative when it comes to estimating correlations between random effects.<br /><br />Traceplots are here: https://www.dropbox.com/s/91xhk7ywpvh9q24/traceplotkliegl2011.pdf<br /><br />They look generally fine to me.<br /><br />One very important fact about lmer vs Stan is that lmer took 23 seconds to return an answer, but Stan took 18,814 seconds (about 5 hours), running 500 iterations and 2 chains.<br /><br />One caveat is that I do have to try to figure out how to speed up Stan so that we get the best performance out of it that is possible.<br /><br /><script src="https://gist.github.com/vasishth/8012211.js"></script><br /><br />Shravan Vasishthhttp://www.blogger.com/profile/13453158922142934436noreply@blogger.com3tag:blogger.com,1999:blog-21621108.post-48838514567333015542013-12-16T09:53:00.000+01:002013-12-16T18:19:55.495+01:00The most common linear mixed models in psycholinguistics, using JAGS and StanAs part of <a href="http://www.ling.uni-potsdam.de/~vasishth/advanceddataanalysis.html">my course in bayesian data analysis</a>, I have put up some common linear mixed models that we fit in psycholinguistics. These are written in JAGS and Stan. 
Comments and suggestions for improvement are most welcome.<br /><br /><b>Code</b>: <a href="http://www.ling.uni-potsdam.de/~vasishth/lmmexamplecode.txt">http://www.ling.uni-potsdam.de/~vasishth/lmmexamplecode.txt</a><br /><b>Data</b>: <a href="http://www.ling.uni-potsdam.de/~vasishth/data/gibsonwu2012data.txt">http://www.ling.uni-potsdam.de/~vasishth/data/gibsonwu2012data.txt</a><br /><br />Shravan Vasishthhttp://www.blogger.com/profile/13453158922142934436noreply@blogger.com0tag:blogger.com,1999:blog-21621108.post-85814594812902811022013-10-08T12:20:00.000+02:002013-10-08T12:20:16.369+02:00New course on bayesian data analysis for psycholinguistics<div dir="ltr" style="text-align: left;" trbidi="on">I decided to teach a basic course on bayesian data analysis with a focus on psycholinguistics. Here is the course website (below). How could this possibly be a bad idea!<br /><br /><a href="http://www.ling.uni-potsdam.de/~vasishth/advanceddataanalysis.html">http://www.ling.uni-potsdam.de/~vasishth/advanceddataanalysis.html</a></div>Shravan Vasishthhttp://www.blogger.com/profile/13453158922142934436noreply@blogger.com2tag:blogger.com,1999:blog-21621108.post-42339341486459269202013-03-15T21:48:00.000+01:002013-03-15T21:48:13.988+01:00How are the random effects (BLUPs) `predicted' in linear mixed models?<br /><br /><br />In linear mixed models, we fit models like these (the Ware-Laird formulation--see Pinheiro and Bates 2000, for example):<br /><br />\begin{equation}<br />Y = X\beta + Zu + \epsilon<br />\end{equation}<br /><br />Let $u\sim N(0,\sigma_u^2)$, and this is independent from $\epsilon\sim N(0,\sigma^2)$. <br /><br />Given $Y$, the ``minimum mean square error predictor'' of $u$ is the conditional expectation:<br /><br />\begin{equation}<br />\hat{u} = E(u\mid Y)<br />\end{equation}<br /><br />We can find $E(u\mid Y)$ as follows. 
We write the joint distribution of $Y$ and $u$ as:<br /><br />\begin{equation}<br />\begin{pmatrix}<br />Y \\<br />u<br />\end{pmatrix}<br />\sim<br />N\left(<br />\begin{pmatrix}<br />X\beta\\<br />0<br />\end{pmatrix},<br />\begin{pmatrix}<br />V_Y & C_{Y,u}\\<br />C_{u,Y} & V_u \\<br />\end{pmatrix}<br />\right)<br />\end{equation}<br /><br />$V_Y, C_{Y,u}, C_{u,Y}, V_u$ are the various variance-covariance matrices.<br />It is a standard fact about the multivariate normal (the conditional distribution of one block given the other) that<br /><br />\begin{equation}<br />u\mid Y \sim N(C_{u,Y}V_Y^{-1}(Y-X\beta),<br />V_u - C_{u,Y} V_Y^{-1} C_{Y,u})<br />\end{equation}<br /><br />This allows you to derive the BLUPs:<br /><br />\begin{equation}<br />\hat{u}= C_{u,Y}V_Y^{-1}(Y-X\beta)<br />\end{equation}<br /><br />Substituting $\hat{\beta}$ for $\beta$, we get:<br /><br />\begin{equation}<br />BLUP(u)= \hat{u}(\hat{\beta}) = C_{u,Y}V_Y^{-1}(Y-X\hat{\beta})<br />\end{equation}<br /><div><br /></div><div>Here is a working example:</div><div><br /></div><br /><br /><br /><div><script src="https://gist.github.com/vasishth/5172988.js"></script><br /><div><div><br /></div><div><br /></div></div></div>Shravan Vasishthhttp://www.blogger.com/profile/13453158922142934436noreply@blogger.com1tag:blogger.com,1999:blog-21621108.post-70904051961934578832013-03-15T21:00:00.001+01:002013-03-17T19:53:09.442+01:00Correlations of fixed effects in linear mixed modelsEver wondered what those correlations are in a linear mixed model? For example:<br /><div><br /></div><div><script src="https://gist.github.com/vasishth/5172589.js"></script></div><div><br /></div><div>The estimated correlation between $\hat{\beta}_1$ and $\hat{\beta}_2$ is $0.988$. 
Note that</div><div><br /></div><div>$\hat{\beta}_1 = (Y_{1,1} + Y_{2,1} + \dots + Y_{10,1})/10=10.360$</div><div><br /></div><div>and </div><div><br /></div><div>$\hat{\beta}_2 = (Y_{1,2} + Y_{2,2} + \dots + Y_{10,2})/10 = 11.040$</div><div><br /></div><div>From this we can recover the correlation $0.988$ as follows:</div><div><br /></div><div><div><script src="https://gist.github.com/vasishth/5172613.js"></script></div><div><div><div><br /></div></div></div></div><div>By comparison, in the linear model version of the above:</div><div><br /></div><div><script src="https://gist.github.com/vasishth/5172668.js"></script></div><div><br /></div><div>because $Var(\hat{\beta}) = \hat{\sigma}^2 (X^T X)^{-1}$.</div><div><br /></div>Shravan Vasishthhttp://www.blogger.com/profile/13453158922142934436noreply@blogger.com2tag:blogger.com,1999:blog-21621108.post-2345129117456480992013-01-23T22:45:00.002+01:002013-01-23T22:45:58.992+01:00Linear models summary sheetAs part of my long slog towards statistical understanding, I started making notes on the very specific topic of linear models. The details are tricky and hard to keep in mind, and it is difficult to go back and forth between books and notes to try to review them. So I tried to summarize the basic ideas into a few pages (the summary sheet is not yet complete).<br /><br />It's not quite a cheat sheet, so I call it a summary sheet.<br /><br />Here is the current version:<br /><br /><a href="https://github.com/vasishth/StatisticsNotes">https://github.com/vasishth/StatisticsNotes</a><br /><br />Needless to say (although I feel compelled to say it), the document is highly derivative of lecture notes I've been reading. 
Corrections and comments and/or suggestions for improvement are most welcome.Shravan Vasishthhttp://www.blogger.com/profile/13453158922142934436noreply@blogger.com1tag:blogger.com,1999:blog-21621108.post-20167977024985586002012-03-03T20:21:00.005+01:002012-03-03T20:21:56.308+01:00Cauchy and determinants: when life was simple<div dir="ltr" style="text-align: left;" trbidi="on">" In Cauchy's day, when life was simple and matrices were small, determinants played a major role in analytic geometry and other parts of mathematics."<br /><br />Lay, p. 202 [Linear Algebra and its Applications, 3rd Edition (Update)]</div>Shravan Vasishthhttp://www.blogger.com/profile/13453158922142934436noreply@blogger.com0