Saturday, February 14, 2015

Getting a statistics education: Review of the MSc in Statistics (Sheffield)


[This post was written between Sept 2012 and Feb 2015. I will post an update in Sept. 2015]

Last edit: June 27, 2015

Some background:

I started using statistics for my research sometime in 1999 or 2000. I was a student at Ohio State, Linguistics, and I had just gotten interested in psycholinguistics. I knew almost nothing about statistics at that time. I did one Intro to Stats course in my department with Mike Broe (4 weeks),  and that was it. In 1999 I developed repetitive strain injury, partly from using Excel and SPSS, and started googling for better statistical software. Someone pointed me to |stat, but eventually I found R. That was a transformative moment.

The next stage in my education came in 2000, when I decided to go to the Statistical Consulting department at OSU and showed them my repeated measures ANOVA analyses. The response I got was: why are you fitting ANOVAs? You need linear mixed models. The statisticians showed me what I had to do code-wise, and I went ahead and finished my dissertation work using the nlme package. The Pinheiro and Bates book had just come out then and I got myself a copy, understanding almost nothing in the book beyond the first few chapters.

After that, I published a few more papers on sentence processing using nlme and then lmer, and in 2011 I co-wrote a book with Mike Broe (the basic template of the book was based on his lecture notes at OSU; he had used Mathematica or something like that, but I used R and expanded on his excellent simulation-based approach). This book revealed the incompleteness of my understanding, as spelled out in the scathing (and well-deserved) critique by Christian Robert. Even before this review came out, I had already realized in early 2011 that I didn't really understand what I was doing. My sabbatical was coming up in winter 2011, and I enrolled in the graduate certificate in statistics at Sheffield to get a better understanding of statistical theory. Here is my review of the distance-based graduate certificate in statistics taught at Sheffield.

At the end of that graduate certificate, I felt that I still didn't really understand much that was of practical relevance to my life as a researcher. That led me to do the MSc in Statistics at Sheffield, which I have been doing over three years (2012-15). This is a review of the MSc program. I haven't actually finished the program yet, but I think I know enough to write the review. My hope is that this overview will provide others a guide-map on one possible route one can take to achieving better understanding of data analysis, and what to expect if one takes this route.

Short version of this review: The three year distance MSc program at Sheffield is outstanding. I highly recommend it to anyone wanting to acquire a good, basic understanding of statistical theory and inference. You can alternatively do the course over two years (probably impossible or very hard if you are also working full time, like me), or over one year full time (I don't know how people can do the degree in one year and still enjoy it). Be prepared to work hard and to find your own answers.

Long version:

Cost: For EU citizens, the three-year part time program costs about 2000 British pounds a year, not including the travel costs to get to Sheffield for the annual exams and presentations.  For non-EU citizens, it's about 5000 pounds a year, still cheaper than most US programs.

Summary notes of the MSc program: I made summary notes for the exams during the three years. These are still very much in progress and are available from:

https://github.com/vasishth/MScStatisticsNotes

The courses I found most interesting and practically useful for my own research were Linear Modelling, Inference (Bayesian Statistics and Computational Inference), Medical Statistics, and Dependent data (Multivariate Analysis).

Course structure: Over three years, one does two courses each year, plus a dissertation.  One has to commit about 15-20 hours a week in the 3-year program, although I think I did not do that much work, more like 12 hours a week on average (I had a lot of other work to do and just didn't have enough time to devote to statistics). There are four 3 hour sort-of open book exams that one has to go to Sheffield for, plus a group oral presentation, a simulated consultation, and project submissions.  Every course has regular assignments/projects, all are graded but only a subset count for the final exam (15% of the final grade). The minimum you have to get to pass is 50%.

The MSc program is taught to residential students and to distance students in parallel: the residentials are there in Sheffield, attending lectures etc. The distance students follow the course over a mailing list.  So, someone like me, who's doing the course over three years, is going to overlap with three batches of the MSc residential students. This has the effect that one has no classmates one knows, except maybe others who are doing the same three-year sequence with you.

The exams, which are the most stressful part of the program, are open book in that one can bring lecture notes and one's own notes, but no textbooks. However, the exams are designed in such a way that if you don't already know the material inside out, there is almost no point in taking lecture notes in with you---there won't be enough time to look up the notes. I did take the official lecture notes with me for the first three exams, but I never once opened them. Instead, I only relied on my own summary sheets. Also, the exams are designed so that most people can't finish the required questions (any 5 out of 6) in the three hours. At least I never managed to finish all the questions to my satisfaction in any exam.

The first year (2012-13)

The first year courses were 6002 (Stats Lab) and 6003 (Linear Modelling). There was a project-based assessment for the first, and a 3 hour exam for the second.

6002 (Stats Lab): most of the course was about learning R, which anyone who had done the grad certificate did not need. It was only in the last weeks that things got interesting, with optimization. I didn't like the notes on optimization and MLE much, though. There wasn't enough detail, and I had to go searching in books and on the internet to find comprehensive discussions. Here I would recommend Ben Bolker's chapters 6-8, which are on his web page, complete with .Rnw files. Also, I just found a neat looking book (not read yet) which I wish I had had in 2012: Modern Optimization with R.

Overall the Stats Lab course had the feel of an intro to R, which is what it should have been called. It should have been possible to test out of such a course---I did not need to read the first 12 of 13 chapters over 9 months; I could have done it in a week or less, and I'm sure that's true for those of my classmates who did the graduate certificate. However, I do see the point of the course for non-R users. I guess this is the perennial problem of teaching: students come in with different levels, and you have to cater to the lowest common denominator. Also, the introduction to R is pretty dated and needs a major overhaul. Much has happened since Hadley Wickham arrived on the scene, and it's a shame not to use his packages. Finally, the absence of literate programming tools was surprising to me. I expected it to be standard operating procedure in statistics to use Sweave or the like.

6003 (Linear Modelling): this course was absolutely amazing.  The lecture notes were very well-written and very detailed (with some exceptions, noted below). Linear mixed models didn't get a particularly detailed treatment; I would have preferred a matrix presentation of LMM theory, and would have liked to learn how to implement these models myself.

Some problems I faced in year 1:
One issue in the course was the slow return of corrected assignments. By the time the assignment comes back graded (well, we just get general feedback and a grade), you've forgotten the details. Another strange aspect is that the grades for assignments were sometimes sent by regular air-mail. This was surprising in an online course.

One frustrating aspect of the courses was that a number of statements were made without any justification, proof, or further explanation. Example: "In R the default choice is the corner-point constraints given above, but in SPlus the default is the Helmert form, which is more convenient computationally, though more difficult to interpret." Wow, I want to know more! But this point is never discussed again. One consequence is a feeling that one must simply take certain facts as given (or work it out yourself). I think it would have been helpful to point the interested student to a reference.

The responses to questions on the mailing list are sometimes slow to come.   Answers to questions asked online sometimes didn't really address the question, and one was left in the same state of uncertainty as earlier (a familiar feeling when you talk to a statistician!).

Where the graduate certificate shone was in the excruciatingly detailed feedback; this was where I learnt the most in that course. By contrast, the feedback to some of the assignments was pretty sketchy. I never really knew what a perfect solution would have looked like.

Of course, I can see why all this happens: professors are busy, and not always able to respond quickly to questions. I myself am sometimes just as slow to respond as a teacher; I guess I need to work on that aspect of my own teaching.

My final marks in these first-year courses were 63 per cent in each course.

The second year (2013-14)

The second year courses were 6001 (Data Analysis) and 6004 (Inference: Bayesian Statistics and Computational Inference). There was a project-based assessment for the first, and a 3 hour exam for the second.

In Data Analysis we did several projects which simulated real-life consulting, or involved doing actual experiments (e.g., building aeroplanes). There was one project where one had to choose a news media article about a piece of scientific work, and then compare it with the actual scientific work. The consulting project didn't work so well for me, because we were teamed up in fives and we didn't know each other. It was very hard to coordinate a project when all your colleagues are unknown to you, and email is the only way to communicate.

For the news media article, I chose the article Gelman attacked on his blog, about women wearing red to signal sexual availability. It was interesting because the claims in the Psych Science paper didn't really pan out. I reanalyzed the original data, and found that the effect was driven by pink, not red; the authors had recoded red and pink as a single "red or pink" category, presumably in order to make the claim that women wear reddish hues. It's hard to believe that this was not a post-hoc step after seeing the data (although I think the authors claim it was not---I suppose it's possible that it wasn't); after all, if they had originally intended to treat red and pink as one unit color type, then why did they have two columns, one for red and one for pink?

The Data Analysis course was definitely not challenging; it was rather below the level of data analysis I have to do in my own research.  However, I was thankful not to be overloaded in this course because the Bayesian analysis course took up all my energy in my second year.

The course on Bayesian statistics was a whole other animal. I read a lot of books that were not assigned as required readings (mostly, Gelman et al's BDA3, and Lunn et al, but also Lynch's excellent textbook). I did all the three exercises that were assigned (these are graded but do not count for the final grade). My scores were 20/20, 22/30, 23/30. I never really understood what exactly led to those points being lost; not much detailed explanation was provided. One doesn't know how many marks one loses for making a figure too small, for example (I was following Gelman's example of showing lots of figures, which requires making them smaller, but evidently this was frowned upon). As is typical for this degree program, the grading is pretty harsh and tight-lipped (the harsh grading is not a bad thing; but the lack of information on what to improve in the answer was frustrating).

The Bayesian lecture notes could be improved. They seem to have a disjointed feel; perhaps they were written by different people.  The Bayesian lecture notes were very different than, say, the linear modeling notes, which really drilled the student on practical details of model fitting. In the Bayesian course, there were sudden transitions to topics that fizzled out quickly and were never resurrected. An example is decision theory; one section starts out defining some basic concepts, and then quickly ends. Inference and decision theory was never discussed. There were sections that were in the notes but not needed for the exams; for an MSc level program I would have wanted to read that material (and did). I had some questions on these non-examinable sections, but never could get an answer, which was pretty frustrating.

The biggest thing that could be improved in these lecture notes is to provide more contact with code. Unfortunately, WinBUGS was introduced very late in the course, and then a fairly major project (which counts for the final grade) was assigned that was based entirely on modeling in WinBUGS. Apart from the fact that WinBUGS is just not well-designed software (JAGS or Stan is much better), not much practice was given in fitting models, certainly not as much as was given for linear modelling. Model fitting should be an integral part of the course from the outset, and WinBUGS should be abandoned in favor of JAGS.

If I had not done a lot of reading on my own, and not learnt JAGS and Stan, I would have really suffered in this course.  Maybe that's what the lecture notes are intending to do: it's a graduate-level course, and maybe the expectation is that one looks up the details on one's own.

As it was, I enjoyed doing the Bayesian exercises, which were very neat problems---just hard enough to make you think, but not so hard that you can't solve them if you think hard and do your own research.

One thing that was never discussed in the Bayesian data analysis course was how to do statistical inference, for example in factorial $2\times 2$ repeated measures designs. Textbooks on Bayesian methods don't discuss this either; perhaps they consider it enough that you get the posterior; you can draw your own conclusions from that.

I got scores in the mid 60s for each course. I think I had 63 in Data Analysis and 67 in Inference.

The third year

The third year courses were MAS6011 (Dependent data) and MAS6012 (Sampling, Design, Medical Statistics). There is a 3 hour exam for each course.

The dependent data course was truly amazing. In the first semester, I got to grips with multivariate analysis, and with some interesting data mining type of tools such as PCA and linear discriminant analysis.  The lecture notes could have been a lot more detailed for a graduate program; the lack of detail was probably due to the fact that undergrads and grad students were mixed in in the same class. The second semester was about time-series analysis, and was the best taught course and the most exciting I took in this MSc. For the first time, video lectures are being provided every week, and these are proving to be extremely helpful.

What really resonated with me in this course was state space modeling. I wish the whole course had been about that topic; the ARIMA modeling framework of Box and Jenkins is really amazing but pales in significance when you see what SSMs can do. Maybe it would have been better to teach a two-semester sequence instead of compressing a Data Mining type of course into the first semester, and TS into the second. I would happily have done another course instead of doing those Stats Labs and such like "soft" courses, as I mention elsewhere.

The Medical Statistics course was fascinating because it was here that one finally saw issues being dealt with where people's lives would be at stake depending on the answer we obtain. One amazing fact I discovered is that Pocock 1983 considers power below 70% in an experiment to be unethical. Psycholinguists and psychologists routinely run low power studies and publish their null results in prestigious journals. Luckily nobody will die as a result of these studies! Another amazing fact is that frequentist statistics is standard practice in medicine. I would have expected that Bayesian stats would dominate in such a vitally important application of statistics. I am willing to use p-values to make a binary decision to help a journal editor feel good about the paper, but not if I am deciding whether drug X will help stave off death for a patient. I am really glad that I do not need to enter the job market as a statistician. If I were starting out my career after finishing this degree, I would probably have gone into a pharma company, and it is horrifying to think that I would be forced to deliver p-values as a decision-making tool.
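To make the 70% figure concrete, here is a minimal sketch of a power calculation. It assumes a two-sided one-sample z-test with known standard deviation, which is a simplification and not a design taken from Pocock or from the course notes; the effect size of 0.5 standard deviations and the function names are purely illustrative.

```python
from statistics import NormalDist


def z_test_power(effect_size, n, alpha=0.05):
    """Power of a two-sided one-sample z-test.

    effect_size is the true mean shift in standard-deviation units,
    n is the sample size; the standard deviation is assumed known.
    """
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha / 2)
    shift = effect_size * n ** 0.5
    # Probability that the test statistic lands in either rejection region.
    return std_normal.cdf(-z_crit + shift) + std_normal.cdf(-z_crit - shift)


def smallest_n_for_power(effect_size, target=0.70, alpha=0.05):
    """Smallest sample size whose power reaches the target."""
    n = 2
    while z_test_power(effect_size, n, alpha) < target:
        n += 1
    return n


if __name__ == "__main__":
    # Illustrative effect size only; not taken from any real study.
    for n in (10, 20, 40, 80):
        print(n, round(z_test_power(0.5, n), 3))
    print("n needed for 70% power:", smallest_n_for_power(0.5))
```

Under these assumptions, a small sample with a moderate effect falls well short of the 70% threshold, which is the situation many psycholinguistic studies are in.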

For the first semester, the medstats lecture notes were not that well written, with not much detail, full of typos and bullet point type presentations. The slides had no page numbers. These lecture notes and slides need a major overhaul in my opinion. I didn't get any detailed feedback on the first two exercises I submitted, and the feedback I did get I could not read as it was handwritten with one of those ball-point pens that don't steadily deliver ink. The feedback, such as it was, came in unusually late as well. By contrast, the survival analysis lecture notes were much better, and I learnt a lot.

The second semester lecture notes and slides were on the design of experiments and sampling theory (stratified sampling, cluster sampling, capture-recapture sampling, etc.). The DoE part was outstanding; for the first time, I learnt how optimal experimental design is set out, and learnt to determine optimality of design using the General Equivalence Theorem. I think I would have liked to have this course right after Linear Modelling (in the three year distance program, this course and LM are separated by a year of coursework on computational statistics and Bayesian Data Analysis), although the gap did have one advantage: linear modeling theory had some time to sink in before I studied experiment design. I was less excited by the sampling part, but I think that this is because I am probably never going to be doing sample surveys. I just couldn't whip up enough enthusiasm for that topic, but I did hunker down and learn everything anyway. The second semester also came with weekly video recordings, so for the first time I was able to watch the same lecture that the residential students were attending.

Update:

I got 67% in Medical Statistics and 70% in Dependent Data (a distinction, my first in this MSc program!).

This was much better than I expected; I write slowly (I enjoy writing with my high quality gold plated, lacquered Namiki Pilot fountain pen, and the sheer pleasure of having the pen glide over paper, leaving mesmerizing, exquisite strokes of black ink, slows me down a lot), and so I knew I would not be able to finish the papers, and I didn't. But I guess the questions that I answered I must have done reasonably OK. For these two exams, I practiced a lot more with the hand calculator too, and I noticed that practice makes me... well, not perfect, but better. I did stop making stupid mistakes like forgetting that log is by default to the base 10 and that I have to explicitly ask for log_e, and mistyping multiplication when I meant division. (In the BDA exam I actually managed to get a probability greater than 1 in one answer due to this kind of idiocy. Since I didn't have time to go back to fix my mistake, I just wrote "doesn't make sense, there must be a calculation error somewhere", hoping that the grader would realize that I understand the method but can't type on a hand-held.)

The MSc Dissertation:

There's also a thesis to be written as part of the MSc; that counts for 60 credits in the 180 credit MSc program. I would have preferred to do more coursework than do the thesis, but I can see why a thesis is required (all our programs in Potsdam require them too).  More on that in September or October 2015.

General comments/suggestions for improvement:

1. The MSc currently has three specializations: Statistics, Medical Statistics, and Financial Statistics. Each has slightly different requirements (e.g., for Financial, you need to demonstrate specific math ability).  I would add a fourth specialization, to reflect the needs of statisticians today. This could be called Computational Statistics or something like that.

In this specialization, one could require a background in R programming, just as Financial Stats requires advanced math. One could replace Stats Lab and Data Analysis with a course on Statistical Computing (following some subset of the contents of textbooks like Eubank et al, Eddelbuettel, Cortez, Wickham), and Statistical Learning (aka Data Mining), following a textbook like James et al. I am sure that such a specialization is badly needed; see, for example, the puzzled question asked by a statistician not so long ago in AMSTAT news: Aren't we data science? One can't prepare statisticians as "data scientists" if they don't have serious computing ability.

Some of the data mining related material turns up in Dependent Data in year 3, and that's fine; there is much more that one needs exposure to today. For me, the Stats Lab and Data Analysis courses did not have enough bang for the buck. I can see that such courses could be useful to newcomers to R and data analysis (but at the grad level, I find it hard to believe that the student would have never seen R; I guess it's possible).

But these courses didn't really challenge me to deal with real-life problems one might be likely to encounter as a future statistician (writing one's own packages, solving large-scale data mining problems).  If there had been a more computationally oriented stream which assumed R, I would have taken that route.

Some MS(c) programs with the kind of focus I am suggesting:
a.  St Andrews: http://www.creem.st-and.ac.uk/datamining/structure.html
b. Another one in Sweden: http://www.liu.se/utbildning/pabyggnad/F7MSM/courses?l=en
c. Stanford: https://statistics.stanford.edu/academics/ms-statistics-data-science

2. The lectures could have easily been recorded, this would have greatly enhanced the quality of the MSc. All you need is slides and a screen capture software with audio recording capability.

[Update: SOMAS now records the lectures in real time, and posts them on YouTube. This has significantly improved course quality in my opinion, because it allows you to watch an expert do the derivations on the board, and learn by copying/modeling that expert's approach to problem solving.]

3. The real value added in the MSc is the exercises, and the feedback after the exercises have been submitted. This is the only way that one learns new things in this course (apart from reading the lecture notes). The written exams are of course a crucial part of the program, but the solutions and one's own attempt are never released so one has only a limited opportunity to learn from one's mistakes in the exam. For about 2000 pounds a year, this is quite a bargain.  Basically this is equivalent to hiring a statistician for 33 hours at 60 pounds an hour each year, with the big difference that you leave the table knowing much more than when you arrived.

4. Some ideas that were difficult for me:
- Expectation of a function of random variables was taught in the grad cert in 2011, but I needed it for the first time in 2014, when studying the EM algorithm. It would have been helpful to see a practical application early.
- The exponential distribution is a key distribution and needs much more study, esp. in connection with modeling survival. Perhaps more time should be spent studying distributions and their interrelationships.
- The derivation of full conditional distributions could have been tightly linked to DAGs, as is done in the Lunn et al book. It was only after I read the Lunn et al book that I really understood how to work out the full conditional distribution in any (within reason) given Bayesian model.
- I learnt how to compute eigenvalues and eigenvectors in the graduate certificate, but didn't use this knowledge until 2014, when I did Multivariate Analysis. I didn't even understand the relevance of eigenvalues etc. until I saw the discussion on Principal Components Analysis. A tighter linkage between mathematical concepts and their application in statistics would be useful.
- Similarly, Lagrangian multipliers became extremely useful when we started looking at PCA and Linear Discriminant Analysis; I saw them in 2011 and forgot all about them. There must be some way to show the applications of mathematical ideas in statistics. After much searching, I found this useful book that does part of the job.
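The link between Lagrangian multipliers, eigenvalues and PCA mentioned in the last two points can be written out in a few lines. This is the standard textbook derivation rather than a quotation from the Sheffield notes, and the notation is chosen here for illustration: $\Sigma$ is the covariance matrix of the data, $a$ is a direction vector, and $\lambda$ is the multiplier.

```latex
% First principal component: the unit-length direction of maximum variance.
\max_{a}\; a^\top \Sigma a \quad \text{subject to} \quad a^\top a = 1

% Lagrangian and its stationarity condition.
L(a, \lambda) = a^\top \Sigma a - \lambda \left( a^\top a - 1 \right),
\qquad
\frac{\partial L}{\partial a} = 2 \Sigma a - 2 \lambda a = 0
\;\Longrightarrow\;
\Sigma a = \lambda a

% The variance along a is then the eigenvalue itself.
a^\top \Sigma a = \lambda\, a^\top a = \lambda
```

So the constrained maximum is attained at an eigenvector of $\Sigma$, and the variance along that direction equals the corresponding eigenvalue; the first principal component is the eigenvector with the largest eigenvalue.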

5. The entire MSc program basically provides the technical background needed to understand major topics in statistics; there is not enough time to go into much detail. Each chapter in each course could have been a full course (e.g., the EM algorithm). I think that the real learning will not begin until I start to apply these ideas to new problems (as opposed to, say, using already known routines like linear mixed models). So, what I can say is that after four years of hard work, I know enough to actually start learning statistics. I don't feel like I really know anything; I just know the lay of the land.

6. The MSc is heavily dependent on R. Not having a Python component to the course limits the student greatly, especially if they are going to go out there into the world as a "data scientist". The Enthought on-demand courses are a fantastic supplement to the MSc coursework. It would be a good idea to have a Python course of that type in the MSc coursework as well.

7. One mistake I made from the perspective of exam-taking was not to spend enough time during the year using the hand-calculator (actually, I spent no time on this). In the exam, the difference between a distinction and an upper second can be the speed with which you can compute (correctly!) on a calculator. I am terrible at this, rarely even able to do simple calculations correctly on a hand-held (I'm talking about really basic operations), simply because I don't use calculators in real life; who does? I would have much preferred exams that test analytical ability rather than ability to do calculations quickly on a calculator. In the real world one uses computers to do calculations anyway. I was also hindered by the fact that I am half-blind (a side effect of kidney failure when I was 20) and can't even see the hand-calculator's screen properly.

8. One peculiar aspect, and this permeated the MSc program, was the fairly antiquated instructions to students for using LaTeX etc. I think that statisticians should lead the way and use tools like Sweave and knitr.

9. The textbook recommendations are out of date and should be regularly revised. The best textbooks I found for each course that had exams associated with it:

Linear modelling: An Introduction to Generalized Linear Models, Dobson et al

Dobson et al is the best textbook I have ever read on generalized linear models, bar maybe McCullagh and Nelder. Dobson et al was a recommended book in the linear modeling course, a very good choice.

Bayesian Statistics: Lynch, Lunn et al, BDA3, Box and Tiao

Lynch is the best first book to read for Bayes (if you know calculus), and Lunn et al is very useful indeed, and beautifully written. It prepares you well for doing practical data analysis.  Unfortunately, it's oriented towards WinBUGS, but one can translate the code easily to JAGS. In my opinion, WinBUGS was a great first attempt, but it should be retired now, because it is just so painful to use. People should go straight to JAGS (thanks to Martyn Plummer for doing just a fantastic job with JAGS) and then (or alternatively) Stan (thanks to Matt Hoffman, Bob Carpenter, Andrew Gelman and the Stan team for making it possible to use Bayes for really complex problems). You really need both JAGS and Stan in order to read and understand books, especially if you are just starting out.

I recommend reading Box and Tiao at the very end, to get a taste of (a) outstanding writing quality, and (b) what it was like to do Bayes in the pre-historic era (i.e., the 1970s).

Computational Inference: Statistical Computing with R, Rizzo
This book covers pretty much all of computational inference in a very user-friendly way.

Multivariate Analysis: Mathematical Tools for Applied Multivariate Analysis, By Carroll et al.

This book is very heavy going and not an after-five kind of book, it needs serious and slow study. I used it mostly as a reference book.

Medical Statistics (Survival Analysis): Regression Modeling Strategies by Harrell, and Dobson et al. I found the presentation of Survival Analysis in Harrell's book particularly helpful.

Concluding remarks

This MSc program is very valuable for someone willing to work hard on their own, with rather variable amounts of guidance from the instructors. It provides a lot of good-quality structure, and it allows you to check your understanding objectively by way of exams.

Doing this MSc changed a lot of things for me professionally:

Teaching:

- I rewrote my lecture notes, abandoning the statistics textbook I had written in 2011. The Sheffield coursework played a huge role in helping me clean up my notes. I think these notes still need a lot of work, and I plan to work on them during my coming sabbatical.

- I started teaching undergrad Math as a prerequisite to my more technically oriented stats courses.

- I started teaching Bayesian statistics as a standard part of the graduate linguistics coursework. There doesn't seem to be much interest among most linguistics students in this stuff, but I do attract a very special type of student in these classes and that makes teaching more fun.

- I started teaching linear (mixed) modeling in a way that aligns much more with standard presentations in the Sheffield MSc program.

- At least one of my students has taken advantage of Bayesian methods in their research, so it's starting to have an impact. 

Research:

- One thing that became clear (if it wasn't obvious already) is that becoming a professional statistician or at least acquiring professional training in statistics is a necessary condition to doing analyses correctly, but it isn't a sufficient condition. Statisticians usually are unable to address concerns from people in specific areas of research because they have no domain knowledge. It seems that without domain knowledge, statistical knowledge is basically useless. One should not go to statisticians seeking "recommendations" on what to do in particular situations.  Depending on which statistician you talk to, you can get a very variable answer. Coupled with knowledge of your research area and knowledge of statistical theory (which of course you have to acquire, just as you acquired your domain knowledge), you have to work out the answer to your particular problem.

- I have essentially abandoned null hypothesis significance testing and just use Bayesian methods. The linear modeling and Bayesian statistics plus computational inference courses were instrumental in making this transition possible. I still report p-values, but only because reviewers and editors of journals insist on them.

- I run high-powered studies whenever possible (e.g., it's not possible to run high power studies with aphasic populations, at least not at Potsdam). Everything else is a waste of time and money.

- I started posting all data and code online as soon as the associated paper is published.

- I spend a lot of time visualizing the data and checking model assumptions before settling on a model.

- I use bootstrapping a lot more to check whether my results hold up compared to more conventional methods. 

- I try to replicate my results, and try to publish replications both of my own work and of others' (much more difficult than I anticipated: people think replication is irrelevant and uninformative once someone has published a result with p less than 0.05).

- I can understand books like BDA3. This was not true in 2011. That was the biggest gain of putting myself through this thing; it made me literate enough to read technical introductions.

- I have started working on statistical problems and trying to publish methods papers. Two recent examples:

http://arxiv.org/abs/1506.06201
http://arxiv.org/abs/1506.04967



Monday, February 09, 2015

Another comment on Hornstein's comments on Hagoort

On his blog, Norbert Hornstein had the following exchange. The original Hagoort post is here.

##############
NH: " If Sprouse and Alemeida are right (which I assure you they are; read the papers) then there is nothing wrong with the data that GGers use."

SV: One should never be 100% sure of anything. There is always uncertainty and we should openly discuss the range of possibilities whenever we present a conclusion, not just argue for one position. That has been a problem in psychology, with overly strong conclusions, and that is a problem in linguistics, experimentally driven or not. But this is specially relevant for statistical inference. We can never be sure of anything.

NH: But I think that I disagree with your second point about being sure. One way of taking your point is that one should always be ready to admit that one is wrong. As a theoretical option, this is correct. BUT, I doubt very much anyone actually works in this way. Do you really leave open the option that, for example, thinking takes place in the kidneys and not the brain? Is it a live option for you that you see through the ears and see through the eyes? Is if a live option for you that gravitational attraction is stronger than electromagnetic forces over distances of 2 inches? We may be wrong about everything we have learned, but we this is a theoretical, not what in the 17th century was called a moral possibility. Moreover, there is a real down side to keeping too open a mind, which is what genuflecting to this theoretical option can engender. I find refuting flat earthers and climate science denialists a waste of intellectual time and effort. Is it logically possible that they are right? Sure. Is it morally possible? No. Need we open our minds to their possibilities? No. Should we? No. Same IMO with that GGers have found out about language. There are many details I am willing to discuss, but I believe that it is time to stop acting as if the last 60 years of results might one day go up in smoke. That's not being open minded, or it this is what being open minded requires, then so much the worse for being open minded.

Let me say this another way: there are lots of things I expect to change over the course of the next 25 years of work in linguistics. However, there are many findings that I believe are settled effects. We will not wake up tomorrow and discover that reflexives resist binding or that all unbounded dependencies are created equal. These are not established facts, though there may be some discussion of the limits of their relevance. But they won't all go away. But this is precisely what Hagoort thinks we should do, and on one reading you are suggesting as well. Maybe we are completely wrong! Nope, we aren't. Bding open minded to this kind of global skepticism about the state of play is both wrong and debilitating. 

Last point: you are of course aware that your last sentence is a kind of paradox. Is the only thing we can be sure of is that we can never be sure of anything? Hmm. As you know better than I do, this is NOT what actually happens in statistical practice. There are all sorts of things that are held to be impossible. In any given model the hypothesis space defines the limits of the probable. What's outside has 0 probability. The real fight, always, is what is possible and what not. Only then does probably mean anything.

 ###############

 Since Norbert's blog doesn't allow comments beyond a particular length, I post my response here:

Norbert, I agree that my statement, taken literally, is obviously absurd. When I said that we can't be sure of anything, I didn't mean that we can't be sure that we don't think with our kidneys etc. I fully agree (and I would have to be really, really stupid not to agree! ;) that there are many things we can easily rule out as impossible; no experiments needed there (also not in syntactic investigations). I was talking specifically about results using rating studies. Take Sprouse et al's work, which is excellent in my opinion. More work like that should be done, and I'm fully for it, whatever the outcome. My comment was directed at your statement that we can be sure of Sprouse et al's results. I agree that syntacticians have a finely honed ability to sift through data by just using intuition. So I find the Sprouse et al conclusions plausible. My skepticism is of the following nature: it's entirely possible that the things syntacticians have studied so far were, relatively speaking, low hanging fruit. The Sprouse et al results may be convincing for the items studied so far, but they may have limited validity for future work, where judgements could be a lot more variable and unstable. Or they may not replicate (replication is the acid test). Maybe we can take some of the work on negative polarity; we might find that the judgements diverge from those of expert NPI researchers (where judgements get pretty unstable; Van der Wouden once told me that we shouldn't even consult "ordinary" speakers of a language for NPI, since they won't even have reliable judgements, one has to consult a syntactician). Once we had an NPI specialist over at Ohio State when I was a grad student, and he presented his expert judgements as the basis for his theory; it was easy to find counterexamples in corpora.
Or, if we move to a language like Hindi, which has inherently unstable and variable judgements, the judgements of linguists vs a sample from the population of native speakers may differ quite a bit. For example, I was really surprised by the key example in Mahajan's dissertation; it is very hard to "get" the judgement that Mahajan got. Initially I thought I just didn't get it because I wasn't a refined enough individual syntactically, but that was not the case. Similarly, we have done several rating studies on word order variation in Hindi, with completely unclear and unstable results. But syntacticians working on Hindi are pretty sure about what's OK and what's not OK in these cases (just take monoclausal word order with and without negation: here's a syntactician holding forth on this topic: http://www.ling.uni-potsdam.de/~vasishth/pdfs/VasishthRLC04.pdf. The situation is much less clear than this guy suggests in the paper, if you do a rating study).

What I was commenting on was the certainty expressed in the statement "If Sprouse and Alemeida are right (which I assure you they are; read the papers)". Neither you nor I can know whether they are right. They have some evidence for their position, which may or may not replicate or generalize when we go beyond the language and phenomena covered there.

PS You said that "One way of taking your point is that one should always be ready to admit that one is wrong. As a theoretical option, this is correct. BUT, I doubt very much anyone actually works in this way." I know at least one person. Take a look at some of my papers:

http://www.ling.uni-potsdam.de/~vasishth/pdfs/FrankTrompenaarsVasishthCogSci.pdf

http://www.ling.uni-potsdam.de/~jaeger/publications/JaegerChenLiLinVasishth2015.pdf

http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0100986

http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0077006

We have more stuff in the works in which we try to break our own favorite story.  Ted Gibson has also published against his favored positions. I think more people need to push against their own positions. People don't do that. I am highly suspicious of people who *only* find (or only publish) results favoring their own position.

Thursday, February 05, 2015

Quantitative methods in linguistics: The danger ahead

Peter Hagoort has written a nice piece on his take on the future of linguistics:

http://www.mpi.nl/departments/neurobiology-of-language/news/linguistics-quo-vadis-an-outsider-perspective

He's very gentle on linguists in this piece. One of his suggestions is to do proper experimental research instead of relying on intuition. Indeed, the field of linguistics is already moving in that direction. I want to point out a potentially dangerous consequence of the move towards quantitative methods in linguistics.

My expectation is that with the arrival of more and more quantitative work in linguistics, we are going to see (actually, we are already there) a different kind of degradation in the quality of work done. This degradation will be different from the kind linguistics has already experienced thanks to the tyranny of intuition in theory-building.

Here are some things that I have personally seen linguists do (and psycholinguists do this too, even though they should know better!):

1. Run an experiment until you hit significance. ("Is the result non-significant? Just run more subjects; it's going in the right direction.")
2. Alternatively, if you are looking to prove the null hypothesis, stop early or just run a low power study, where the probability of finding an effect is nice and low.
3. Run dozens (in ERP, even more than dozens) of tests and declare significance at 0.05.
4. Vary the region of interest post-hoc to get significance.
5. Never check model assumptions.
6. Never replicate results.
7. Don't release data and code with your publication.
8. Remove data as needed to get below the 0.05 threshold.
9. Only look for evidence in favor of your theory; never publish against your own theoretical position.
10. Argue from null results that you actually found that there is no effect.
11. Reverse-engineer your predictions post-hoc after the results show something unexpected.
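
To make item 1 concrete, here is a minimal simulation sketch of what "run more subjects until it's significant" does to the false positive rate. It is written in Python using only the standard library, and everything in it is an illustrative assumption rather than anything from a real study: the null hypothesis is true, the data are standard normal with known sd (so a z-test is used), and the hypothetical experimenter tests after 20 subjects and then after every 10 additional subjects, up to 100.

```python
import random
from statistics import NormalDist, mean


def optional_stopping_rate(n_sims=2000, start_n=20, max_n=100, step=10,
                           alpha=0.05, seed=1):
    """Type I error rate when the test is repeated after every batch of subjects."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided critical value
    false_positives = 0
    for _ in range(n_sims):
        # The null is true: every observation comes from N(0, 1).
        data = [rng.gauss(0.0, 1.0) for _ in range(start_n)]
        while True:
            z = mean(data) * len(data) ** 0.5  # z-test with known sd = 1
            if abs(z) > z_crit:
                false_positives += 1  # "significant": stop and write it up
                break
            if len(data) >= max_n:
                break  # out of subjects: give up
            data.extend(rng.gauss(0.0, 1.0) for _ in range(step))
    return false_positives / n_sims


def fixed_n_rate(n_sims=2000, n=100, alpha=0.05, seed=1):
    """Type I error rate when the sample size is fixed in advance."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    false_positives = 0
    for _ in range(n_sims):
        data = [rng.gauss(0.0, 1.0) for _ in range(n)]
        if abs(mean(data) * n ** 0.5) > z_crit:
            false_positives += 1
    return false_positives / n_sims


if __name__ == "__main__":
    print("test once at n = 100:      ", fixed_n_rate())
    print("test after every 10 subjects:", optional_stopping_rate())
```

Under these assumptions the fixed-n procedure should stay near the nominal 0.05, while the peeking procedure should come out well above it; the exact inflation depends on how often one peeks.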

I could go on. The central problem is that doing experiments requires a strong grounding in statistical theory. But linguists (and psycholinguists) are pretty cavalier about acquiring the relevant background: have button, will click. No linguist would think of running his sentences through some software to print out his formal analyses; you need to have expert knowledge to do linguistics. But the same linguist will happily gather rating data and run some scripts or press some buttons to get an illusion of quantitative rigor. I wonder why people think that statistical analysis is exempt from the deep background so necessary for doing linguistics.  Many people tell me that they don't have the time to study statistics. But the statistics is the science. If you're not willing to put in the time, don't use statistics!

I suppose I should be giving specific examples here; but that would just insult a bunch of people and would distract us from the main point, which is that the move to doing quantitative work in linguistics has a good chance of backfiring and leading to a false sense of security that we've found something "real" about language.

I can offer one real example of a person I don't mind insulting: myself. I have made many, possibly all, of the mistakes I list above. I started out with formal syntax and semantics, and transitioned to doing experiments in 2000. Everything I knew about statistical analysis I learnt from a four-week course I did at Ohio State.  I discovered R by googling for alternatives to SPSS and Excel, which had by then given me RSI. I had the opportunity to go over to the Statistics department to take courses there, but I missed that chance because I didn't understand how deep my ignorance was.  The only reason I didn't make a complete fool of myself in my PhD was that I had the good sense to go to the Statistical Consulting section of OSU's Stats department, where they introduced me to linear mixed models ("why are you fitting repeated measures ANOVAs? Use nlme.").  It was after I did a one-year course in Sheffield's Statistics department that I finally started to see what I had missed (I reviewed this course here).
 
For linguistics, becoming a quantitative discipline is not going to give us the payoff that people expect, unless we systematically work at making a formal statistical education a core part of the curriculum. Currently, what's happening is that we have advanced fast in using experimental methods, but have made little progress in developing a solid understanding of statistical inference.

Obviously, not everyone who uses experimental methods in linguistics falls into this category. But the problems are serious, both in linguistics and in psycholinguistics, and it's better to recognize this now rather than let thousands of badly done experiments and analyses lead us down some other garden-path.


Friday, January 02, 2015

A weird and unintended consequence of Barr et al's Keep It Maximal paper

Barr et al's well-intentioned paper is starting to lead to some seriously weird behavior in psycholinguistics! As a reviewer, I'm seeing submissions where people take the following approach:

1. Try to fit a "maximal" linear mixed model.  If you get a convergence failure (this happens a lot since we routinely run low power studies!), move to step 2.

[Aside:
By the way, the word maximal is ambiguous here, because you can have a "maximal" model with no correlation parameters estimated, or have one with correlations estimated. For a 2x2 design, the difference would look like:

correlations estimated: (1+factor1+factor2+interaction|subject) etc.

no correlations estimated: (factor1+factor2+interaction || subject) etc.

Both options can be considered maximal.]

2. Fit a repeated measures ANOVA. This means that you average over items to get F1 scores in the by-subject ANOVA. But this is cheating and amounts to p-value hacking. This effectively changes the between items variance to 0 because we aggregated over items for each subject in each condition. That is the whole reason why linear mixed models are so important; we can take both between item and between subject variance into account simultaneously. People mistakenly think that the linear mixed model and rmANOVA are exactly identical. If your experiment design calls for crossed varying intercepts and varying slopes (and it always does in psycholinguistics), an rmANOVA is not identical to the LMM, for the reason I give above. In the old days we used to compute minF.  In 2014, I mean, 2015, it makes no sense to do that if you have a tool like lmer.
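
The point about aggregating over items can be illustrated with a small simulation. This is only a sketch in Python (standard library only), under assumptions I am making up for illustration: 20 subjects, 16 items, two conditions assigned by a two-list Latin square, each item having its own condition effect (a varying slope) whose population mean is zero, and a by-subject paired t-test on the condition means standing in for the F1 analysis (for two conditions the two are equivalent). Item intercepts are left out to keep the sketch short.

```python
import random
from statistics import mean, stdev

N_SUBJ = 20
T_CRIT = 2.093  # two-sided 5% critical value of t with N_SUBJ - 1 = 19 df


def by_subject_false_positive_rate(item_slope_sd, n_items=16, noise_sd=1.0,
                                   n_sims=1000, seed=1):
    """Type I error of a by-subject analysis when items vary in their effect."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sims):
        # Item-specific condition effects; the true average effect is 0.
        slopes = [rng.gauss(0.0, item_slope_sd) for _ in range(n_items)]
        diffs = []
        for s in range(N_SUBJ):
            cond_a, cond_b = [], []
            for i, slope in enumerate(slopes):
                in_a = (s + i) % 2 == 0  # Latin square with two lists
                y = (slope / 2 if in_a else -slope / 2) + rng.gauss(0.0, noise_sd)
                (cond_a if in_a else cond_b).append(y)
            # Aggregating over items: one difference score per subject.
            diffs.append(mean(cond_a) - mean(cond_b))
        t = mean(diffs) / (stdev(diffs) / N_SUBJ ** 0.5)
        if abs(t) > T_CRIT:
            hits += 1
    return hits / n_sims


if __name__ == "__main__":
    print("no item variability:  ", by_subject_false_positive_rate(0.0))
    print("item slopes with sd 1:", by_subject_false_positive_rate(1.0))
```

With no between-item variability the by-subject test should behave as advertised. Once items differ in their effect, the average item effect is shared by all subjects, so it never shows up in the between-subject standard error, and the false positive rate should rise well above 0.05. A model with crossed subject and item terms is meant to account for exactly that variance.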

As always, I'm happy to get comments on this.

Sunday, November 30, 2014

Misunderstanding p-values

These researchers did a small between-patient study with low power to compare people on 24 hours of dialysis vs 12 hours of dialysis a week. They found that patients in the 24 hour arm had improved blood pressure (reduced intake of BP meds in the 24 hour arm), improved potassium and phosphate levels, and found no significant differences in a quality of life questionnaire given to the two arms. From this, the main conclusion they present is that (italics mine) "extending weekly dialysis hours for 12 months did not improve quality of life, but was associated with improvement of some laboratory parameters and reduced blood pressure requirement."

If medical researchers can't even figure out what they can conclude from a null result from a low powered study, they should not be allowed to do such studies. I also looked at the quality of life questionnaire they used. This questionnaire doesn't even begin to address important indicators of the quality of life of a patient on hemodialysis. A lot depends on the type of life the patient on dialysis was leading before he/she got into the study; what he/she does for a living (if anything), what other health problems he/she has,... These are the things that the questionnaire would measure; the questionnaire doesn't even tackle relevant quality of life variables associated with increased dialysis.

So, not only did they draw the wrong conclusion from their null result, the instrument they are using is not even the appropriate one. It would still have been just fine if they had not written "extending weekly dialysis hours for 12 months did not improve quality of life."

What a waste of money and time this is. It is really disappointing that such poor research passes the rigorous peer review of the Journal of the American Society of Nephrology. Here is what they say in their abstracts book:

"Abstract submissions were rigorously reviewed and graded by multiple experts."

What the journal needs is statisticians reading and vetting these abstracts.




 

Response to John Kruschke

I wanted to post this reply to John Kruschke's blog post, but the blog comment box does not allow such a long response, so I posted it on my own blog and will link it in the comment box:

Hi John,

thanks for the detailed responses, and for the friendly tone of your response, I appreciate it.

I will try to write a more detailed review of the book to give some suggestions for the next edition, but I just wanted to respond to your comments:

1. Price: I agree that it's relative. But your argument assumes a US audience; people are often willing to pay outrageous amounts for things that are priced much more reasonably (and realistically) in Europe. Is the book primarily targeted to the US population? If not, the price is unreasonable. I cannot ask my students to buy this book when much cheaper ones exist. Even Gelman et al release slides that cover all or a substantial part of the BDA book. The analogy with calculus books is not valid either; Gilbert Strang's calculus book is available free on the internet, and there are many other free textbooks of very high quality. For statistics, there's Kerns, Michael Lavine's book, and for probability there are several great books available for free.

This book is more accessible than BDA and could become the standard text in psycholinguistics/psychology/linguistics. Why not halve the price and make it easier to get hold of? Even better, release a free version on the web. I could then even set it as a textbook in my courses, and I would.

2. Regarding the frequentist discussion, you wrote: "The vast majority of users of traditional frequentist statistics don't know why they should bother with taking the effort to learn Bayesian methods." 

and

"Again, I think it's important for beginners to see the contrast with frequentist methods, so that they know why to bother with Bayesian methods."

My objection is that the criticism of frequentist methods is not the primary motivation for using Bayesian methods. I agree that people don't understand p-values and CIs. But the solution to that is to educate them so they understand them; the motivation for using Bayes cannot be that people don't understand frequentist methods and/or abuse them. The next step would be to not use Bayesian methods because people who use them don't understand them and/or abuse them.

The primary motivation for me for using Bayes is the astonishing flexibility of Bayesian tools. It's not the only motivation, but this one thing outweighs everything else for me.

Also, even if the user of frequentist statistics realizes the problems inherent in the abuse of frequentist tools, this alone won't be sufficient to motivate them to move to Bayesian statistics. A more inclusive philosophy would be more effective: for some things a frequentist method is just fine (used properly). For other things you really need Bayes. You don't always need a laser gun; there are times when a hammer would do just fine (my last sentence does not do justice to frequentist tools, which are often really sophisticated).

3. "If anything, I find that adherence to frequentist methods require more blind faith than Bayesian methods, which to me just make rational sense. To the extent there is any tone of zealotry in my writing, it's only because the criticisms of p values and confidence intervals can come as a bit of a revelation after years of using p values without really understanding them."

I understand where you are coming from; I have also taken the same path of slowly coming to understand what the methodology was really saying, and initially I also fell into the trap of getting annoyed with frequentist methods and rejecting them outright.

But I have reconsidered my position and I think Bayes should be presented on its own merits. I can see that relating Bayes and freq. methods is necessary to clarify the differences, but this shouldn't run out of control. In my future courses that is the line I am going to take.

When I read material attacking frequentist methods *as a way to get to Bayes*, I am strongly reminded of the gurus in India who use a similar  strategy to make their new converts believe in them and drive out any loyalty to the old guru.   That is where my analogy to religion is coming from.  It's an old method, and I have seen religious zealots espousing "the one right way" using it.

4. "Well, yes, that is a major problem. But I don't think it's the only major problem. I think most users of frequentist methods don't understand what a p value and confidence interval really are. "

Often, these are the same thing. They are abused by many people because they don't understand them. An example is psycholinguistics, where we routinely publish null results in low power experiments as positive findings. The people who do that are not abusing statistics deliberately, they just don't know that a null result is not informative in their particular settings. Journal editors (top journals) think that a lower p-value gives you more evidence in favor of the specific alternative. They just don't understand it, but they are not involved in deception.

The set of people who understand the method and deliberately abuse it is probably nearly the empty set.  I don't know anyone in psycholinguistics who understands p-values and CIs and still abuses the method.

I'll write more later (and I have many positive comments!) once I've finished reading your 700+ page book! :)

Tuesday, November 25, 2014

Should we fit maximal linear mixed models?

Recently, Barr et al published a paper in the Journal of Memory and Language, arguing that we should fit maximal linear mixed models, i.e., fit models that have a full variance-covariance matrix specification for subjects and for items. I suggest here that the recommendation should not be to fit maximal models; the recommendation should be to run high power studies.

I released a simulation on this blog some time ago arguing that the correlation parameters are pretty meaningless.  Dale Barr and Jake Westfall replied to my post, raising some interesting points. I have to agree with Dale's point that we should reflect the design of the experiment in the analysis; after all, our goal is to specify how we think the data were generated. But my main point is that given the fact that the culture in psycholinguistics is to run low power studies (we routinely publish null results with low power studies and present them as positive findings), fitting maximal models without asking oneself whether the various parameters are reasonably estimable will lead us to miss effects.

For me, the only useful recommendation to psycholinguists should be to run high power studies.

Consider two cases:

1. Run a low power study (the norm in psycholinguistics) where the null hypothesis is false.

If you blindly fit a maximal model, you are going to miss detecting the effect more often compared to when you fit a minimal model (varying intercepts only). For my specific example below, the proportion of false negatives is 38% (maximal) vs 9% (minimal).







In the top figure, we see that under repeated sampling, lmer is failing to estimate the true correlations for items (it's doing a better job for subjects because there is more data for subjects). Even though these are nuisance parameters, trying to estimate them for items in this dataset is a meaningless exercise (and the fact that the parameterization is going to influence the correlations is not the key issue here---that decision is made based on the hypotheses to be tested).

The lower figure shows that under repeated sampling, the effect (\mu is positive here, see my earlier post for details) is being missed much more often with a maximal model (black lines, 95% CIs) than with a varying intercepts model (red lines). The miss probability is 38% (maximal) vs 9% (minimal).



2. Run a high power study.




Now, it doesn't really matter whether you fit a maximal model or not. You're going to detect the effect either way. The upper plot shows that under repeated sampling, lmer will tend to detect the true correlations correctly. The lower plot shows that in almost 100% of the cases, the effect is detected regardless of whether we fit a maximal model (black lines) or not (red lines).

My conclusion is that if we want to send a message regarding best practice to psycholinguistics, it should not be to fit maximal models. It should be to run high power studies. To borrow a phrase from Andrew Gelman's blog (or from Rob Weiss's), if you are running low power studies, you are leaving money on the table.

Here's my code to back up what I'm saying here. I'm happy to be corrected!

https://gist.github.com/vasishth/42e3254c9a97cbacd490

Saturday, November 22, 2014

Simulating scientists doing experiments

Following a discussion on Gelman's blog, I was playing around with simulating scientists looking for significant effects. Suppose each of 1000 scientists runs 200 experiments in their lifetime, and suppose that 20% of the experiments are such that the null is true. Assume a low power experiment (standard in psycholinguistics; eyetracking studies even in journals like JML can easily have something like 20 subjects). E.g., with a sample size of 1000, delta of 2, and sd of 50, we have power around 15%. We will add the stringent condition that the scientist has to get one replication of a significant effect before they publish it.
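
As a rough check on the power figure, here is a minimal sketch in Python (standard library only). It rests on two assumptions of mine that the text does not spell out: that the design is a two-sided, two-sample comparison with 1000 observations per group, and that a normal approximation to the t-test is good enough at this sample size. The function name is hypothetical.

```python
from statistics import NormalDist


def two_sample_power(n_per_group, delta, sd, alpha=0.05):
    """Approximate power of a two-sided, two-sample test (normal approximation)."""
    std_normal = NormalDist()
    se = sd * (2 / n_per_group) ** 0.5       # standard error of the mean difference
    shift = delta / se                       # expected value of the test statistic
    z_crit = std_normal.inv_cdf(1 - alpha / 2)
    upper = 1 - std_normal.cdf(z_crit - shift)   # reject in the correct direction
    lower = std_normal.cdf(-z_crit - shift)      # reject in the wrong direction
    return upper + lower


if __name__ == "__main__":
    print("delta = 2: ", two_sample_power(1000, 2, 50))
    print("delta = 10:", two_sample_power(1000, 10, 50))
```

Under these assumptions the delta = 2 case should come out close to the 15% quoted above, and raising delta to 10 should push power close to 1.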

What is the proportion of scientists that will publish at least one false positive in their lifetime? That was the question. Here's my simulation. You can increase the effect_size to 10 from 2 to see what happens in high power situations.
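
For readers who want to experiment with this setup, here is a minimal stand-in sketch in Python (standard library only). It is not the original code, and it makes several simplifying assumptions of its own: each experiment is reduced to a single z-statistic from a two-sample design with 1000 observations per group; a "replication" just means a second, independent significant result in either direction; and the second number it returns (the share of published results that are false positives) is my addition so that changing effect_size has a visible consequence.

```python
import random
from statistics import NormalDist


def simulate_scientists(effect_size=2.0, sd=50.0, n_per_group=1000,
                        n_scientists=1000, n_experiments=200,
                        prob_null=0.2, alpha=0.05, seed=1):
    """Return (proportion of scientists publishing >= 1 false positive,
               share of all published results that are false positives)."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    shift = effect_size / (sd * (2 / n_per_group) ** 0.5)

    def significant(null_is_true):
        # One experiment, summarized by its z-statistic.
        z = rng.gauss(0.0 if null_is_true else shift, 1.0)
        return abs(z) > z_crit

    scientists_with_fp = 0
    published = 0
    published_false = 0
    for _ in range(n_scientists):
        has_fp = False
        for _ in range(n_experiments):
            null_is_true = rng.random() < prob_null
            # Publish only if the original AND one replication are significant.
            if significant(null_is_true) and significant(null_is_true):
                published += 1
                if null_is_true:
                    published_false += 1
                    has_fp = True
        scientists_with_fp += has_fp
    share_false = published_false / published if published else 0.0
    return scientists_with_fp / n_scientists, share_false


if __name__ == "__main__":
    print("low power  (effect_size = 2): ", simulate_scientists(effect_size=2.0))
    print("high power (effect_size = 10):", simulate_scientists(effect_size=10.0))
```

Under these assumptions, the chance that a single experiment leads to a published false positive is 0.2 × 0.05 × 0.05, so over 200 experiments the proportion of scientists with at least one should be roughly 1 − (1 − 0.0005)^200, or a little under 10%, regardless of power. What power changes in this sketch is how much of the published record consists of false positives.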



Comments and/or corrections are welcome.