
Saturday, April 04, 2020

Developing the right mindset for learning statistics: Some suggestions


Introduction

Over the last few decades, statistics has become a central part of the linguist’s toolkit. In psychology, there is a long tradition of using statistical methods for data analysis, but linguists and other cognitive scientists are relative newcomers to this area, and the formal statistics coursework provided in graduate programs is still quite sketchy. For example, as a grad student at Ohio State, in 1999 or 2000 or so, I did a four-week intensive course on statistics, after which I could do t-tests and ANOVAs on my data using JMP. Even in psychology departments, the amount of exposure students get to statistics varies a lot.

As part of Potsdam’s graduate linguistics/cognitive science/cognitive systems programs, we teach a sequence of five courses involving data analysis and statistics:

  • (Winter) Statistical data analysis 1
  • (Winter) Bayesian statistical inference 1
  • (Winter) Case studies in psycholinguistics
  • (Summer) Statistical data analysis 2
  • (Winter) Bayesian statistical inference 2

In addition, we teach (in winter) a Foundations of Mathematics course that covers undergraduate calculus, probability theory, and linear algebra. This course is designed for people who plan to take the machine learning classes in computer science, as part of the MSc in Cognitive Systems.

Students sometimes have difficulties with these courses, because there is an art to taking them that is not obvious. This short note aims to spell out some important aspects of that art.

In my experience, anyone can learn this way of approaching the study of statistics, which is an inherently difficult subject. Keep in mind that when learning something new, one might not understand everything, but that’s OK. The whole world is built on partial understanding (I myself have only a very incomplete picture of statistics, and it’s likely to stay that way). Someone once told me that the key difference between a mathematician and a “normal” person is that the mathematician will keep reading or listening even if they are not following the details of the presentation. One can learn to become comfortable with partial understanding, safe in the knowledge that one can come back to the open questions later.

Below, I am shamelessly going to borrow from this (to my mind) classic book:

Burger, E. B., & Starbird, M. (2012). The 5 elements of effective thinking. Princeton University Press.

I strongly advise you to read the Burger and Starbird book; it’s short and very practically oriented. I re-read it once a year on average just to remind myself of the main ideas.

My comments below are specifically oriented towards the learning of statistics as my colleagues and I teach it at Potsdam, so my examples are very specifically about the material I teach. The examples are really the only thing I add beyond what’s in the Burger and Starbird book.

Developing the right mindset: A checklist

Understand the “easy” stuff deeply

Ask yourself: when starting the study of statistics, what is the basic knowledge I will need (I review all these topics in my introductory classes)? You will not be in a position to answer this question when you start your studies, but after completing one or two courses you should revisit this question.

  • The basic elements of probability theory (sum rule, product rule, conditional probability, law of total probability)
  • Basic high-school algebra (e.g., given \(y = \frac{x}{1-x}\), solve for \(x\))
  • How to deal with exponents: \(x^2 \times x^3 = ?\) Is it \(x^5\) or \(x^6\)? We learnt this in school but we forgot it because we didn’t use it for many years. But now we need this knowledge!
  • What is a log? What is log(1)? What is log(0)? How to find out if one has forgotten?
  • What is a probability distribution? This requires some careful navigation. The key concepts here are the probability mass function (discrete case), the probability density function (continuous case), and the cumulative distribution function. For bivariate/multivariate distributions, the conditional, marginal, and joint distributions must be well understood intuitively. The key here is to develop graphical intuition, using simulation; I teach this approach in my courses. Statisticians use calculus when discussing the properties of probability distributions, but we can do all of this graphically and lose no information. In practice, we rarely if ever need to do any analytical work involving mathematical derivations; the software does all the work. However, it is important to understand the details intuitively, and here figures help a lot. A basic rule of thumb: whenever you are trying to understand something, try to visualize it graphically. Even something mundane like repeated coin tosses can be visualized graphically, and then everything becomes clear (see the sketch after this list).
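
To make that concrete, here is a minimal base-R sketch; the Binomial with n = 10 and p = 0.5 (ten fair coin tosses) is just an arbitrary running example. It first checks a couple of half-forgotten facts, then plots the probability mass function and the cumulative distribution function, and finally approximates the same distribution by simulation.

# Quick checks of half-forgotten facts, using R as a calculator:
2^2 * 2^3 == 2^5   # TRUE: exponents add when multiplying
log(1)             # 0
log(0)             # -Inf

# Ten fair coin tosses: the Binomial(10, 0.5) distribution.
n <- 10; p <- 0.5
x <- 0:n

op <- par(mfrow = c(1, 2))
# Probability mass function: P(X = x)
plot(x, dbinom(x, size = n, prob = p), type = "h",
     xlab = "number of heads", ylab = "probability", main = "PMF")
# Cumulative distribution function: P(X <= x)
plot(x, pbinom(x, size = n, prob = p), type = "s",
     xlab = "number of heads", ylab = "cumulative probability", main = "CDF")
par(op)

# Simulation gives (approximately) the same picture:
heads <- rbinom(10000, size = n, prob = p)
round(table(heads) / 10000, 3)  # compare with round(dbinom(x, n, p), 3)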

Going back repeatedly to these foundational ideas as one advances through the courses is very important. The goal should be to internalize them deeply, through graphical intuition.

Mistakes are your friend and teacher

Throughout our school years, we are encouraged to deliver the right answers, and penalized for delivering wrong answers. This style of schooling misses the point that mistakes can teach us more than our correct answers, if we compare the expected answer with ours and try to work out what we got wrong and why. This is called “error learning” or something like that in machine learning, and it works with humans too. Don’t be afraid to make mistakes, but try to make only new mistakes, and keep learning from them.

Students generally assume that I will judge them if they get something wrong. This is a false impression. As I say above, you can learn more from a mistake than from a correct answer. In my own studies of statistics, my grades were not stellar; you can see for yourself, they are all online:

https://vasishth-statistics.blogspot.com/2015/02/getting-statistics-education-review-of.html

Despite my mediocre grades, I still learnt a lot. Similarly, in graduate school at Ohio State, my grades were just OK, nothing to write home about. In computer science (also at Ohio State), my grades were usually in the B+ range; I rarely got an A-. I still learnt important and useful stuff.

How to develop curiosity: Solve the same problem more than one way, and generate your own questions

The Burger and Starbird book encourages the reader to become curious about a problem. Here, I suggest a very concrete strategy for doing this, e.g., when doing homework assignments.

  • First, create some mental space and time. Don’t try to squeeze the homework assignment into the last two hours before the submission deadline. Create a clear day ahead of you to explore a problem. I know that courses are designed these days to require at most 2-3 hours of work per week at home. This is an unfortunate productionalization of education that is now hurting the education system in Europe. If you need to stick to that tight schedule, do what you can in the limited time, but even then it is good not to leave the work to the last hours before submission. If you create more time, use it to explore in the following way.
  • Second, assuming you have some extra time, try to solve the given problem using different approaches. E.g., if the assignment asks you to use a lognormal likelihood in a linear mixed model, ask yourself if there is some way to solve the problem with the standard normal likelihood. If the problem asks you to work with brms, try to also solve the problem using Stan or even rstanarm, even if the assignment doesn’t ask you to do this. You are doing this for yourself, not for submitting the assignment. Even if the assignment doesn’t ask you to change the priors in a model, fool around with them to see what happens to the posteriors. If there is an LKJ(2) prior on a correlation parameter in the linear mixed model, find out what happens if you use LKJ(0.5) or LKJ(10). Etc. (A sketch of this kind of prior exploration appears after this list.)
  • Ask yourself what-if questions. Suppose you are learning about power analysis using simulation, a topic I cover in all my advanced classes, Bayesian or frequentist. This topic is ripe for exploration. Power depends essentially on three variables: effect size, sample size, and standard deviation. That is a fertile playground! I have spent so much time playing with power analyses that I can give ballpark estimates for my research problems quite accurately, without any simulation (of course, I always check my answers using simulation!). There are actually several different ways to compute power: you can use power.t.test, you can do it using simulation, etc. (see the power-simulation sketch after this list). This topic is perfect for developing a sense of curiosity, but you can do this for really any topic.
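
For the prior-exploration point above, here is a hypothetical brms sketch; the data frame dat and the variables rt, cond, and subj are placeholders for whatever your own assignment uses, and the model is only an illustration, not a prescribed analysis.

library(brms)

# Fit a lognormal mixed model with an LKJ(2) prior on the
# random-effects correlation:
fit_lkj2 <- brm(rt ~ cond + (cond | subj), data = dat,
                family = lognormal(),
                prior = set_prior("lkj(2)", class = "cor"))

# Same model, but with a prior that favors extreme correlations:
fit_lkj05 <- update(fit_lkj2, prior = set_prior("lkj(0.5)", class = "cor"))

# Compare the posterior summaries of the correlation parameter:
summary(fit_lkj2)
summary(fit_lkj05)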
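
And here is a minimal sketch of the power what-if game, using only base R; the effect size, standard deviation, and sample size are made-up numbers to play with.

# Assumed values for a two-group comparison (all made up):
effect <- 0.5   # true difference between the group means
sdev   <- 2     # standard deviation in each group
n      <- 30    # sample size per group

# 1. Analytical power:
power.t.test(n = n, delta = effect, sd = sdev)$power

# 2. Power by simulation: generate data, test, repeat.
nsim <- 2000
pvals <- replicate(nsim, {
  x <- rnorm(n, mean = 0,      sd = sdev)
  y <- rnorm(n, mean = effect, sd = sdev)
  t.test(x, y)$p.value
})
mean(pvals < 0.05)  # proportion of significant results = estimated power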

Keep careful notes

Statistics is not to be trifled with. I don’t expect anyone to memorize any formulas, but the logic of the analytical steps can get confusing. Keep good records of your learning. As an example, here is my entire record of four years of formal statistics study at the University of Sheffield (I did an MSc online, part time). These are cheat sheets I prepared while studying:

https://github.com/vasishth/MScStatisticsNotes

These notes are way more mathematical than anything I will teach at Potsdam. However, the principle is: organize your understanding of the material yourself. Don’t just let the teacher organize it for you (the teacher does do that, through slides and lecture notes!). We only understand things if we can actively produce and reorganize them ourselves.

Have a real problem you want to solve, and start simple

Usually, you will learn the most when you are desperate to get the answer to a data analysis problem. You will be working in a very small world of your own; you know your problem, and you are motivated to solve it. This is very different from homework assignments handed out of the blue by the teacher. For this reason, especially in statistics courses, it is useful to come to the course with a specific problem you want to solve. As the course unfolds, apply the methods you learn to your problem. For example, suppose your supervisor has already told you that you need to fit a generalized linear mixed model with a logit link function to the data. Where to start?

Suppose you are taking a frequentist course and know that at the end of the course you need to be able to complete the data analysis your supervisor asked you to do. You can start by simplifying the problem radically and working with what you already know. Could you run a t-test instead? It doesn’t matter that someone told you that that’s the wrong test; we are playing here. Could you just fit a simple linear model (again wrong, but this is exploration)? Just these two exercises will leave you with a lot of interesting insights to explore. Once you learn about linear mixed models, you can start exploring whether you can fit the model with the standard lmer function and what it would tell you. Once you reach that point, you are close to getting to the analysis you were told to do (see the sketch below). Even if I don’t teach it in class, you can use the last trick to get there, which I discuss next.
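
As an illustration of this kind of playing, here is a sketch with simulated binary data (all numbers are invented); it walks from the “wrong” t-test through a simple linear model to the generalized linear mixed model with a logit link, using lme4.

library(lme4)

# Simulated binary responses: 20 subjects, two conditions
# (coded -0.5/+0.5), 10 trials per condition.
set.seed(1)
dat <- expand.grid(subj = factor(1:20), cond = c(-0.5, 0.5), trial = 1:10)
dat$y <- rbinom(nrow(dat), size = 1, prob = plogis(0.5 + 1 * dat$cond))

# Step 1 ("wrong", but we are playing): a t-test on by-subject means.
by_subj <- aggregate(y ~ subj + cond, data = dat, FUN = mean)
t.test(y ~ cond, data = by_subj)

# Step 2 (also "wrong"): a simple linear model on the 0/1 responses.
summary(lm(y ~ cond, data = dat))

# Step 3: the analysis the supervisor actually asked for.
summary(glmer(y ~ cond + (1 | subj), data = dat, family = binomial))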

“Let me google that for you”: Learn to find information

Any time someone asks you a question you consider easily answered by googling, and you feel like being mean, you can use this website to deliver a sarcastic response: https://lmgtfy.com/. You simply type in the question and then send the link to the person asking the question. When they click on it, the question is typed into the Google search window, and they are invited to click on the search button. It’s a pretty passive-aggressive thing to do, and I advise you never to use this approach. :)

But despite the nasty aspect of the LMGTFY website, it does illustrate an important point: these days you can find a lot of information online. Here are some ways that I use the internet:

  • When I get an error message in RStudio I don’t understand (this happens pretty much daily), I just copy it and paste it into google’s search engine. Almost always, someone has had that same problem before and posted a solution. You have to be patient sometimes and look at a lot of the search engine results; but eventually you will find the answer. One gets better at this with experience. Sometimes one can’t solve the problem (e.g., I have a minor ongoing problem with Cairo fonts); it’s OK to give up and move on when it isn’t critical to the work one is doing.
  • For Bayesian data analysis, there are online forums one can ask questions at. E.g., discourse.mc-stan.org for Stan. For frequentist questions, there are R mailing lists (exercise: google them!).
  • Stackexchange. I have gotten authoritative answers from distinguished scientists about math problems that I don’t have the technical knowledge to solve. Often, someone else has asked a similar question already, so it can happen that one doesn’t even need to ask.
  • Google scholar gives you access to scientific articles via keyword search.
  • Blogs: I use Feedly to follow R-bloggers and other blogs like Andrew Gelman’s. Over time I have learnt a lot from reading blog posts.

Obviously, googling is not a fail-safe strategy. Sometimes you will get incorrect information. What I generally do is try to cross-check any technical claims from other sources like textbooks.

A common complaint in my statistics courses is that I don’t teach enough R. That’s because one can never teach enough R. One has to keep looking stuff up as needed; this is the skill that I am suggesting that you acquire.

Look for connections between ideas

Often, statistics is taught like a random catalogue of tests: t-test, ANOVA, linear mixed model, Fisher exact test, etc., etc. Interestingly, however, many of these seemingly disparate ideas have deep connections. The t-statistic and the F-statistic are connected (for a single-degree-of-freedom comparison, F = t²); the t-test and the linear mixed model are connected. Figuring out these relationships analytically is not difficult, but one needs some background to work it out. For example, see

https://vasishth-statistics.blogspot.com/2018/04/a-little-known-fact-paired-t-test-is.html

Even if one doesn’t know enough to carry out this analytical derivation, one can play with data to get a feel for the connection. The way I first got a hint about the t-test and linear mixed model connection (discussed above analytically) was by simulating data and then analyzing it two different ways (t-test vs linear mixed model), and getting the exact same statistics. It was only much later that I saw how to work this out analytically. The point is that simulation will get you very far in such investigations. You may not be able to prove stuff mathematically (I usually can’t), but you can still gain insight.
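
For what it’s worth, here is a sketch of that simulation exercise (all numbers are invented): with exactly one observation per subject in each of two conditions, the paired t-test and the varying-intercepts linear mixed model should give essentially the same t-value.

library(lme4)

# Simulated "reading times": 30 subjects, one observation per
# subject in each of two conditions (coded -0.5/+0.5).
set.seed(123)
nsubj <- 30
subj  <- factor(rep(1:nsubj, each = 2))
cond  <- rep(c(-0.5, 0.5), times = nsubj)
y     <- 400 + 20 * cond +
         rep(rnorm(nsubj, sd = 50), each = 2) +  # by-subject variability
         rnorm(2 * nsubj, sd = 30)               # residual noise

# Paired t-test:
t.test(y[cond == 0.5], y[cond == -0.5], paired = TRUE)

# Linear mixed model with varying intercepts by subject:
summary(lmer(y ~ cond + (1 | subj)))$coefficients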

Getting further in your study of statistics

It is possible to take the Potsdam courses and do solid statistical analyses without any further background. However, if you get curious about the underlying mathematics, or want to read more advanced textbooks, or want to get into the machine learning field, we teach a Foundations of Mathematics course that graduate students can take. Historically, people have benefitted from taking this course even if they had no previous university-level exposure to math. The course is optional and most people can skip it, but it’s available for anyone interested in going deeper.

Sunday, February 16, 2020

Installing papaja on Windows 10

I recently bought a Windows 10 2-in-1 machine (Dell Latitude) so that I could record my video lectures for class and write on the tablet while recording. I usually use Mac OS or Ubuntu. I haven't used Windows since maybe 2008.

One thing I needed to do was install papaja in RStudio. This turned out to be a nightmare. I kept getting all kinds of error messages which I had never seen before.

The problem turned out to be that Windows stores the packages in a library but is unable to delete them when a newer version has to be installed. You have to:

1. Change directory to the library location, which in my case was C:/Users/vasishth/Documents/R/win-library/3-6. I don't know if Windows has a terminal, but I changed the directory in File Explorer (the Windows equivalent of the Mac Finder).
2. Then, manually delete the packages that R complains about when installing papaja.
3. Restart RStudio, and then use install.packages to install all the offending packages. You may first have to remove a file called 00_Lock* or something in the directory equivalent to the one above on your machine.
4. Now, you can install papaja from the stable or development version as usual (see the sketch after this list).
5. Restart RStudio and papaja will be present in the templates.
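
As a rough sketch of steps 3 and 4 in R code (the list of offending packages will differ on your machine, and I am assuming that papaja still lives at crsh/papaja on GitHub):

# Step 3: reinstall the packages R complained about, for example:
install.packages(c("rmarkdown", "bookdown"))  # replace with your offending packages

# Step 4: install papaja; here, the development version from GitHub:
install.packages("remotes")
remotes::install_github("crsh/papaja")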

Saturday, October 05, 2019

Estimating the carbon cost of psycholinguistics conferences


Shravan Vasishth

10/5/2019

Note: If I have made some calculation error, please point it out and I will fix it.

Suggestion


At the University of Potsdam we are discussing how to reduce our carbon footprint in science-related work. One thought I had was that we could reduce our carbon footprint in psycholinguistics significantly (I bet nobody will believe I just used that word!) if we held the CUNY and AMLaP conferences in alternate years instead of every year.


Objections to alternating years for CUNY and AMLaP, and responses


I posed this question to my community, and got some interesting responses:

Response 1: More people would start traveling to each conference, neutralizing the gains


“…it’s not clear whether this would actually reduce the net amount of travel. In particular, if there will now be just one language conference (well, a language conference with a focus on experimental and computational approaches to language), with an alternating location (US vs. Europe), more US-based researchers may come to AMLaP than before, and more Europe-based researchers may come to CUNY than before. It seems important to assess this in figuring out whether this would help.”
Response: This is a reasonable point, but it presupposes that people will disregard the environmental cost of flying and only look out for their self-interest. I think this is unlikely; most people seem to be aware of how urgent this problem is. Most students at Potsdam seem to be very concerned and want to know how carbon emissions can be reduced (also by them); I assume this is the same in the US. Moreover, where regional conferences (e.g., the LSA, and plenty of European conferences each year) have been substituted for international ones, the data don’t seem to support the feared shift in travel (see this paper).


Response 2: Cohesiveness of the community would be damaged, and inequality would increase


“This was also my thought when I read this idea. Also, it would likely impact access and participation, especially for students. Only the best funded labs will be able to send people to conferences overseas, which means that many people will not participate every year, and the cohesiveness of our field will suffer.”
Response: This is also a reasonable point. But it is already the case that the best-funded labs are the only ones able to send people to conferences, and nobody does anything about it. Even if there is some additional cost of this nature, there is no free lunch: the idea that one can contribute to reducing environmental damage without giving up a single thing is unrealistic. The question is whether the cost is worth it. Not having a world to travel in at all might be too high a cost compared to these other costs.



Some back-of-the-envelope calculations


One paper states: “On a per capita basis, CO2 emissions for the ESA meetings ranged from 0.46-0.66 metric tons. The estimated per capita AAG carbon footprint, 0.58 metric tons of carbon dioxide, fell within this range of values.” (p. 67, Ponette-González et al. 2011).
Using the previous years’ AMLaP attendance counts, we have the following numbers of attendees:
dat<-read.table("amlapdat.txt",header=TRUE)

dat
##   year total
## 1 2015   194
## 2 2016   305
## 3 2017   300
## 4 2018   298
Taking the above estimates of .46-.66 metric tons per person on average, the minimum and maximum emissions per conference (in metric tons) are:
dat$mincost<-dat$total*.46

summary(dat$mincost)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    89.2   125.1   137.5   126.2   138.6   140.3
dat$maxcost<-dat$total*.66

summary(dat$maxcost)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     128     180     197     181     199     201
The full range looks like this:
dat

##   year total mincost maxcost
## 1 2015   194   89.24  128.04
## 2 2016   305  140.30  201.30
## 3 2017   300  138.00  198.00
## 4 2018   298  137.08  196.68
So each AMLaP conference generates about 90 to 201 metric tons of carbon emissions. CUNY may be comparable, perhaps a bit larger, so at the upper end. Simply multiplying by two, our annual carbon emissions would then be estimated to be:
## minimum (metric tons)

90*2
## [1] 180

## maximum (metric tons)

201*2
## [1] 402

The paper says: “Depending on the model, we estimated an average 18-59% reduction in carbon emissions for multiple regional compared with national meetings.” p 67.
For the smallest conference that we have data on, an 18%-59% reduction would amount to emissions ranging from:
180-.18*180 ## down from 180 

## [1] 147.6

180-.59*180 ## down from 180

## [1] 73.8

For the largest conference we have data on, an 18%-59% reduction would amount to emissions ranging from:
201-.18*201 ## down from 201 

## [1] 164.82

201-.59*201 ## down from 201

## [1] 82.41

What does a maximum reduction of 119 metric tons (201-82) mean? As a baseline, consider that India produced 1.7 metric tons per capita in an entire year (2014; you can google this).


Conclusion


There could be a significant reduction in carbon emissions. There will be a cost of course, but it may not be environmental. (There could be unintended environmental costs, such as people producing more babies as a result of not going to conferences, publishing more papers, or having more time to produce more trained psycholinguists per year.)
In particular, hoping that we can go on with business as usual is guaranteed to lead to a net loss.


Future directions


  • The above analysis is probably very coarse-grained. One could do a more principled analysis of emission costs by using data from CUNY 2020 and AMLaP 2020. Since Brian Dillon and I are holding these two conferences, we could coordinate our analyses. Incidentally, in conferences in general, about 10% of the attendees account for 50% of the emissions; one could take the individual-level cost into account in a more nuanced manner.
  • One could simply implement the change from 2021 onwards and track the change in carbon emissions in the years pre-2021 and post-2021, to see whether the fear that carbon emissions will go up instead of down is realized. I offer to carry out that analysis. If emissions do go up, the change would obviously be a bad idea and should be scrapped. Based on the papers I have read, I would pre-register my prediction that this will not happen. Obviously, it would be too late to do anything by the time enough data come in.
  • Comments on this post are welcome, and suggestions for improvement, or corrections, are also most welcome!

    Sunday, September 23, 2018

    Recreating Michael Betancourt's Bayesian modeling course from his online materials

    Several people wanted to have the slides from Betancourt's lectures at SMLP2018. It is possible to recreate most of the course from his writings:

    1. Intro to probability:
    https://betanalpha.github.io/assets/case_studies/probability_theory.html

    2. Workflow:
    https://betanalpha.github.io/assets/case_studies/principled_bayesian_workflow.html

    3. Diagnosis:
    https://betanalpha.github.io/assets/case_studies/divergences_and_bias.html

    4. HMC: https://www.youtube.com/watch?v=jUSZboSq1zg

    5. Validating inference: https://arxiv.org/abs/1804.06788

    6. Calibrating inference: https://arxiv.org/abs/1803.08393


    Thursday, August 16, 2018

    How to capitalize on a priori contrasts in linear (mixed) models: A tutorial

    We wrote a short tutorial on contrast coding, covering the common contrast coding scenarios, among them: treatment, Helmert, ANOVA, sum, and sliding (successive differences) contrasts. The target audience is psychologists and linguists, but really it is for anyone doing planned experiments.
     
    The paper has not been submitted anywhere yet. We are keen to get user feedback before we do that. Comments and criticism very welcome. Please post comments on this blog, or email me.
     
    Abstract:

    Factorial experiments in research on memory, language, and in other areas are often analyzed using analysis of variance (ANOVA). However, for experimental factors with more than two levels, the ANOVA omnibus F-test is not informative about the source of a main effect or interaction. This is unfortunate as researchers typically have specific hypotheses about which condition means differ from each other. A priori contrasts (i.e., comparisons planned before the sample means are known) between specific conditions or combinations of conditions are the appropriate way to represent such hypotheses in the statistical model. Many researchers have pointed out that contrasts should be "tested instead of, rather than as a supplement to, the ordinary `omnibus' F test" (Hayes, 1973, p. 601). In this tutorial, we explain the mathematics underlying different kinds of contrasts (i.e., treatment, sum, repeated, Helmert, and polynomial contrasts), discuss their properties, and demonstrate how they are applied in the R System for Statistical Computing (R Core Team, 2018). In this context, we explain the generalized inverse which is needed to compute the weight coefficients for contrasts that test hypotheses that are not covered by the default set of contrasts. A detailed understanding of contrast coding is crucial for successful and correct specification in linear models (including linear mixed models). Contrasts defined a priori yield far more precise confirmatory tests of experimental hypotheses than standard omnibus F-test.


     Full paper: https://arxiv.org/abs/1807.10451
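
    To give a flavor of what the paper covers, here is a small R sketch (not taken from the paper itself): it constructs sliding-difference contrasts for a three-level factor from a hypothesis matrix via the generalized inverse, and checks the result against MASS::contr.sdif.

    library(MASS)  # for ginv() and contr.sdif()

    # Two planned comparisons for a three-level factor:
    # level2 - level1, and level3 - level2.
    hypotheses <- rbind(c(-1, 1, 0),
                        c( 0, -1, 1))

    # The contrast matrix is the generalized inverse of the hypothesis matrix:
    ginv(hypotheses)

    # The built-in sliding-difference ("repeated") contrasts agree:
    contr.sdif(3)

    # Assign the contrasts to a factor before fitting a linear (mixed) model:
    f <- factor(c("low", "mid", "high"), levels = c("low", "mid", "high"))
    contrasts(f) <- contr.sdif(3)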

    Thursday, July 26, 2018

    Stan Pharmacometrics conference in Paris July 24 2018

    I just got back from attending this amazing conference in Paris:

    http://www.go-isop.org/stan-for-pharmacometrics---paris-france

    A few people were disturbed/surprised by the fact that I am a linguist ("what are you doing at a pharmacometrics conference?"). I hasten to point out that two of the core developers of Stan are linguists too (Bob Carpenter and Mitzi Morris). People seem to think that all linguists do is correct other people's comma placements. However, despite my being a total outsider to the conference, the organizers were amazingly welcoming; they even allowed me to join the speakers' dinner and treated me like a regular guest.

    Here is a quick summary of what I learnt:

    1. Gelman's talk: The only thing I remember from his talk was the statement that when economists fit multiple regression models and find that one predictor's formerly significant effect was wiped out by adding another predictor, they think that the new predictor explains the old predictor. Which is pretty funny. Another funny thing was that he had absolutely no slides, and was drawing figures in the air, and apologizing for the low resolution of the figures.

     2. Bob Carpenter gave an inspiring talk on the exciting stuff that's coming in Stan:

    - Higher speeds (Stan 2.10 will be 80 times faster with 100 cores)

    - Stan book

    - New functionality (e.g., tuples, multivariate normal RNG)

    - Gaussian process models will soon become tractable

    - Blockless Stan is coming! This will make Stan code look more like JAGS (which is great). Stan will forever remain backward compatible so old code will not break.

    My conclusion was that in the next few years, things will improve a lot in terms of speed and in terms of what one can do.

    3. Torsten and Stan:

    - Torsten seems to be a bunch of functions to do PK/PD modeling with Stan.

    - Bill Gillespie on Torsten and Stan: https://www.metrumrg.com/wp-content/uploads/2018/05/BayesianPmetricsMBSW2018.pdf

    - Free courses on Stan and PK/PD modeling: https://www.metrumrg.com/courses/

    4. Mitzi Morris gave a great talk on disease mapping (accident mapping in NYC) using conditional autoregressive models (CAR). The idea is simple but great: one can model the correlations between neighboring boroughs. A straightforward application is in EEG, modeling data from all electrodes simultaneously, and modeling the decreasing correlation between neighbors. This is low-hanging fruit, esp. with Stan 2.18 coming.

    5. From Bob I learnt that one should never provide free consultation (I am doing that these days), because people don't value your time then. If you charge them by the hour, this sharpens their focus. But I feel guilty charging people for my time, especially in medicine, where I provide free consulting: I'm a civil servant and already get paid by the state, and I get total freedom to do whatever I like. So it seems only fair that I serve the state in some useful way (other than studying processing differences in subject vs object relative clauses, that is).

    For psycholinguists, there is a lot of stuff in pharmacometrics that will be important for EEG and visual world data: Gaussian process models, PK/PD modeling, spatial+temporal modeling of a signal like EEG. These tools exist today but we are not using them. And Stan makes a lot of this possible now or very soon now.

    Summary: I'm impressed.

    Friday, June 01, 2018

    Soliciting comments on paper

    I welcome comments and criticism on the following paper:

    Title: The statistical significance filter leads to overoptimistic expectations of replicability
    Authors: Vasishth, Mertzen, Jäger, Gelman


    Abstract: It is well-known in statistics (e.g., Gelman & Carlin, 2014) that treating a result as publishable just because the p-value is less than 0.05 leads to overoptimistic expectations of replicability. These overoptimistic expectations arise due to Type M(agnitude) error: when underpowered studies yield significant results, effect size estimates are guaranteed to be exaggerated and noisy. These effects get published, leading to an overconfident belief in replicability. We demonstrate the adverse consequences of this statistical significance filter by conducting seven direct replication attempts (268 participants in total) of a recent paper (Levy & Keller, 2013). We show that the published claims are so noisy that even non-significant results are fully compatible with them. We also demonstrate the contrast between such small-sample studies and a larger-sample study; the latter generally yields a less noisy estimate but also a smaller effect magnitude, which looks less compelling but is more realistic. We reiterate several suggestions from the methodology literature for improving best practices.

    You can download the pdf from here: https://osf.io/eyphj/