Shravan Vasishth's Slog (Statistics blog)

Friday, December 11, 2009

Statistics in linguistics

People in linguistics tend to treat statistical theory as something that can be outsourced--we don't really need to know anything about the details, we just need to know which button to click.

People easily outsource statistical knowledge in an empirical paper, but the same people would be appalled if they hired an assistant to work out the technical details of syntactic theory for a syntax paper.

The statistics *is* the science; it's not some extra appendage that can be outsourced.

Thursday, April 23, 2009

How to get ESS style indentation in TextMate

This should be standard in TextMate; I don't know why one has to go through so many steps to get it working:

http://gragusa.wordpress.com/2007/11/11/textmate-emacs-like-indentation-for-r-files/

How to update the R bundle in TextMate

Got this from the web somewhere:

Just create a script with the following content:

#!/bin/sh

# Export the locale so that svn actually sees it
LC_CTYPE=en_US.UTF-8
export LC_CTYPE
SVN=$(command -v svn)
BUNDLES="/Library/Application Support/TextMate/Bundles"

echo "Changing to Bundles directory..."
mkdir -p "$BUNDLES"
cd "$BUNDLES" || exit 1

if [ -d "$BUNDLES/R.tmbundle" ]; then
    echo "R bundle already exists - updating..."
    "$SVN" up "R.tmbundle"
else
    echo "Checking out R bundle..."
    "$SVN" --username anon --password anon co http://macromates.com/svn/Bundles/trunk/Bundles/R.tmbundle/
fi

echo "Reloading bundles in TextMate..."
osascript -e 'tell app "TextMate" to reload bundles'

Wednesday, July 04, 2007

Selection bias in journal articles

Journals dealing in psycholinguistic research generally do not publish null results, because they are "inconclusive". So it's completely possible that out of 100 experiments, 95 are inconclusive and 5 are "significant", but that all five are Type I errors. And it's those 5 experiments that will get published.

The naive rebuttal to this would be that such a situation would only rarely arise. But the non-obvious thing is that rare events do happen. If we published only those five articles, then how would we draw the conclusion that we are not in Type I la la land?
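To make the scenario concrete, here is a minimal simulation sketch. Everything in it is an assumption chosen for illustration: 100 two-sample experiments with 20 observations per group, both groups drawn from the same normal distribution (so the null hypothesis is true in every experiment), and a two-sided t-test at the 0.05 level. Any "significant" result it produces is, by construction, a Type I error.

```r
# Simulate 100 experiments in which there is no true effect at all.
set.seed(1)          # arbitrary seed, only so the run is reproducible
n_experiments <- 100
n_per_group   <- 20  # hypothetical sample size per condition

p_values <- replicate(n_experiments, {
  a <- rnorm(n_per_group)  # condition a
  b <- rnorm(n_per_group)  # condition b: same distribution as a
  t.test(a, b)$p.value
})

# Experiments that would count as "significant" and hence publishable
significant   <- p_values < 0.05
n_significant <- sum(significant)

cat("Significant (all Type I errors):", n_significant,
    "out of", n_experiments, "\n")
```

If only the experiments flagged in `significant` were written up, a reader of the resulting literature would have no way of telling that the other experiments ever existed.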

Saturday, April 28, 2007

Rlang mailing list

Roger Levy has created a possibly useful mailing list for exchanging questions about the use of R for language research:

https://ling.ucsd.edu/mailman/listinfo.cgi/r-lang

Saturday, April 21, 2007

How to choose between a multiplicity of sexy models

It's websites like this that give model selection such a bad reputation in science:

http://www.modelselection.org/

Nice introduction to R

http://heather.cs.ucdavis.edu/~matloff/r.html

A blog most amazing

I just found a most astounding blog via Gelman's blog:

http://emotion.inrialpes.fr/~dangauthier/blog/

Tuesday, April 17, 2007

How to extract SEs from lmer fixed effects estimates

Extracting fixed effects coefficients from lmer is easy:

fixef(lmer.fit)

But extracting SEs of those coefficients is, well, trivial, but you have to know what to do. It's not obvious:

Vcov <- vcov(lmer.fit, useScale = FALSE)
se <- sqrt(diag(Vcov))
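
The same recipe (square roots of the diagonal of the variance-covariance matrix) can be checked on a model where the standard errors are easy to see. The sketch below uses an ordinary lm fit on the built-in cars data rather than an lmer fit, purely so that it runs in base R without lme4; the object names are made up for the illustration.

```r
# An ordinary linear model on a data set that ships with R
fit <- lm(dist ~ speed, data = cars)

# Standard errors via the variance-covariance matrix of the coefficients
Vcov <- vcov(fit)
se   <- sqrt(diag(Vcov))

# The same standard errors as reported in the summary table
se_summary <- summary(fit)$coefficients[, "Std. Error"]

# Check that the two routes agree
agree <- isTRUE(all.equal(unname(se), unname(se_summary)))
print(se)
print(agree)
```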

Saturday, February 17, 2007

Hmisc: how to increase magnification

One non-obvious thing (at least to me) about Hmisc's xYplot function is that to increase magnification or other parameters of a graph component, you have to do the following.

xlab=list("Condition",cex=2)

I.e., you have to make a list out of the parameter, and add whatever information you need. This works generally for any of the xYplot parameters.
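
Here is a small sketch of the same idiom in a complete call. It assumes lattice's xyplot (on which xYplot is built) instead of xYplot itself, so that it runs without Hmisc installed; the data frame and the labels are invented for the example.

```r
library(lattice)

# Invented data: reading times in two conditions
d <- data.frame(
  condition = factor(rep(c("a", "b"), each = 5)),
  rt        = c(310, 295, 330, 305, 320, 360, 345, 370, 355, 350)
)

# Wrap each label in a list to pass extra settings such as cex
p <- xyplot(rt ~ condition, data = d,
            xlab = list("Condition", cex = 2),
            ylab = list("Reading time (ms)", cex = 1.5))

# print(p) draws the plot on the current graphics device
```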

Thursday, January 25, 2007

Using WinBUGS with the Gelman and Hill book on Intel Macs

I finally installed Windows on my Mac (a traumatic experience) and finally got the code working. However, the startup instructions on the website of the book did not work for me. I offer a working example for other souls as clueless as myself. The first problem is that the libraries have to be installed manually; they do not install automatically as advertised. Second, the library R2WinBUGS has to be loaded explicitly to run the critical bugs command.
Also, if anyone out there is thinking of installing a dual boot environment on a Mac in order to install WinBUGS, there is a bug (no pun intended) in the license installation of WinBUGS. The decode command for the license does not work as advertised, but the license installs anyway.
The working version is here: http://www.ling.uni-potsdam.de/~vasishth/temp/schools2.R

Monday, January 22, 2007

Some expensive lessons I recently learnt about R/Sweave

1. If you are going to generate lots of LaTeX tables automatically from an Rnw file, LABEL THEM.

2. weaver does not work with xYplot. If you are using the Hmisc library, just don't use weaver. I will present a solution here sometime soon.

The solution: set caching to off (cache=off) in the chunk that loads the Hmisc library and runs the xYplot command(s). You can turn caching on before and after the chunk, but xYplots need to be computed without caching.

3. xtable is unable to identify the fact that an R output line containing, e.g., log(sigma^2), has to be in math-environment in the tex. In Sweave this has the disastrous consequence that the .tex file does not compile. My kludgy solution is to search and replace the .tex file after Sweaving it.
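
The search-and-replace kludge can be done from R itself after Sweaving. The following is only a sketch: the file is a temporary stand-in for the real Sweave output, the table row is invented, and it assumes the offending text appears literally as log(sigma^2).

```r
# Stand-in for the .tex file that Sweave produced
tex_file <- tempfile(fileext = ".tex")
writeLines(c("\\begin{tabular}{lr}",
             "log(sigma^2) & 0.52 \\\\",
             "\\end{tabular}"),
           tex_file)

tex <- readLines(tex_file)

# Replace the raw R output with a math-mode version.
# Parentheses and ^ are escaped in the pattern; in the replacement,
# four backslashes in the R source yield one backslash in the file.
tex_fixed <- gsub("log\\(sigma\\^2\\)",
                  "$\\\\log(\\\\sigma^2)$",
                  tex)

writeLines(tex_fixed, tex_file)
```

Run this on the .tex file before calling latex on it.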

It's frustrating that such good tools can sometimes be such a pain in the ass. I guess one should be grateful they are there at all.

Saturday, January 13, 2007

Incomplete Review of Gelman and Hill's Data Analysis using Regression and Multilevel/Hierarchical Models

I'm writing this somewhat cranky review as I read the book. Compared to the Pinheiro and Bates book, the examples in this book are initially irritatingly difficult to get working. A major problem with the book is that code involving BUGS runs only on Windows. This excludes readers like me from the action. So I have to wait until I get a Windows machine--but do I really want to start using Windows now? It would have been more helpful if their webpage prominently mentioned this detail (that the book is Windows specific). Had they done that I would probably not have bought it. But now that I have paid for it I am going to read it.

The website for the book has the data in a pretty disorganized way--why not just make a library? The authors do have a package called arm on CRAN, but it does not install on any OS except Windows (the first R package I have seen with this property in my seven years as an R user). I tried to wget -r the ~gelman/arm/examples directory but ended up with all kinds of other crap in my directory as well, which was annoying. A zip archive could not hurt.

Chapters 1-3

I did not get a huge amount out of these chapters that was deeply interesting, but it is a good intro for newcomers to regression.

The code for the example in chapter 3 doesn't work on non-windows machines. Here is a working version.

Chapter 4

The book becomes more and more exciting from about this point onwards. Only one grouse:

Chapter 4 has some principles for carrying out regression for prediction (section 4.6), but it is far from clear where they come from, and the principles have a cookbookey feel (do this, don't do that, without explaining why). It would have been better if the authors had taught the reader to reason about the problem (surely those are the real principles, and the presented principles the consequences of the thought process generated by those principles).

[to be continued]

Thursday, January 04, 2007

Great statistics courses that use R

1. http://www.unc.edu/courses/2006spring/ecol/145/001/docs/lectures.htm

2. http://www.stat.washington.edu/vanduijn/560/

Statistical learning theory:

3. http://www.ece.rice.edu/~fk1/classes/ELEC697.htm

4. http://www.ulb.ac.be/di/map/gbonte/Stat104.html

Wednesday, January 03, 2007

Null hypotheses, significance testing and all that jazz

Some amazing articles I've recently read in my ample spare time:

1. The Insignificance of Null Hypothesis Significance Testing
Jeff Gill
Political Research Quarterly, Vol. 52, No. 3 (Sep., 1999), pp. 647-674
doi:10.2307/449153

2. Andrew Gelman's article

3. And this one: http://www.npwrc.usgs.gov/resource/methods/statsig/index.htm

4. Bowers and Gelman on Exploratory Data Analysis with Hierarchical Linear Models (AKA Multilevel models)

Suitably stunned into silence, the reader may then have the following practical question: how to present one's HPD intervals in a journal, and what else to present?

Here's an answer from Doug Bates.

Tuesday, January 02, 2007

Great article: EDA for HLMs

There's an interesting paper I just read that comes with Sweave/R code that the article uses:

http://www-personal.umich.edu/~jwbowers/papers.html

It's called EDA for HLMs, and advocates an exploratory data analysis when trying to understand data (as opposed to blindly searching for a yes/no answer: did the p-value fall below 0.05?). In psycholinguistics, we are still a long way from this; conventionally, we plod along well-beaten paths.

Installing weaver

It is not immediately obvious how one can install weaver (thanks to balajis for telling me about it--see his comment to the Sweave speed issues).

Do the following as superuser for a system-wide install:

1. install digest (from CRAN)
2. install codetools from http://bioconductor.org/packages/1.9/omegahat/html/codetools.html
3. install weaver from http://www.bioconductor.org/packages/1.9/bioc/html/weaver.html

A clunky way to install on Mac OS X and Linux is: as superuser do

R CMD INSTALL packagetarball

I'll add information on how to use it later, but one can always consult the weaver manual/vignette.

How I maintain my data

As soon as you have a lot of experiments floating around, you tend to get a proliferation of code and data files. Usually, chaos ensues. If someone asks you for the data of some experiment you published, you can (a) ignore the request (this is the most economical but also the least ethical response) or (b) send them something they can realistically use. This post is about how to carry out (b).

Example.

Suppose I have a collection of data and R code that I will cryptically call intml.

1. package.skeleton("intml")
This will create some directories (see previous post)
2. now add the data to the data directory
3. add the .Rnw file you used to analyze the data as a vignette

Build a package as indicated in the last post, and send it to the person who asked for it. Or make it available on CRAN.
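
A minimal sketch of steps 1 and 2, under some assumptions: the data frame is invented, the skeleton is written to a temporary directory rather than the working directory, and the objects are handed to package.skeleton explicitly through its list and environment arguments so that nothing else in the workspace ends up in the package.

```r
# Put the objects that belong in the package into their own environment
pkg_env <- new.env()
assign("intml_data",
       data.frame(subject = 1:4, rt = c(312, 298, 350, 327)),  # invented data
       envir = pkg_env)

# Write the skeleton into a throwaway directory
pkg_parent <- tempfile("pkgs")
dir.create(pkg_parent)

package.skeleton(name = "intml",
                 list = "intml_data",
                 environment = pkg_env,
                 path = pkg_parent)

# Inspect what was created
pkg_files <- list.files(file.path(pkg_parent, "intml"), recursive = TRUE)
print(pkg_files)
```

The .Rnw file from step 3 still has to be copied into the package by hand.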

It's simple and it enforces a certain self-discipline. Nothing like the knowledge that anyone can read your code to force you to write it properly :-)

Monday, January 01, 2007

Putting together a library

The Introduction to R and other documentation on CRAN tell you how to build a library. But it is remarkably hard to find a step-by-step how-to for putting together an R library. Here is a good one.

Wednesday, December 13, 2006

Sweave for complex projects (speed issues)

One problem my colleagues and I face is that our statistical analysis projects quickly become very complex, and recompiling the Sweave file becomes a slow process each time we update the code or just run it again.

I am slowly compiling a list of available solutions to this problem (the real issues are lack of speed, lack of modularity):

Here is what I have found so far:

1. Use \SweaveInput for including modular code
2. Use makefiles a la Deepayan Sarkar: here
3. Another solution that relates to the present problem: here
4. Finally, I think one should make vignettes/packages out of one's research projects so that the whole Rnw file does not need to be compiled--the needed objects can be made visible by doing something like:

library(mydata)

There is a bit of work involved in making the package, but the payoff is tremendous. The R documentation provides details on how to build packages, but maybe I will put a simple example here.