Friday, April 07, 2017

A comment on the DeLong et al 2005 nine-labs replication failure

Nieuwland et al replication attempts of DeLong at al. 2005

Note: Revised 8 April 2017 following comments from Nieuwland and others.

Recently, Nieuwland et al did an interesting series of replication attempts of the DeLong et al 2005 Nature Neuroscience paper. Nieuwland et al report a failure to replicate the original effect. This is just a quick note to suggest that it may be difficult to claim that this is a replication failure. I focus only on the article data.

First we load the Nieuwland et al data (available from https://osf.io/eyzaq).

articles <-read.delim("public_article_data.txt", quote="")
#head(articles)

Then make sure columns are correct data types, and get lab names, convert cloze to z-scores as Nieuwland et al did:

articles$item <-as.factor(articles$item)
articles$cloze <- as.numeric(as.character(articles$cloze))

labnames<-unique(articles$lab)

# Z-transform cloze and prestimulus interval
articles$zcloze <- scale(articles$cloze, center = TRUE, scale = TRUE)
articles$zprestim <- scale(articles$prestim, center = TRUE, scale = TRUE)

Then use lmer to compute estimates from each lab separately. I also fit ``maximal’’ models in Stan (HT: Barr et al 2013 ;) but nothing much changes so I use these varying slopes models without correlations:

library(lme4)

## Loading required package: Matrix

# models for article fit for each lab separately:
res<-matrix(rep(NA,9*2),ncol=2)
for(i in 1:9){
model <- lmer(base100 ~ zcloze + ( 1| subject) +
                (zcloze|| item), 
              data = subset(articles,lab==labnames[i]),
                control=lmerControl(
                  optCtrl=list(maxfun=1e5)))
res[i,]<-summary(model)$coefficients[2,1:2]
}

Next, we fit a random-effects meta-analysis using the nine-labs replications.

Random effects meta-analysis:

The model specification was as follows.

Assume that:

$y_i$ is the observed effect in microvolts in the $i$-th study with $i=1\dots n$;
$\theta$ is the true (unknown) effect, to be estimated by the model;
$\sigma_{i}^2$ is the true variance of the sampling distribution; each $\sigma_i$ is estimated from the standard error available from the study $i$;
The variance parameter $\tau^2$ represents between-study variance.

We can construct a random-effects meta-analysis model as follows:

\[\begin{equation} \begin{split} y_i \mid \theta_i,\sigma_i^2 \sim & N(\theta_i, \sigma_i^2) \quad i=1,\dots, n\\ \theta_i\mid \theta,\tau^2 \sim & N(\theta, \tau^2), \\ \theta \sim & N(0,10^2),\\ \tau \sim & N(0,10^2)T(0,) \hbox{(truncated normal)} \\ \end{split} \end{equation}\]

The priors shown above are just examples. Other priors can be used and are probably more suitable in this case. I use Gelman et al 2014’s non-centralized parameterization in any case in the Stan code below because of the sparse data we have here.

## Stan meta-analysis:
library(rstan)

## Loading required package: ggplot2

## Loading required package: StanHeaders

## rstan (Version 2.14.1, packaged: 2017-01-20 19:24:23 UTC, GitRev: 565115df4625)

## For execution on a local, multicore CPU with excess RAM we recommend calling
## rstan_options(auto_write = TRUE)
## options(mc.cores = parallel::detectCores())

rstan_options(auto_write = TRUE)
options(mc.cores = parallel::detectCores())

dat<-list(y=res[,1],n=length(res[,1]),s=res[,2])

## is this the DeLong et al mean and se? Could add it to meta-analysis if original data ever becomes available.
#dat$y<-c(dat$y,0.4507929)
#dat$s<-c(dat$s,.5)
#dat$n<-dat$n+1

fit <- stan(file='rema2.stan', data=dat,
            iter=2000, chains=4, seed=987654321,
            control = list(adapt_delta = 0.99))

paramnames<-c("mu","tau","theta")
print(fit,pars=paramnames,prob=c(0.025,0.975))

## Inference for Stan model: rema2.
## 4 chains, each with iter=2000; warmup=1000; thin=1; 
## post-warmup draws per chain=1000, total post-warmup draws=4000.
## 
##          mean se_mean   sd  2.5% 97.5% n_eff Rhat
## mu       0.11       0 0.07 -0.04  0.25  2153    1
## tau      0.11       0 0.08  0.00  0.31  1496    1
## theta[1] 0.11       0 0.10 -0.09  0.31  4000    1
## theta[2] 0.13       0 0.10 -0.08  0.36  4000    1
## theta[3] 0.10       0 0.11 -0.13  0.33  4000    1
## theta[4] 0.10       0 0.11 -0.13  0.30  4000    1
## theta[5] 0.03       0 0.11 -0.23  0.22  4000    1
## theta[6] 0.06       0 0.13 -0.25  0.27  4000    1
## theta[7] 0.14       0 0.11 -0.05  0.39  4000    1
## theta[8] 0.18       0 0.11  0.00  0.42  4000    1
## theta[9] 0.15       0 0.11 -0.04  0.40  4000    1
## 
## Samples were drawn using NUTS(diag_e) at Sat Apr  8 19:51:28 2017.
## For each parameter, n_eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor on split chains (at 
## convergence, Rhat=1).

params<-extract(fit,pars=paramnames)

Results

The interesting thing here is the posterior distribution of $\mu$, the effect of cloze probability on the N400.

## posterior probability that mu is > 0:
mean(params$mu>0)

## [1] 0.9345

mean(params$mu)

## [1] 0.1084961

## lower and upper bounds of credible intervals:
lower<-quantile(params$mu,prob=0.025)
upper<-quantile(params$mu,prob=0.975)
SE<-(upper-lower)/4 ## approximate, assuming symmetry which seems reasonable here
hist(params$mu,main="Posterior distribution of effect \n Article",xlab="mu",freq=F)
abline(v=0)
arrows(x0=lower,y0=1,x1=upper,y1=1,angle=90,code=3)

Thus, even in the Nieuwland et al nine-labs data, the N400 amplitude decreases (becomes more positive) with increasing cloze probability. The posterior probability that the coefficient for cloze is greater than 0 is 94%. From this, one cannot really say that the Nieuwland et al studies are a failure to replicate the DeLong study.

The next step would be to compare the original DeLong et al data using the same analysis done above. But I was unable to get the original data up until now, so I’m posting my comment on my blog.

Prestimulus interval

Mante Nieuwland pointed out to me that the effect can be seen already in the prestimulus interval. Here is the evidence for that.

# models for article fit for each lab separately:
res<-matrix(rep(NA,9*2),ncol=2)
for(i in 1:9){
model <- lmer(prestim ~ zcloze + ( 1| subject) +
                (zcloze|| item), 
              data = subset(articles,lab==labnames[i]),
                control=lmerControl(
                  optCtrl=list(maxfun=1e5)))
res[i,]<-summary(model)$coefficients[2,1:2]
}

dat<-list(y=res[,1],n=length(res[,1]),s=res[,2])

## is this the DeLong et al mean and se? Could add it to meta-analysis if original data ever becomes available.
#dat$y<-c(dat$y,0.4507929)
#dat$s<-c(dat$s,.5)
#dat$n<-dat$n+1

fit <- stan(file='rema2.stan', data=dat,
            iter=2000, chains=4, seed=987654321,
            control = list(adapt_delta = 0.99))

paramnames<-c("mu","tau","theta")
print(fit,pars=paramnames,prob=c(0.025,0.975))

## Inference for Stan model: rema2.
## 4 chains, each with iter=2000; warmup=1000; thin=1; 
## post-warmup draws per chain=1000, total post-warmup draws=4000.
## 
##           mean se_mean   sd  2.5% 97.5% n_eff Rhat
## mu        0.07       0 0.07 -0.06  0.21  1497    1
## tau       0.12       0 0.08  0.01  0.30   897    1
## theta[1]  0.18       0 0.11  0.00  0.41  2632    1
## theta[2] -0.01       0 0.11 -0.26  0.17  2042    1
## theta[3]  0.11       0 0.11 -0.10  0.36  4000    1
## theta[4]  0.13       0 0.10 -0.04  0.34  4000    1
## theta[5]  0.03       0 0.09 -0.16  0.20  4000    1
## theta[6]  0.01       0 0.11 -0.23  0.20  4000    1
## theta[7]  0.03       0 0.10 -0.19  0.22  4000    1
## theta[8]  0.09       0 0.09 -0.08  0.28  4000    1
## theta[9]  0.11       0 0.09 -0.06  0.29  4000    1
## 
## Samples were drawn using NUTS(diag_e) at Sat Apr  8 19:51:33 2017.
## For each parameter, n_eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor on split chains (at 
## convergence, Rhat=1).

params<-extract(fit,pars=paramnames)

## posterior probability that mu is > 0:
mean(params$mu>0)

## [1] 0.88175

mean(params$mu)

## [1] 0.07359659

## lower and upper bounds of credible intervals:
lower<-quantile(params$mu,prob=0.025)
upper<-quantile(params$mu,prob=0.975)
SE<-(upper-lower)/4 ## approximate, assuming symmetry which seems reasonable here
hist(params$mu,main="Posterior distribution of effect \n Prestimulus interval",xlab="mu",freq=F)
abline(v=0)
arrows(x0=lower,y0=1,x1=upper,y1=1,angle=90,code=3)

Retrospective power calculation

We now compute retrospective power and Type S/M errors:

## Gelman and Carlin function:
retrodesign <- function(A, s, alpha=.05, df=Inf, n.sims=10000){
  z <- qt(1-alpha/2, df)
  p.hi <- 1 - pt(z-A/s, df)
  p.lo <- pt(-z-A/s, df)
  power <- p.hi + p.lo
  typeS <- p.lo/power
  estimate <- A + s*rt(n.sims,df)
  significant <- abs(estimate) > s*z
  exaggeration <- mean(abs(estimate)[significant])/A
  return(list(power=power, typeS=typeS, exaggeration=exaggeration))
}

retrodesign(.11, SE)

## $power
##     97.5% 
## 0.3761173 
## 
## $typeS
##        97.5% 
## 0.0004168559 
## 
## $exaggeration
## [1] 1.616742

The bad news here is that if the meta-analysis posterior mean for the article is the true effect, then power is about 34%.

If we want about 80% power, we need to halve that SE:

retrodesign(.11, 0.072/2)

## $power
## [1] 0.8633715
## 
## $typeS
## [1] 3.063012e-07
## 
## $exaggeration
## [1] 1.079705

I think that implies that you need four times as many subjects than you had.

Acknowledgements

Thanks to Mante Nieuwland and Stephen Politzer-Ahles for patiently explaining a lot of details to me, and for sharing their data and code.

Appendix: Stan code

Here is the meta-analysis Stan code with the Gelman-style reparameterization. y_tilde is not used above, but yields posterior predictive values, and could be used profitably in future replication attempts.

data {
    int<lower=0> n; //number of studies
    real y[n]; // estimated effect
    real<lower=0> s[n]; // SEs of effect
}
parameters{
    real mu; //population mean
    real<lower=0> tau;  // between study variability    
    vector[n] eta; //study level errors
} 
transformed parameters {
    vector[n] theta;   // study effects
    theta = mu + tau*eta;
}
model {
eta ~ normal(0,1);
y ~ normal(theta,s);
}
generated quantities{
vector[n] y_tilde;
for(i in 1:n){
  y_tilde[i] = normal_rng(theta[i],s[i]); 
  }
}

Monday, March 27, 2017

Fitting Bayesian Linear Mixed Models for continuous and binary data using Stan: A quick tutorial

I want to give a quick tutorial on fitting Linear Mixed Models (hierarchical models) with a full variance-covariance matrix for random effects (what Barr et al 2013 call a maximal model) using Stan.

For a longer version of this tutorial, see: Sorensen, Hohenstein, Vasishth, 2016.

Prerequisites: You need to have R and preferably RStudio installed; RStudio is optional. You need to have rstan installed. See here. I am also assuming you have fit lmer models like these before:

lmer(log(rt) ~ 1+RCType+dist+int+(1+RCType+dist+int|subj) + (1+RCType+dist+int|item), dat)

If you don't know what the above code means, first read chapter 4 of my lecture notes.

The code and data format needed to fit LMMs in Stan

The data

I assume you have a 2x2 repeated measures design with some continuous measure like reading time (rt) data and want to do a main effects and interaction contrast coding. Let's say your main effects are RCType and dist, and the interaction is coded as int. All these contrast codings are $\pm 1$. If you don't know what contrast coding is, see these notes and read section 4.3 (although it's best to read the whole chapter). I am using an excerpt of an example data-set from Husain et al. 2014.

"subj" "item" "rt""RCType" "dist" "int"
1       14    438  -1        -1      1
1       16    531   1        -1     -1
1       15    422   1         1      1
1       18   1000  -1        -1      1 
...

Assume that these data are stored in R as a data-frame with name rDat.

The Stan code

Copy the following Stan code into a text file and save it as the file matrixModel.stan. For continuous data like reading times or EEG, you never need to touch this file again. You will only ever specify the design matrix X and the structure of the data. The rest is all taken care of.

data {
  int N;               //no trials
  int P;               //no fixefs
  int J;               //no subjects
  int n_u;             //no subj ranefs
  int K;               //no items
  int n_w;             //no item ranefs
  int subj[N]; //subject indicator
  int item[N]; //item indicator
  row_vector[P] X[N];           //fixef design matrix
  row_vector[n_u] Z_u[N];       //subj ranef design matrix
  row_vector[n_w] Z_w[N];       //item ranef design matrix
  vector[N] rt;                 //reading time
}

parameters {
  vector[P] beta;               //fixef coefs
  cholesky_factor_corr[n_u] L_u;  //cholesky factor of subj ranef corr matrix
  cholesky_factor_corr[n_w] L_w;  //cholesky factor of item ranef corr matrix
  vector[n_u] sigma_u; //subj ranef std
  vector[n_w] sigma_w; //item ranef std
  real sigma_e;        //residual std
  vector[n_u] z_u[J];           //spherical subj ranef
  vector[n_w] z_w[K];           //spherical item ranef
}

transformed parameters {
  vector[n_u] u[J];             //subj ranefs
  vector[n_w] w[K];             //item ranefs
  {
    matrix[n_u,n_u] Sigma_u;    //subj ranef cov matrix
    matrix[n_w,n_w] Sigma_w;    //item ranef cov matrix
    Sigma_u = diag_pre_multiply(sigma_u,L_u);
    Sigma_w = diag_pre_multiply(sigma_w,L_w);
    for(j in 1:J)
      u[j] = Sigma_u * z_u[j];
    for(k in 1:K)
      w[k] = Sigma_w * z_w[k];
  }
}

model {
  //priors
  beta ~ cauchy(0,2.5);
  sigma_e ~ cauchy(0,2.5);
  sigma_u ~ cauchy(0,2.5);
  sigma_w ~ cauchy(0,2.5);
  L_u ~ lkj_corr_cholesky(2.0);
  L_w ~ lkj_corr_cholesky(2.0);
  for (j in 1:J)
    z_u[j] ~ normal(0,1);
  for (k in 1:K)
    z_w[k] ~ normal(0,1);
  //likelihood
  for (i in 1:N)
    rt[i] ~ lognormal(X[i] * beta + Z_u[i] * u[subj[i]] + Z_w[i] * w[item[i]], sigma_e);
}

Define the design matrix

Since we want to test the main effects coded as the columns RCType, dist, and int, our design matrix will look like this:

# Make design matrix
X <- unname(model.matrix(~ 1 + RCType + dist + int, rDat))
attr(X, "assign") <- NULL

Prepare data for Stan

Stan expects the data in a list form, not as a data frame (unlike lmer). So we set it up as follows:

# Make Stan data
stanDat <- list(N = nrow(X),
P = ncol(X),
n_u = ncol(X),
n_w = ncol(X),
X = X,
Z_u = X,
Z_w = X,
J = nlevels(rDat$subj),
K = nlevels(rDat$item),
rt = rDat$rt,
subj = as.integer(rDat$subj),
item = as.integer(rDat$item))

Load library rstan and fit Stan model

library(rstan) 
rstan_options(auto_write = TRUE)
options(mc.cores = parallel::detectCores())

# Fit the model
matrixFit <- stan(file = "matrixModel.stan", data = stanDat,
iter = 2000, chains = 4)

Examine posteriors

print(matrixFit)

This print output is overly verbose. I wrote a simple function to get the essential information quickly.

stan_results<-function(m,params=paramnames){
  m_extr<-extract(m,pars=paramnames)
  par_names<-names(m_extr)
  means<-lapply(m_extr,mean)
  quantiles<-lapply(m_extr,
                    function(x)quantile(x,probs=c(0.025,0.975)))
  means<-data.frame(means)
  quants<-data.frame(quantiles)
  summry<-t(rbind(means,quants))
  colnames(summry)<-c("mean","lower","upper")
  summry
}

For example, if I want to see only the posteriors of the four beta parameters, I can write:

stan_results(matrixFit, params=c("beta[1]","beta[2]","beta[3]","beta[4]"))

For more details, such as interpreting the results and computing things like Bayes Factors, see Nicenboim and Vasishth 2016.

FAQ: What if I don't want to fit a lognormal?

In the Stan code above, I assume a lognormal function for the reading times:

 rt[i] ~ lognormal(X[i] * beta + Z_u[i] * u[subj[i]] + Z_w[i] * w[item[i]], sigma_e);

If this upsets you deeply and you want to use a normal distribution (and in fact, for EEG data this makes sense), go right ahead and change the lognormal to normal:

 rt[i] ~ normal(X[i] * beta + Z_u[i] * u[subj[i]] + Z_w[i] * w[item[i]], sigma_e);

FAQ: What if I my dependent measure is binary (0,1) responses?

Use this Stan code instead of the one shown above. Here, I assume that you have a column called response in the data, which has 0,1 values. These are the trial level binary responses.

data {
  int N;               //no trials
  int P;               //no fixefs
  int J;               //no subjects
  int n_u;             //no subj ranefs
  int K;               //no items
  int n_w;             //no item ranefs
  int subj[N]; //subject indicator
  int item[N]; //item indicator
  row_vector[P] X[N];           //fixef design matrix
  row_vector[n_u] Z_u[N];       //subj ranef design matrix
  row_vector[n_w] Z_w[N];       //item ranef design matrix
  int response[N];                 //response
}

parameters {
  vector[P] beta;               //fixef coefs
  cholesky_factor_corr[n_u] L_u;  //cholesky factor of subj ranef corr matrix
  cholesky_factor_corr[n_w] L_w;  //cholesky factor of item ranef corr matrix
  vector[n_u] sigma_u; //subj ranef std
  vector[n_w] sigma_w; //item ranef std
  vector[n_u] z_u[J];           //spherical subj ranef
  vector[n_w] z_w[K];           //spherical item ranef
}

transformed parameters {
  vector[n_u] u[J];             //subj ranefs
  vector[n_w] w[K];             //item ranefs
  {
    matrix[n_u,n_u] Sigma_u;    //subj ranef cov matrix
    matrix[n_w,n_w] Sigma_w;    //item ranef cov matrix
    Sigma_u = diag_pre_multiply(sigma_u,L_u);
    Sigma_w = diag_pre_multiply(sigma_w,L_w);
    for(j in 1:J)
      u[j] = Sigma_u * z_u[j];
    for(k in 1:K)
      w[k] = Sigma_w * z_w[k];
  }
}

model {
  //priors
  beta ~ cauchy(0,2.5);
  sigma_u ~ cauchy(0,2.5);
  sigma_w ~ cauchy(0,2.5);
  L_u ~ lkj_corr_cholesky(2.0);
  L_w ~ lkj_corr_cholesky(2.0);
  for (j in 1:J)
    z_u[j] ~ normal(0,1);
  for (k in 1:K)
    z_w[k] ~ normal(0,1);
  //likelihood
  for (i in 1:N)
    response[i] ~ bernoulli_logit(X[i] * beta + Z_u[i] * u[subj[i]] + Z_w[i] * w[item[i]]);
}

For reproducible example code

See here.

Tuesday, November 22, 2016

Statistics textbooks written by non-statisticians: Generally a Bad Idea

A methodologist from psychology called Russell Warne writes on twitter:

Paragraph in my upcoming introductory #statistics textbook that I wouldn't have written if I hadn't followed @R__INDEX on Twitter. pic.twitter.com/Bup9qcFzJg
— Russell Warne (@Russwarne) November 22, 2016

It is of course correct that you can usually increase power by increasing sample size.

But a lot of the other stuff in this paragraph is wrong or misleading. If this is an introductory statistics textbook for psychologists, it will cause a lot of harm: a whole new generation of psychologists will emerge with an incorrect understanding of the frequentist point of view to inference. Here are some comments on his text:

"When a study has low statistical power, it raises the possibility that any rejection of the null hypothesis is just a fluke, i.e., a Type I error": A fluke rejection of a null hypothesis, isn't that the definition of Type I error? So, low power raises the possibility that a rejection is a Type I error? There is so much wrong here. First of all, Type I error is associated with hypothetical replications of the experiment. It is a statement about the long run repetitions of the procedure, not about the specific experiment you did. You cannot talk of a particular result being a "Type I error" or not. Second, the above sentence says that if power is low, you could end up with an incorrect rejection; the implication is that if power is high, I am unlikely to end up with an incorrect rejection! What the author should have said is that when power is low, by definition the probability of correctly detecting the effect is low. Punkt. Furthermore, the much more alarming consequence of low power is Type S and M errors (see my next point below). I'm surprised that psychologists haven't picked this up yet.
When power is low, "...the study should most likely not have been able to reject the null hypothesis at all. So, when it does reject the null hypothesis, it does not seem like a reliable result": I think that one word that should be banned in psych* is "reliable", it gives people the illusion that they found out something that is true. It is never going to be the case that you can say with 100% certainty that you found out the truth. If reliable means "true, reflecting reality correctly", you will *never* know that you have a reliable result. The trouble with using words like reliable is when people read a sentence like the one above and then try to construct the meaning of the sentence by considering the converse situation, when power is high. The implication is that when power is high, the rejection of the result is "reliable". I have lost count of how many times I have heard psych* people telling me that a result is "reliable", implying that they found something that is true of nature. Even when power is high, you still have a Type I error of whatever your $\alpha$ is. So any individual result you get could be an incorrect rejection; it doesn't matter what you think the power is. A further important point is: how do you *know* what power you have? Due to Type S and M errors, you are most likely doing your calculation based on previous, underpowered studies. You are therefore going to be getting gross overestimates of power anyway. Power is a function, and typically, you will have a lot of uncertainty associated with your estimate of the plausible values of power under different assumptions (after all, you don't *know* what the true effect is, right? If you know already, why are you doing the study?). Giving a student the false security of saying "oh, I have high power, so my result is reliable" is pretty irresponsible and is part of the reason why we keep messing up again and again and again.

Tuesday, August 02, 2016

Two papers, with code: Statistical Methods for Linguistic Research (Parts 1 and 2)

Here are two papers that may be useful for researchers in psychology, linguistics, and cognitive science:

Shravan Vasishth and Bruno Nicenboim. Statistical methods for linguistic research: Foundational Ideas - Part I. Language and Linguistics Compass, 2016. In Press.
PDF: http://bit.ly/VasNicPart1
Code: http://bit.ly/VasNicPart1Code
Bruno Nicenboim and Shravan Vasishth. Statistical methods for linguistics research: Foundational Ideas - Part II. Language and Linguistics Compass, 2016. In Press.
PDF: http://bit.ly/NicVasPart2
Code: http://bit.ly/NicVasPart2Code

Wednesday, April 27, 2016

A simple proof that the p-value distribution is uniform when the null hypothesis is true

[Scroll to graphic below if math doesn't render for you]

Thanks to Mark Andrews for correcting some crucial typos (I hope I got it right this time!).

Thanks also to Andrew Gelman for pointing out that the proof below holds only when the null hypothesis is a point null $H_0: \mu = 0$, and the dependent measure is continuous, such as reading time in milliseconds, or EEG responses.

Someone asked this question in my linear modeling class: why is it that the p-value has a uniform distribution when the null hypothesis is true? The proof is remarkably simple (and is called the probability integral transform).

First, notice that when a random variable Z comes from a $Uniform(0,1)$ distribution, then the probability that $Z$ is less than (or equal to) some value $z$ is exactly $z$: $P(Z\leq z)=z$.

Next, we prove the following proposition:

Proposition:
If a random variable $Z=F(T)$, then $Z \sim Uniform(0,1)$.

Note here that the p-value is a random variable, call it $Z$. The p-value is computed by calculating the probability of seeing a t-statistic or something more extreme under the null hypothesis. The t-statistic comes from a random variable $T$ that is a transformation of the random variable $\bar{X}$: $T=(\bar{X}-\mu)/(\sigma/\sqrt{n})$. This random variable T has a CDF $F$.

So, if we can prove the above proposition, we have shown that the p-value's distribution under the null hypothesis is $Uniform(0,1)$.

Proof:

Let $Z=F(T)$.

$P(Z\leq z) = P(F(T)\leq z) = P(F^{-1} F(T) \leq F^{-1}(z) )
= P(T \leq F^{-1} (z) )
= F(F^{-1}(z))= z$.

Since $P(Z\leq z)=z$, Z is uniformly distributed, that is, Uniform(0,1).

A screengrab in case the above doesn't render:

Sunday, January 17, 2016

Automating R exercises and exams using the exams package

It's a pain to design statistics exercises each semester, and because students from previous share old exercises with the new incoming students, it's hard to design simple exercises that students haven't already seen the answers to. On top of that, some students try to cheat during the exam by looking over the shoulder of their neighbors. Homework exercises almost always involve collaboration even if you prohibit it.

It turns out that you can automate the generation of fixed-format exercises (with different numerical answers being required each time). You can also randomly select questions from a question bank you create yourself. And you can even create a unique question paper for each student in an exam, to make cheating between neighbors essentially impossible (even if they copy the correct answer to question 2 from a neighbor, they end up answering the wrong question on their own paper).

All this magic is made possible by the exams package in R. The documentation is of course comprehensive and there is a journal article explaining everything:

Achim Zeileis, Nikolaus Umlauf, Friedrich Leisch (2014). Flexible Generation of E-Learning Exams in R: Moodle Quizzes, OLAT Assessments, and Beyond. Journal of Statistical Software 58(1), 1-36. URL http://www.jstatsoft.org/v58/i01/.

I also use this package to deliver auto-graded exercises to students over datacamp.com. See here for the course I teach, and here for the datacamp exercises.

Here is a quick example to get people started on designing their own customized, automated exams. In my example below, there are several files you need.

1. The template files for your exam (what your exam or homework sheet will look like), and the solutions file. I provide two example files: test.tex and solutiontest.tex

2. The exercises or exam questions themselves: I provide two as examples. The first file is called pnorm1.Rnw. It's an Sweave file, and it contains the code for generating a random problem and for generating its solution. The code should be self-explanatory. The second file is called sesamplesize1multiplechoice.Rnw and has a multiple choice question.

3. The exam generating R code file: The code is commented and self-explanatory. It will generate the exercises, randomize the order of presentation (if there are two or more exercises), and generate a solutions file. The output will be a single or multiple exam papers (depending on how many versions you wanted generated), and the solutions file(s). Notice the cool thing that even in my example, with only one question, the two versions of the exams have different numbers, so two people cannot collaborate and consult each other and just write down one answer. Each student could in principle be given a unique set of exercises, although it would be a lot of work to grade it if you do it manually.

Here is the exam generating code:

Save from the gists given above (a) the test.tex and solutiontest.tex files, (b) the Rnw files containing the exercise (pnorm1.Rnw, and sesamplesize1multiplechoice.Rnw), and (c) the exam generating code (ExampleExamCode.R). Put all of these into your working directory, say ExampleExam. Then run the R code, and be amazed.

If something is broken in my example, please let me know.

Shuffling questions: If you want to reorder the questions in each run of the R code, just change myexamlist to sample(myexamlist) in the call below that appears in the file ExampleExamCode.R:

sol <- exams(sample(myexamlist), n = num.versions, 
             dir = odir, template = c("test", "solutiontest"),
             nsamp=1,
             header = list(ID = getID, Date = Sys.Date()))

Wednesday, January 06, 2016

My MSc thesis: A meta-analysis of relative clause processing in Mandarin Chinese using bias modelling

Here is my MSc thesis, which was submitted to the University of Sheffield in September 2015.

The pdf is here.

Title: A Meta-analysis of Relative Clause Processing in Mandarin Chinese using Bias Modelling

Abstract
The reading difficulty associated with Chinese relative clauses presents an important empirical problem for psycholinguistic research on sentence comprehension processes. Some studies show that object relatives are easier to process than subject relatives, while others show the opposite pattern. If Chinese has an object relative advantage, this has important implications for theories of reading comprehension. In order to clarify the facts about Chinese, we carried out a Bayesian random-effects meta-analysis using 15 published studies; this analysis showed that the posterior probability of a subject relative advantage is approximately $0.77$ (mean $16$, 95% credible intervals $-29$ and $61$ ms). Because the studies had significant biases, it is possible that they may have confounded the results. Bias modelling is a potentially important tool in such situations because it uses expert opinion to incorporate the biases in the model. As a proof of concept, we first identified biases in five of the fifteen studies, and elicited priors on these using the SHELF framework. Then we fitted a random-effects meta-analysis, including priors on biases. This analysis showed a stronger posterior probability ($0.96$) of a subject relative advantage compared to the standard random-effects meta-analysis (mean $33$, credible intervals $-4$ and $71$).

Saturday, December 19, 2015

Best statistics-related comment ever from a reviewer

This is the most interesting comment I have ever received from CUNY conference reviewing. It nicely illustrates the state of our understanding of statistical theory in psycholinguistics:

"I had no idea how many subjects each study used. Were just one or two people
used? ... Generally, I wasn't given enough data to determine my confidence in the provided t-values (which depends on the degrees of freedom involved)."

Thursday, August 27, 2015

Five thirty-eight provides a brand new definition of the p-value

The Five-Thirty Eight blog provides a brand new definition of the p-value:
http://fivethirtyeight.com/datalab/psychology-is-starting-to-deal-with-its-replication-problem/?ex_cid=538twitter

"A p-value is simply the probability of getting a result at least as extreme as the one you saw if your hypothesis is false."

I thought this blog was run by Nate Silver, a statistician?

Observed vs True Statistical Power, and the power inflation index

People (including me) routinely estimate statistical power for future studies using a pilot study's data or a previously published study's data (or perhaps using the predictions from a computational model, such as Engelmann et al 2015).

Indeed, the author of the Replicability Index has been using observed power to determine the replicability of journal articles. His observed power estimates are HUGE (in the range of 0.75) and seem totally implausible to me, given the fact that I can hardly ever replicate my studies.

This got me thinking: Gelman and Carlin have shown that when power is low, Type M error will be high. That is, the observed effects will tend to be highly exaggerated. The issue with Type M error is easy to visualize.

Suppose that a particular study has standard error 46, and sample size 37; this implies that standard deviation is $46\times \sqrt{37}= 279$. These are representative numbers from psycholinguistic studies. Suppose also that we know that the true effect (the absolute value, say on the millisecond scale for a reading study---thanks to Fred Hasselman) is D=15. Then, we can compute Type S and Type M errors for replications of this particular study by repeatedly sampling from the true distribution.

We can visualize the exaggerated effects under low power as follows (see below): On the x-axis you see the effect magnitudes, and on the y-axis is power. The red line is the power line of 0.20, which based on my own attempts at replicating my own studies (and mostly failing), I estimate to be an upper bound of the power of experiments in psycholinguistics (this is an upper bound, I think a more common value will be closer to 0.05). All those dots below the red line are exaggerated estimates in a low power situation, and if you were to use any of those points to estimate observed power, you would get a wildly optimistic power estimate which has no bearing with reality.

What does this fact about Type M error imply for Replicability Index's calculations? It implies that if power is in fact very low, and if journals are publishing larger-than-true effect sizes (and we know that they have an incentive to do so, because editors and reviewers mistakenly think that lower p-values, i.e., bigger absolute t-values, give stronger evidence for the specific alternative hypothesis of interest), then Replicability Index is possibly hugely overestimating power, and therefore hugely overestimating replicability of results.

I came up with the idea of framing this overestimation in terms of Type M error by defining something called a power inflation index. Here is how it works. For different "true" power levels, we repeatedly sample data, and compute observed power each time. Then, for each "true" power level, we can compute the ratio of the observed power to the true power in each case. The mean of this ratio is the power inflation index, and the 95% confidence interval around it gives us an indication (sorry Richard Morey! I know I am abusing the meaning of CI here and treating it like a credible interval!) of how badly we could overestimate power from a small sample study.

Here is the code for simulating and visualizing the power inflation index:

And here is the visualization:

What we see here is that if true power is as low as 0.05 (and we can never know that it is not since we never know the true effect size!), then using observed power can lead to gross overestimates by a factor of approximately 10! So, if Replicability Index reports an observed power of 0.75, what he might actually be looking at is an inflated estimate where true power is 0.08.

In summary, we can never know true power, and if we are estimating it using observed power conditional on true power being extremely low, we are likely to hugely overestimate power.

One way to test my claim is to actually try to replicate the studies that Replicability Index predicts has high replicability. My prediction is that his estimates will be wild overestimates and most studies will not replicate.

Postscript

A further thing that worries me about Replicability Index is his sloppy definitions of statistical terms. Here is how he defines power:

"Power is defined as the long-run probability of obtaining significant results in a series of exact replication studies. For example, 50% power means that a set of 100 studies is expected to produce 50 significant results and 50 non-significant results."

[Thanks to Karthik Durvasula for correcting my statement below!]
By not defining power of a test of a null hypothesis $H_0: \mu=k$, as the probability of rejecting the null hypothesis (as a function of different alternative $\mu$ such that $\mu\neq k$) when it is false, what this definition literally means is that if I sample from any distribution, including one where $\mu=0$, the probability of obtaining a significant result under repeated sampling is the power. Which of course is completely false.

Post-Post Script

Replicability Index points out in a tweet that his post-hoc power estimation corrects for inflation. But post-hoc power corrected for inflation requires knowledge of the true power, which is what we are trying to get at in the first place. How do you deflate "observed" power when you don't know what the true power is? Maybe I am missing something.

Monday, August 17, 2015

Some reflections on teaching frequentist statistics at ESSLLI 2015

I spent the last two weeks teaching frequentist and Bayesian statistics at the European Summer School in Logic, Language, and Information (ESSLLI) in Barcelona, at the beautiful and centrally located Pompeu Fabra University. The course web page for the first week is here, and the web page for the second course is here.

All materials for the first week are available on github, see here.

The frequentist course went well, but the Bayesian course was a bit unsatisfactory; perhaps my greater experience in teaching the frequentist stuff played a role (I have only taught Bayes for three years). I've been writing and rewriting my slides and notes for frequentist methods since 2002, and it is only now that I can present the basic ideas in five 90 minute lectures; with Bayes, the presentation is more involved and I need to plan more carefully, interspersing on-the-spot exercises to solidify ideas. I will comment on the Bayesian Data Analysis course in a subsequent post.

The first week (five 90 minute lectures) covered the basic concepts in frequentist methods. The audience was amazing; I wish I always had students like these in my classes. They were attentive, and anticipated each subsequent development. This was the typical ESSLLI crowd, and this is why teaching at ESSLLI is so satisfying. There were also several senior scientists in the class, so hopefully they will go back and correct the misunderstandings among their students about what all this Null Hypothesis Significance Testing stuff gives you (short answer: it answers *a* question very well, but it's the wrong question, nothing that is relevant to your research question).

I won't try to summarize my course, because the web page is online and you can also do exercises on datacamp to check your understanding of statistics (see here). You get immediate feedback on your attempts.

Stepping away from the technical details, I tried to make three broad points:

First, I spent a lot of time trying to clarify what a p-value is and isn't, focusing particularly on the issue of Type S and Type M errors, which Gelman and Carlin have discussed in their excellent paper.

Here is the way that I visualized the problems of Type S and Type M errors:

What we see here is repeated samples from a Normal distribution with true mean 15 and a typical standard deviation seen in psycholinguistic studies (see slide 42 of my slides for lecture2). The horizontal red line marks the 20% power line; most psycholinguistic studies fall below that line in terms of power. The dramatic consequence of this low power is the hugely exaggerated effects (which tend to get published in major journals because they also have low p-values) and the remarkable proportion of cases where the sample mean is on the wrong side of the true value 15. So, you are roughly equally likely to get a significant effect with a sample mean smaller to much smaller than the true mean, or larger and much larger than the true mean. With low power, regardless of whether you get a significant result or not, if power is low, and it is in most studies I see in journals, you are just farting in a puddle.

It is worth repeating this: once one considers Type S and Type M errors, even statistically significant results become irrelevant, if power is low. It seems like these ideas are forever going to be beyond the comprehension of researchers in linguistics and psychology, who are trained to make binary decisions based on p-values, weirdly accepting the null if p is greater than 0.05 and, just as weirdly, accepting their favored alternative if p is less than 0.05. The p-value is a truly interesting animal; it seems that a recent survey of some 400 Spanish psychologists found that, despite their being active in the field for quite a few years on average, they had close to zero understanding of what a p-value gives you. Editors of top journals in psychology routinely favor lower p-values, because they mistakenly think this makes "the result" more convincing; "the result" is the favored alternative. So even seasoned psychologists (and I won't even get started with linguists, because we are much, much worse), with decades of experience behind them, often have no idea what the p-value actually tells you.

A remarkable misunderstanding regarding p-values is the claim that it tells you whether the effect was "by chance". Here is an example from Replication Index's blog:

"The Test of Insufficient Variance (TIVA) shows that the variance in z-scores is less than 1, but the probability of this event to occur by chance is 10%, Var(z) = .63, Chi-square (df = 11) = 17.43, p = .096."

Here is another example from a self-published manuscript by Daniel Ezra Johnson:

"If we perform a likelihood-ratio test, comparing the model with gender to a null model with no predictors, we get a p-value of 0.0035. This implies that it is very unlikely that the observed gender difference is due to chance."

One might think that the above examples are not peer-reviewed, and that peer review would catch such mistakes. But even people explaining p-values in publications are unable to understand that this is completely false. An example is Keith Johnson's textbook, Quantitative Methods in Linguistics, which repeatedly talks about "reliable effects" and effects which are and are not due to chance. It is no wonder that the poor psychologist/linguist thinks, ok, if the p-value is telling me the probability that the effect is due to chance, and if the p-value is low, then the effect is not due to chance and the effect must be true. The mistake here is that the p-value is telling you the probability of the result being "due to chance" conditional on the null hypothesis being true. It's better to explain the p-value as the probability of getting the statistic (e.g., t-value) or something more extreme, under the assumption that the null hypothesis is true. People seem to drop the italicized part and this starts to propagate the misunderstanding for future generations. To repeat, the p-value is a conditional probability, but most people interpret it as an unconditional probability because they drop the phrase "under the null hypothesis" and truncate the statement to be about effects being due to chance.

Another bizarre thing I have repeatedly seen is misinterpreting the p-value as Type I error. Type I error is fixed at a particular value (0.05) before you run the experiment, and is the probability of your incorrectly rejecting the null when it's true, under repeated sampling. The p-value is what you get from your single experiment and is the conditional probability of your getting the statistic you got or something more extreme, assuming that the null is true. Even this point is beyond comprehension for psychologists (and of course linguists). Here is a bunch of psychologists explaining in an article why a p=0.000 should not be reported as an exact value:

"p = 0.000. Even though this statistical expression, used in over 97,000 manuscripts according to Google Scholar, makes regular cameo appearances in our computer printouts, we should assiduously avoid inserting it in our Results sections. This expression implies erroneously that there is a zero probability that the investigators have committed a Type I error, that is, a false rejection of a true null hypothesis (Streiner, 2007). That conclusion is logically absurd, because unless one has examined essentially the entire population, there is always some chance of a Type I error, no matter how meager. Needless to say, the expression “p < 0.000” is even worse, as the probability of committing a Type I error cannot be less than zero. Authors whose computer printouts yield significance levels of p = 0.000 should instead express these levels out to a large number of decimal places, or at least indicate that the probability level is below a given value, such as p < 0.01 or p < 0.001."

The p-value is the probability of committing a Type I error, eh? It is truly embarrassing that people who are teaching this stuff have distorted the meaning of the p-value so drastically and just keep propagating the error. I should mention though that this paper I am citing appeared in Frontiers, which I am beginning to question as a worthwhile publication venue. Who did the peer review on this paper and why did they not catch this basic mistake?

Even Fisher (p. 16 of The Design of Experiments, Second Edition, 1937) didn't buy the p-value; he is advocating for replicability as the real decisive tool:

"It is usual and convenient for experimenters to take-5 per cent. as a standard level of significance, in the sense that they are prepared to ignore all results which fail to reach this standard, and, by this means, to eliminate from further discussion the greater part of the fluctuations which chance causes have introduced into their experimental results. No such selection can eliminate the whole of the possible effects of chance. coincidence, and if we accept this convenient convention, and agree that an event which would occur by chance only once in 70 trials is decidedly" significant," in the statistical sense, we thereby admit that no isolated experiment, however significant in itself, can suffice for the experimental demonstration of any natural phenomenon; for the "one chance in a million" will undoubtedly occur, with no less and no more than its appropriate frequency, however surprised we may be that it should occur to us. In order to assert that a natural phenomenon is experimentally demonstrable we need, not an isolated record, but a reliable method of procedure. In relation to the test of significance, we may say that a phenomenon is experimentally demonstrable when we know how to conduct an experiment which will rarely fail to give us a statistically significant result."

Second, I tried to clarify what a 95% confidence interval is and isn't. At least a couple of students had a hard time accepting that the 95% CI refers to the procedure and not that the true $\mu$ lies within one specific interval with probability 0.95, until I pointed out that $\mu$ is just a point value and doesn't have a probability distribution associated with it. Morey and Wagenmakers and Rouder et al have been shouting themselves hoarse about confidence intervals, and how many people don't understand them, also see this paper. Ironically, psychologists have responded to these complaints through various media, but even this response only showcases how psychologists have only a partial and misconstrued understanding of confidence intervals. I feel that part of the problem is that scientists hate to back off from a position they have taken, and so they tend to hunker down and defend defend defend their position. From the perspective of a statistician who understands both Bayes and frequentist positions, the conclusion would have to be that Morey et al are right, but for large sample sizes, the difference between a credible interval and a confidence interval (I mean the actual values that you get for the lower and upper bound) are negligible. You can see examples in our recently ArXiv'd paper.

Third, I tried to explain that there is a cultural difference between statisticians on the one hand and (most) psychologists and almost all psychologist, linguists, etc. on the other. For the latter group (with the obvious exception of people using Bayesian methods for data analysis), the whole point of fitting a statistical model is to do a hypothesis test, i.e., to get a p-value out of it. They simply do not care what the assumptions and internal moving parts of a t-test or a linear mixed model are. I know lots of users of lmer who are focused on one and only one thing: is my effect significant? I have repeatedly seen experienced experimenters in linguistics simply ignoring the independence assumption of data points when doing a paired t-test; people often do paired t-tests on unaggregated data, with multiple rows of data points for each subject (for example). This leads to spurious significance effects, which they happily and unquestioningly accept because that was the whole goal of the exercise. I show some examples in my lecture2 slides (slide 70).

It's not just linguists, you can see the consequences of ignoring the independence assumption in this reanalysis of the infamous study on how future tense marking in language supposedly influences economic decisions. Once the dependencies between languages are taken into account, the conclusion that Chen originally drew doesn't really hold up much:

" When applying the strictest tests for relatedness, and when data is not aggregated across individuals, the correlation is not significant."

Similarly, Amy Cuddy et al's study on how power posing increases testosterone levels also got published only because the p value just scraped in below 0.05 at 0.045 or so. You can see in their figure 3 reporting the testosterone increase

that their confidence intervals are huge (this is probably why they report standard errors, it wouldn't look so impressive if they had reported CIs). All they needed to show to make their point was to get the p-value below 0.05. The practical relevance of a 12 picogram/ml increase in testosterone is left unaddressed. Another recent example from Psychological Science, which seems to publish studies that might attract attention in the popular press, is this study on how ovulating women wear red. This study is a follow up on the notorious Psychological Science study by Beall and Tracy. In my opinion, the Beall and Tracy study reports a bogus result because they claim that women wear red or pink when ovulating, but when I reanalyzed their data I found that the effect was driven by pink alone. Here is my GLM fit for red or pink, red only and pink only. You can see that the "statistically significant" effect is driven entirely by pink, making the title of their paper (Women Are More Likely to Wear Red or Pink at Peak Fertility) true only if you allow the exclusive-or reading of the disjunction:

The new study by Eisenbruch et al reports a statistically significant effect on this red-pink issue, but now it's only about red:

"A mixed regression model confirmed that, within subjects, the odds of wearing red were higher during the estimated fertile window than on other cycle days, b = 0.93, p = .040, odds ratio (OR) = 2.53, 95% confidence interval (CI) = [1.04, 6.14]. The 2.53 odds ratio indicates that the odds of wearing a red top were about 2.5 times higher inside the fertile window, but there was a wide confidence interval."

To their credit, they note that their confidence interval is huge, and essentially includes 1. But since the p-value is below 0.05 this result is considered evidence for the "red hypothesis". It may well be that women who are ovulating wear red; I have no idea and have no stake in the issue. Certainly, I am not about to start looking at women wearing red as potential sexual partners (quite independent from the fact that my wife would probably kill me if I did). But it would be nice if people would try to do high powered studies, and report a replication in the same study they publish. Luckily nobody will die if these studies report mistaken results, but the same mistakes are happening in medicine, where people will die as a result of incorrect conclusions being drawn.

All these examples show why the focus on p-values is so damaging for answering research questions.

Not surprisingly, for the statistician, the main point of fitting a model (even in a confirmatory factorial analysis) is not to derive a p-value from it; in fact, for many statisticians the p-value may not even rise to consciousness. The main point of fitting a model is to define a process which describes, in the most economical way possible, how the data were generated. If the data don't allow you to estimate some of the parameters, then, for a statistician it is completely reasonable to back off to defining a simpler generative process.

This is what Gelman and Hill also explain in their 2007 book (italics mine). Note that they are talking about fitting Bayesian linear mixed models (in which parameters like correlations can be backed off to 0 by using appropriate priors; see the Stan code using lkj priors here), not frequentist models like lmer. Also, Gelman would never, ever compute a p-value.

Gelman and Hill 2007, p. 549:

"Don’t get hung up on whether a coefficient “should” vary by group. Just allow it to vary in the model, and then, if the estimated scale of variation is small (as with the varying slopes for the radon model in Section 13.1), maybe you can ignore it if that would be more convenient.
Practical concerns sometimes limit the feasible complexity of a model—for example, we might fit a varying-intercept model first, then allow slopes to vary, then add group-level predictors, and so forth. Generally, however, it is only the difficulties of fitting and, especially, understanding the models that keeps us from adding even more complexity, more varying coefficients, and more interactions."

For the statistician, simplicity of expression and understandability of the model (in the Gelman and Hill sense of being able to derive sensible posterior (predictive) distributions) are of paramount importance. For the psychologist and linguist (and other areas), what matters is whether the result is statistically significant. The more vigorously you can reject the null, the more excited you get, and the language provided for this ("highly significant") also gives the illusion that we have found out something important (=significant).

This seems to be a fundamental disconnect between statisticians, and end-users who just want their p-value. A further source of the disconnect is that linguists and psychologists etc. look for cookbook methods, what a statistician I know once derisively called a "one and done" approach. This leads to blind data fitting: load data, run single line of code, publish result. No question ever arises about whether the model even makes sense. In a way this is understandable; it would be great if there was a one-shot solution to fitting, e.g., linear mixed models. It would simplify life so much, and one wouldn't need to spend years studying statistics before one can do science. However, the same scientists who balk at studying statistics will willingly spend time studying their field of expertise. No mainstream (by which I mean Chomskyan) syntactician is going to ever use commercial software to print out his syntactic derivation without knowing anything about the syntactic theory. Yet this is exactly what these same people expect from statistical software, to get an answer without having any understanding of the underlying statistical machinery.

The bottom line that I tried to convey in my course was: forget about the p-value (except to soothe the reviewer and editor and to build your career), focus on doing high powered studies, check model assumptions, fit appropriate models, replicate your findings, and publish against your own pet theories. Understanding what all these words means requires some study, and we should not shy away from making that effort.

PS I am open to being corrected---like everyone else, I am prone to making mistakes. Please post corrections, but with evidence, in the comments section. I moderate the comments because some people post spam there, but I will allow all non-spam comments.

PPS The teaching evaluation for this course just came in from ESSLLI; here it is. I believe 5.0 is a perfect score.

Statistical methods for linguistic research: Foundational Ideas (Vasishth)

Lecturer1	4.9

Did the course content correspond to what was proposed?	4.9
Course notes	4.6
Session attendance	4.4

(19 respondents)

Very good course.
The lecturer was simply great. He has made hard concepts really easy to understand. He also has been able to keep the class interested. A real pity to miss the last lecture !!
The only reason that this wasn't the best statistics course is that I had a great lecturer at my university on this. Very entertaining, informative, and correct lecture, I can't think of anything the lecturer could do better.
Informative, deep and witty. Simply awesome.
Professor Shravan Vasishth was hands down the best lecturer at ESSLLI 2015. I envy the people who actually get to learn from him for a whole semester instead of just a week or two. The course was challenging for someone with not much background in statistics, but Professor Vasishth provided a bunch of additional material. He's the best!
Great course, very detailed explanations and many visual examples of some statistical phenomena. However, it would be better to include more information on regression models, especially with effects (model quality evaluation, etc) and more examples of researches from linguistic field.
It was an extremely useful course presented by one of the best lecturers I've ever met. Thank you!
Amazing course. Who would have thought that statistics could be so interesting and engaging? Kudos to lecturer Shravan Vasishth who managed to condense so much information into only 5 sessions, who managed to filter out only the most relevant things that will be applicable and indeed used by everyone who attendet the course and who managed to show the usefulness of the material. A great lecturer who never went on until everything was cleared up and made even the most daunting of statistical concepts seem surmountable. The only thing I'm sorry for is not having the opportunity to take his regular, semester-long statistics course so I can enjoy a more in depth look at the material and let everything settle properly. Five stars, would take again.
Absolutely great!!

Search