We can outline three general principles guiding model-based inference in science:
Simplicity and parsimony (Occam’s razor). Model selection is a classic bias-variance trade-off.
Multiple working hypotheses (Chamberlin, 1890). At any point in time there should be several hypotheses (models) under consideration, but the number of alternatives should be kept small.
Strength of evidence. Providing quantitative information to judge the strength of evidence is central to science (e.g., see Royall, 1997).
The model selection approach consists of establishing a set of \(R\) relevant models, then, based on the data, ranking these models (and attributing weights to them), choosing the best model among the set, and finally making inferences from this best model. To rank models, we need tools that account for the basic principles that make a model a good model.
In 1951, Kullback and Leibler published a now-famous paper that quantified the meaning of information as related to Fisher’s concept of sufficient statistics (Kullback & Leibler, 1951). They developed the Kullback-Leibler divergence (or K-L information), which measures the information that is lost when approximating one distribution with another. K-L information can also be conceptualized as the distance between full reality and a specific model (Burnham & Anderson, 2004).
Two decades later, Akaike (1973) showed that this distance can be estimated by finding the parameter values that maximize the probability of the data given the model. He used this relationship to derive a criterion known as the Akaike information criterion (AIC):
\[\text{AIC} = -2\log(\mathcal{L}(\hat{\theta}|\text{data}))+2K\]
where \(\theta\) is the set of model parameters, \(\mathcal{L}(\hat{\theta}|\text{data})\) is the likelihood of the candidate model given the data when evaluated at the maximum likelihood estimate of \(\theta\), and \(K\) is the number of estimated parameters in the candidate model. The first component, \(-2\log(\mathcal{L}(\hat{\theta}|\text{data}))\), is known as the deviance of the candidate model.
The AIC thus accounts for the goodness-of-fit of a model (i.e., the strength of evidence for this model) but penalizes it for having too many parameters, that is, for not being parsimonious. The smaller the AIC, the better.
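As a quick illustration, the AIC can be computed by hand in R from the maximized log-likelihood of a fitted model. The following minimal sketch (using the mtcars data, which also appears later in this document) checks the hand computation against the built-in stats::AIC() function.

data(mtcars)
mod <- lm(mpg ~ cyl, mtcars)
dev <- -2 * as.numeric(logLik(mod))   # deviance: -2 log-likelihood at the MLE
k <- attr(logLik(mod), "df")          # number of estimated parameters (here 3)
dev + 2 * k                           # AIC = deviance + 2K
AIC(mod)                              # same value from the built-in function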
It is important to acknowledge that when \(n/K\) is smaller than about 40, the “small sample AIC” (which includes a second-order bias correction), called AICc, should be used instead (Burnham & Anderson, 2004).
\[\text{AICc} = \text{AIC}+\dfrac{2K(K+1)}{n-K-1}\]
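This correction is easy to apply by hand in R. The sketch below defines a hypothetical helper, aicc(), which is not part of any package, with \(n\) taken as the number of observations used to fit the model.

aicc <- function(model) {
    k <- attr(logLik(model), "df")   # number of estimated parameters
    n <- nobs(model)                 # sample size
    AIC(model) + (2 * k * (k + 1)) / (n - k - 1)
}
aicc(lm(mpg ~ cyl, mtcars))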
The Bayesian Information Criterion (BIC) was introduced by Schwarz (1978) as a competitor to the AIC. Schwarz derived the BIC to serve as an asymptotic approximation to a transformation of the Bayesian posterior probability of a candidate model. The computation of BIC is based on the empirical log-likelihood and does not require the specification of priors.
\[\text{BIC}=-2\log(\mathcal{L}(\hat{\theta}|\text{data}))+K\log(n)\]
We can see that AIC and BIC share the same goodness-of-fit term, but the penalty term of BIC is more stringent than that of AIC (for \(n > 8\), \(K\log(n)\) exceeds \(2K\)). Consequently, BIC tends to favor smaller models than AIC.
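The BIC can be computed by hand in the same way, by swapping the penalty term. The sketch below checks the hand computation against the built-in stats::BIC() and shows that, with the 32 observations of mtcars, the per-parameter penalty log(n) indeed exceeds 2.

mod <- lm(mpg ~ cyl, mtcars)
k <- attr(logLik(mod), "df")
n <- nobs(mod)                               # n = 32
-2 * as.numeric(logLik(mod)) + k * log(n)    # hand-computed BIC
BIC(mod)                                     # same value from the built-in function
log(n) > 2                                   # the BIC penalty per parameter exceeds that of AIC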
The individual AIC, AICc, or BIC values are not interpretable in absolute terms, as they contain arbitrary constants and are much affected by sample size. It is therefore essential to rescale these criteria. Usually, this is done by subtracting from the IC of each model the minimum IC in the set:
\[\Delta_{IC}=IC_{i}-IC_{min}\]
where \(IC_{min}\) is the minimum of the \(R\) different \(IC_{i}\) values. This transformation forces the best model to have \(\Delta_{IC}=0\), while the rest of the models have positive values.
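For instance, taking the three BIC values reported by ictab() later in this document, the rescaling is a one-liner in R:

ic <- c(m1 = 173.7036, m2 = 173.0086, m3 = 166.4781)   # BIC values of three candidate models
delta <- ic - min(ic)                                  # the best model (m3) gets delta = 0
delta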
For the AIC, the simple transformation \(\exp(-\Delta_{i}/2)\), for \(i = 1, 2, \ldots, R\), provides the likelihood of the model given the data. This is a likelihood function over the model set in the sense that \(\mathcal{L}(\theta|\text{data}, g_i)\) is the likelihood over the parameter space (for model \(g_i\)) of the parameter \(\theta\), given the data (\(x\)) and the model (\(g_i\)).
It is convenient to normalize the model likelihoods such that they sum to 1 and treat them as probabilities. Hence, we use:
\[w_{i}=\dfrac{\exp(-\Delta_{i}/2)}{\sum_{r=1}^{R}\exp(-\Delta_{r}/2)}.\]
The \(w_i\), called Akaike weights, are useful as the weight of evidence in favor of model \(g_i(\cdot |\theta)\) as being the actual Kullback-Leibler best model in the set. The ratios \(w_i/w_j\) are identical to the original likelihood ratios, \(\mathcal{L}(g_i|data)/\mathcal{L}(g_j|data)\), and so they are invariant to the model set, but the \(w_i\) values depend on the full model set because they sum to 1.
Evidence can be judged by the relative likelihood of model pairs as \(\mathcal{L}(g_i|x)/\mathcal{L}(g_j|x)\) or, equivalently, the ratio of Akaike weights \(w_i/w_j\). Such ratios are called evidence ratios and represent the evidence about fitted models as to which is better in a Kullback-Leibler information sense.
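Continuing the small sketch above, the weights and evidence ratios follow directly from the rescaled \(\Delta\) values; the result matches the ic_wt column returned by ictab() below.

w <- exp(-delta / 2) / sum(exp(-delta / 2))   # normalized relative likelihoods
round(w, 4)                                   # model weights (sum to 1)
w["m3"] / w["m2"]                             # evidence ratio of the best model over the runner-up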
The term sequential analysis simply refers to the process by which one collects data until a predefined level of evidence is reached. Implementations of this idea include the Sequential Probability Ratio Test (Wald & Wolfowitz, 1948) and group sequential designs in the NHST framework (for an overview, see Lakens, 2014). More recently, Schönbrodt et al. (2017) and Schönbrodt & Wagenmakers (2017) have (re)introduced sequential testing procedures using Bayes factors, in order to avoid the pitfalls of sequential testing in the NHST framework.
ESTER aims to address at least two different questions. On the one hand, it allows exploring the characteristics of evidence ratios computed from Akaike weights when these are used in a sequential way. To this end, the simER function allows exploring these characteristics for evidence ratios computed from the AIC (or AICc) and the BIC. On the other hand, ESTER is also designed to run sequential analyses on observed data, which is implemented in the seqER function. In sum, two main questions are addressed.
Simulation. Given an effect size and a sample size, what should I reasonably expect?
Observed data. When should I stop recruiting participants?
The simER function runs a simulated study in which we compare two independent groups, for a given effect size (cohensd) and maximum sample size (nmax). The nmin argument specifies from which participant we want to start the sequential testing (we usually recommend avoiding nmin < 10). We can define a boundary at which we would like to stop the sequential testing, as well as the number of simulations we want to evaluate (nsims).
library(ESTER)
sim <- simER(cohensd = 0.8, nmin = 20, nmax = 100, boundary = 10,
             nsims = 100, ic = bic, cores = 2, verbose = FALSE)
plot(sim, log = TRUE, hist = TRUE)
Under the hood, simER uses the ictab function, which allows comparing a set of models fitted to the same data. This function returns either the AIC or the BIC of each model, along with the number of parameters \(k\) and the Akaike weights.
data(mtcars)
mod1 <- lm(mpg ~ cyl, mtcars)
mod2 <- lm(mpg ~ cyl + disp, mtcars)
mod3 <- lm(mpg ~ cyl * disp, mtcars)
mods <- list(m1 = mod1, m2 = mod2, m3 = mod3)
ictab(mods, bic)
## modnames ic k delta_ic ic_wt
## 3 m3 166.4781 5 0.0000 0.9388
## 2 m2 173.0086 4 6.5305 0.0359
## 1 m1 173.7036 3 7.2256 0.0253
Note that in these two functions, the information criterion has to be passed in lowercase, as aic and bic refer to functions from ESTER.
On the other hand, ESTER can be used to do sequential testing on your own data. You can study the evolution of sequential ERs using the seqER function.
data(mtcars)
mod1 <- lm(mpg ~ cyl, mtcars)
mod2 <- lm(mpg ~ cyl + disp, mtcars)
plot(seqER(ic = bic, mod1, mod2, nmin = 10))
In addition, seqER allows you to study the behavior of sequential ERs computed on your own data, along with sequential ERs computed on permutation samples. This feature might be useful to study to what extent the evolution of the evidence ratios observed on the original sample depends on the order of the observations.
plot(seqER(ic = bic, mod1, mod2, nmin = 10, nsims = 10))
When the data involve repeated measures (and thus multiple rows per subject), it is important to help ESTER figure out how the data are structured, by indicating in which column the subject or observation identifiers are stored. For instance, below we use data from lme4 and indicate that the subject identifier is stored in the "Subject" column of the dataframe, by passing it to the id argument of seqER.
library(lme4)
## Loading required package: Matrix
data(sleepstudy)
mod1 <- lmer(Reaction ~ Days + (1|Subject), sleepstudy)
mod2 <- lmer(Reaction ~ Days + I(Days^2) + (1|Subject), sleepstudy)
plot(seqER(ic = bic, mod1, mod2, nmin = 10, id = "Subject", nsims = 10))
If no information is passed to the id argument, seqER will assume that there is only one observation (i.e., one row) per subject.
Akaike, H. (1973). Information theory and an extension of the maximum likelihood principle. In B. N. Petrov & F. Caski (Eds.), Proceedings of the Second International Symposium on Information Theory (pp. 267-281). Budapest: Akademiai Kiado.
Burnham, K. P., & Anderson, D. R. (2002). Model Selection and Multimodel Inference: A Practical Information-Theoretic Approach (2nd ed.). New York: Springer-Verlag.
Burnham, K. P., & Anderson, D. R. (2004). Multimodel Inference: Understanding AIC and BIC in Model Selection. Sociological Methods & Research, 33(2), 261–304.
Burnham, K. P., Anderson, D. R., & Huyvaert, K. P. (2011). AIC model selection and multimodel inference in behavioral ecology: Some background, observations, and comparisons. Behavioral Ecology and Sociobiology, 65(1), 23–35.
Lakens, D. (2014). Performing high-powered studies efficiently with sequential analyses. European Journal of Social Psychology, 44, 701–710.
Royall, R. (1997). Statistical Evidence: A Likelihood Paradigm. Chapman & Hall/CRC.
Schönbrodt, F. D., & Wagenmakers, E.-J. (2017). Bayes Factor Design Analysis: Planning for compelling evidence. Psychonomic Bulletin & Review.
Schönbrodt, F. D., Wagenmakers, E.-J., Zehetleitner, M., & Perugini, M. (2017). Sequential hypothesis testing with Bayes factors: Efficiently testing mean differences. Psychological Methods, 22, 322–339.
Schwarz, G. (1978). Estimating the dimension of a model. Annals of Statistics, 6, 461-464.
Wagenmakers, E.-J., & Farrell, S. (2004). AIC model selection using Akaike weights. Psychonomic Bulletin & Review, 11(1), 192–196.
Wald, A., & Wolfowitz, J. (1948). Optimum character of the sequential probability ratio test. The Annals of Mathematical Statistics, 19, 326– 339.