How many principal components?

Better than scree, even in beautiful settings.

Very often we find ourselves handling large matrices and, for multiple different reasons, may want or need to reduce the dimensionality of the data by retaining the most relevant principal components. A question that arises then is how many such principal components should be retained? That is, what are the principal components that provide information that can be distinguished from variability that is indistinguishable from random noise?

One simple heuristic method is the scree plot (Cattell, 1966): one computes the singular values of the matrix, plots them in descending order, and visually looks for some kind of “elbow” (or the starting point of the “scree”) to be used as a threshold. Those singular vectors (principal components) that have corresponding singular values larger than that threshold are retained, otherwise discarded. Unfortunately, the scree test is too subjective, and worse, many realistic problems can produce matrices that have multiple such “elbows”.

Many other methods to help select the number of principal components to retain exist. Some, for example, are based on information theory, or on probabilistic formulations, or on hard thresholds related the size of the matrix or the (explained) variance from the data. In this article, the little known Wachter method (Wachter, 1976) is presented.

Let \mathbf{X} be a n \times p matrix of random values with zero mean and variance \sigma^2. The density of the bulk spectrum of the singular values \lambda_k, k \in \left\{1,\ldots,K\right\}, K = \mathrm{min}(n,p), of \mathbf{X}'\mathbf{X}/n is:

p(\lambda;y,\sigma^2) = \begin{cases}\sqrt{(b-\lambda)(\lambda-a)}/(2\pi \lambda y \sigma^2), & \text{if } a \leqslant \lambda \leqslant b\\0, & \text{otherwise}\end{cases}

and a point mass 1-1/y at the origin if y > 1, where a = \sigma^2(1-\sqrt{y})^2, b = \sigma^2(1+\sqrt{y})^2, and n/p \rightarrow y \in (0,\infty) (Bai, 1999). This is the Marčenko and Pastur (1967) distribution. The cumulative distribution function (cdf) can be obtained by integration, i.e., P(\lambda;y,\sigma^2) = \int_a^\lambda p(x;y,\sigma^2)\mathrm{d}x. MATLAB/Octave functions that compute the cdf and its inverse are here: mpcdf.m and mpinv.m.

If we know the probability of finding a singular value \lambda or larger for a random matrix, we have the means to judge whether that \lambda is sufficiently large to be considered interesting enough for its corresponding singular vector (principal component) to be retained. Simply computing the p-value, however, is not enough, because the distribution makes no reference to the position k of the sorted singular values. A different procedure needs be considered.

The core idea of the Wachter method is to use a QQ plot of the observed singular values versus the quantiles obtained from the inverse cdf of the Marčenko–Pastur distribution, and use eventual deviations from the identity line to help finding the threshold that separates the “good” from the “unnecessary” principal components. That is, plot the eigenvalues \lambda_{K+1-k} versus the quantiles P^{-1}((k-\frac{1}{2})/K;y,\sigma^2). Deviations from the identity line are evidence for excess variance not expected from random variables alone.

The function wachter.m computes the singular values from a given observed matrix \mathbf{X}, the expected singular values should \mathbf{X} be a random matrix, as well as the p-values for each singular value in either case. The observed and expected singular values can be used to built a QQ-plot. The p-values can be used for a comparison, e.g., via the logarithm of their ratio. For example:

% For reproducibility, reset the random number generator.
% Use "rand" instead of "rng" to ensure compatibility with old versions.

% Simulate data. See Gavish and Donoho (2014) for this example.
X = diag([1.7 2.5 zeros(1,98)]) + randn(100)*sqrt(.01); 

% Compute the expected and observed singular values,
% as well as the respective cumulative probabilities (p-values).
% See the help of wachter.m for syntax.
[Exp,Obs,Pexp,Pobs] = wachter(X,[],false,true);

% Log of the ratio between observed and expected p-values.
% Large values are evidence for "good" components.
P_ratio = -log10(Pobs./Pexp);

% Plot the spectrum.
title('Singular values');
xlabel('index (k)');

% Construct the QQ-plot.
title('QQ plot');

The result is:

In this example, two of the singular values can be considered “good”, and should be retained. The others can, according to this criterion, be dropped.

The function can take into account nuisance variables (provided in the 2nd argument), including intercept (for mean-centering), allows normalization of columns to unit variance, and can operate on p-values (upper tail from the density function) or on the cumulative probabilities. See the help text inside the function for usage information.


The image at the top is of the Drei Zinnen, in the Italian Alps during the Summer, in which a steep slope scattered with small stones (scree) is visible. Photo by Heinz Melion from Pixabay.

Simplifying Freedman-Lane

Doing a permutation test with the general linear model (GLM) in the presence of nuisance variables can be challenging. Let the model be:

\mathbf{Y} = \mathbf{X}\boldsymbol{\beta} + \mathbf{Z}\boldsymbol{\gamma} + \boldsymbol{\epsilon}

where \mathbf{Y} is a matrix of observed variables, \mathbf{X} is a matrix of predictors of interest, \mathbf{Z} is a matrix of covariates (of no interest), and \boldsymbol{\epsilon} is a matrix of the same size as \mathbf{Y} with the residuals.

Because the interest is in testing the relationship between \mathbf{Y} and \mathbf{X}, in principle it would be these that would need be permuted, but doing so also breaks the relationship with \mathbf{Z}, which would be undesirable. Over the years, many methods have been proposed. A review can be found in Winkler et al. (2014); other previous work include the papers by Anderson and Legendre (1999) and Anderson and Robinson (2001).

One of these various methods is the one published in Freedman and Lane (1983), which consists of permuting data that has been residualised with respect to the covariates, then estimated covariate effects added back, then the full model fitted again. The procedure can be performed through the following steps:

  1. Regress \mathbf{Y} against the full model that contains both the effects of interest and the nuisance variables, i.e., \mathbf{Y} = \mathbf{X}\boldsymbol{\beta} + \mathbf{Z}\boldsymbol{\gamma} + \boldsymbol{\epsilon}. Use the estimated parameters \boldsymbol{\hat{\beta}} to compute the statistic of interest, and call this statistic T_{0}.
  2. Regress \mathbf{Y} against a reduced model that contains only the covariates, i.e. \mathbf{Y} = \mathbf{Z}\boldsymbol{\gamma} + \boldsymbol{\epsilon}_{\mathbf{Z}}, obtaining estimated parameters \boldsymbol{\hat{\gamma}} and estimated residuals \boldsymbol{\hat{\epsilon}}_{\mathbf{Z}}.
  3. Compute a set of permuted data \mathbf{Y}^{*}_{j}. This is done by pre-multiplying the residuals from the reduced model produced in the previous step, \boldsymbol{\hat{\epsilon}}_{\mathbf{Z}}, by a permutation matrix, \mathbf{P}{j}, then adding back the estimated nuisance effects, i.e. \mathbf{Y}^{*}_{j} = \mathbf{P}_{j}\boldsymbol{\hat{\epsilon}}_{\mathbf{Z}} + \mathbf{Z}\boldsymbol{\hat{\gamma}}.
  4. Regress the permuted data \mathbf{Y}^{*}_{j} against the full model, i.e. \mathbf{Y}^{*}_{j} = \mathbf{X}\boldsymbol{\beta} + \mathbf{Z}\boldsymbol{\gamma} + \boldsymbol{\epsilon}
  5. Use the estimated \boldsymbol{\hat{\beta}}^{*}_{j} to compute the statistic of interest. Call this statistic T^{*}_{j}.
  6. Repeat the Steps 2-4 many times to build the reference distribution of T^{*} under the null hypothesis of no association between \mathbf{Y} and \mathbf{X}.
  7. Count how many times T^{*}_{j} was found to be equal to or larger than T_{0}, and divide the count by the number of permutations; the result is the p-value.

Steps 1-4 can be written concisely as:

\left(\mathbf{P}_{j}\mathbf{R}_{\mathbf{Z}}+\mathbf{H}_{\mathbf{Z}}\right) \mathbf{Y} = \mathbf{X}\boldsymbol{\beta} + \mathbf{Z}\boldsymbol{\gamma}+\boldsymbol{\epsilon}

where \mathbf{P}_{j} is a permutation matrix (for the j-th permutation, \mathbf{H}_{\mathbf{Z}}=\mathbf{Z}\mathbf{Z}^{+} is the hat matrix due to the covariates, and \mathbf{R}_{\mathbf{Z}} = \mathbf{I} - \mathbf{H}_{\mathbf{Z}} is the residual forming matrix; the superscript symbol ^{+} represents a matrix pseudo-inverse.

In page 385 of Winkler et al. (2014), my colleagues and I state that:

[…] add the nuisance variables back in Step 3 is not strictly necessary, and the model can be expressed simply as \mathbf{P}_{j}\mathbf{R}_{\mathbf{Z}}\mathbf{Y}=\mathbf{X}\boldsymbol{\beta}+\mathbf{Z}\boldsymbol{\gamma}+\boldsymbol{\epsilon}, implying that the permutations can actually be performed just by permuting the rows of the residual-forming matrix \mathbf{R}_{\mathbf{Z}}.

However, in the paper we do not offer any proof of this important result, that allows algorithmic acceleration. Here we remedy that. Let’s start with two brief lemmata:

Lemma 1: The product of a hat matrix and its corresponding residual-forming matrix is zero, that is, \mathbf{R}_{\mathbf{Z}}\mathbf{H}_{\mathbf{Z}} = \mathbf{H}_{\mathbf{Z}}\mathbf{R}_{\mathbf{Z}} = \mathbf{0}.

This is because \mathbf{R}_{\mathbf{Z}} = \mathbf{I} - \mathbf{H}_{\mathbf{Z}}, hence \mathbf{R}_{\mathbf{Z}}\mathbf{H}_{\mathbf{Z}} = \mathbf{R}_{\mathbf{Z}}(\mathbf{I} - \mathbf{R}_{\mathbf{Z}}) = \mathbf{R}_{\mathbf{Z}} - \mathbf{R}_{\mathbf{Z}}\mathbf{R}_{\mathbf{Z}} = \mathbf{R}_{\mathbf{Z}} - \mathbf{R}_{\mathbf{Z}} = \mathbf{0} since \mathbf{R}_{\mathbf{Z}} is idempotent.

Lemma 2 (Frisch–Waugh–Lovell theorem): Given a GLM expressed as \mathbf{Y} = \mathbf{X}\boldsymbol{\beta} + \mathbf{Z}\boldsymbol{\gamma} + \boldsymbol{\epsilon}, we can estimate \boldsymbol{\beta} from an equivalent GLM written as \mathbf{R}_{\mathbf{Z}}\mathbf{Y} = \mathbf{R}_{\mathbf{Z}}\mathbf{X}\boldsymbol{\beta} + \mathbf{R}_{\mathbf{Z}}\boldsymbol{\epsilon}.

To see why, remember that multiplying both sides of an equation by the same factor does not change it (least squares solutions may change; transformations using Lemma 2 below do not act on the fitted model). Let’s start from:

\mathbf{R}_{\mathbf{Z}}\mathbf{Y} = \mathbf{R}_{\mathbf{Z}}(\mathbf{X}\boldsymbol{\beta} + \mathbf{Z}\boldsymbol{\gamma} + \boldsymbol{\epsilon})

Then remove the parentheses:

\mathbf{R}_{\mathbf{Z}}\mathbf{Y} = \mathbf{R}_{\mathbf{Z}}\mathbf{X}\boldsymbol{\beta} + \mathbf{R}_{\mathbf{Z}}\mathbf{Z}\boldsymbol{\gamma} + \mathbf{R}_{\mathbf{Z}}\boldsymbol{\epsilon}

Since \mathbf{R}_{\mathbf{Z}} = \mathbf{I} - \mathbf{H}_{\mathbf{Z}}:

\mathbf{R}_{\mathbf{Z}}\mathbf{Y} = \mathbf{R}_{\mathbf{Z}}\mathbf{X}\boldsymbol{\beta} + (\mathbf{I}-\mathbf{H}_{\mathbf{Z}})\mathbf{Z}\boldsymbol{\gamma} + \mathbf{R}_{\mathbf{Z}}\boldsymbol{\epsilon}

and that \mathbf{H}_{\mathbf{Z}} = \mathbf{Z}\mathbf{Z}^{+}:

\mathbf{R}_{\mathbf{Z}}\mathbf{Y} = \mathbf{R}_{\mathbf{Z}}\mathbf{X}\boldsymbol{\beta} + (\mathbf{Z}-\mathbf{Z}\mathbf{Z}^{+}\mathbf{Z})\boldsymbol{\gamma} + \mathbf{R}_{\mathbf{Z}}\boldsymbol{\epsilon}

Since \mathbf{Z}^{+}\mathbf{Z}=\mathbf{I}:

\mathbf{R}_{\mathbf{Z}}\mathbf{Y} = \mathbf{R}_{\mathbf{Z}}\mathbf{X}\boldsymbol{\beta} + \mathbf{0}\boldsymbol{\gamma} + \mathbf{R}_{\mathbf{Z}}\boldsymbol{\epsilon}

\mathbf{R}_{\mathbf{Z}}\mathbf{Y} = \mathbf{R}_{\mathbf{Z}}\mathbf{X}\boldsymbol{\beta} + \boldsymbol{\epsilon}_{\mathbf{Z}}

where \boldsymbol{\epsilon}_{\mathbf{Z}}=  \mathbf{R}_{\mathbf{Z}}\boldsymbol{\epsilon}.

Main result

Now we are ready for the main result. The Freedman-Lane model is:

\left(\mathbf{P}_{j}\mathbf{R}_{\mathbf{Z}}+\mathbf{H}_{\mathbf{Z}}\right) \mathbf{Y} = \mathbf{X}\boldsymbol{\beta} + \mathbf{Z}\boldsymbol{\gamma}+\boldsymbol{\epsilon}

Per Lemma 2, it can be rewritten as:

\mathbf{R}_{\mathbf{Z}}\left(\mathbf{P}_{j}\mathbf{R}_{\mathbf{Z}}+\mathbf{H}_{\mathbf{Z}}\right) \mathbf{Y} = \mathbf{R}_{\mathbf{Z}}\mathbf{X}\boldsymbol{\beta}+\boldsymbol{\epsilon}_{\mathbf{Z}}

Dropping the parenthesis:

\mathbf{R}_{\mathbf{Z}}\mathbf{P}_{j}\mathbf{R}_{\mathbf{Z}}\mathbf{Y} +\mathbf{R}_{\mathbf{Z}}\mathbf{H}_{\mathbf{Z}} \mathbf{Y} = \mathbf{R}_{\mathbf{Z}}\mathbf{X}\boldsymbol{\beta}+\boldsymbol{\epsilon}_{\mathbf{Z}}

Per Lemma 1:

\mathbf{R}_{\mathbf{Z}}\mathbf{P}_{j}\mathbf{R}_{\mathbf{Z}}\mathbf{Y} + \mathbf{0}\mathbf{Y} = \mathbf{R}_{\mathbf{Z}}\mathbf{X}\boldsymbol{\beta}+\boldsymbol{\epsilon}_{\mathbf{Z}}

\mathbf{R}_{\mathbf{Z}}\mathbf{P}_{j}\mathbf{R}_{\mathbf{Z}}\mathbf{Y}= \mathbf{R}_{\mathbf{Z}}\mathbf{X}\boldsymbol{\beta}+\boldsymbol{\epsilon}_{\mathbf{Z}}

What is left has the same form as the result of Lemma 2. Thus, reversing it, we obtain the final result:

\mathbf{P}_{j}\mathbf{R}_{\mathbf{Z}} \mathbf{Y} = \mathbf{X}\boldsymbol{\beta} + \mathbf{Z}\boldsymbol{\gamma} +\boldsymbol{\epsilon}

Hence, the hat matrix \mathbf{H}_{\mathbf{Z}} cancels out, meaning that it is not necessary. Results are the same both ways.



The German tank problem and the novel coronavirus

We often think of statistics as a way to summarize large amounts of data. For example, we can collect data from thousands of subjects, and extract a single number that tells something about these subjects. The well known German tank problem shows that, in a certain way, statistics can also be used for the opposite: using incomplete data and a few reasonable assumptions (or real knowledge), statistics provides way to estimate information that offer a panoramic view of all the data. Historical problems are interesting on their own. Yet, it is not always that we see so clearly consequential historical events at the time they happen — like now.

In the Second World War, as in any other war, information could be more valuable than anything else. Intelligence reports (such as from spies) would feed the Allies with information about the industrial capacity of Nazi Germany, including details about things such as the number of tanks produced. This kind of information can have far reaching impact and not only determine the outcome of a battle, but also if a battle would even even happen or with what preparations, as the prospect of finding a militarily superior opponent is often a great deterrent.

Sometimes, German tanks, as the well known Panzer, could be captured and carefully inspected. Among the details noted were the serial number printed in various pieces, such as chassis, gearboxes, and the serial numbers of the moulds used to produce the wheels. With the serial number of even a single chassis, for example one can estimate the total number of tanks produced; knowing the serial number of a single wheel mould allows the estimation of the total number of moulds, and thus, how many wheels can be produced in a certain amount of time. But how?

If serial numbers are indeed serial, e.g., \{1, 2, 3, \ldots, N\}, growing uniformly and without gaps, and we see a tank that has a serial number S, then clearly at least S tanks must have been produced. But could we have a better guess?

Let’s start by reversing the problem: suppose we knew N. In that case, what would be the average value of the serial numbers of all N tanks? The average for uniformly distributed data like this would be M = \frac{1 + N}{2}, that is, the average of the first and last serial numbers.

Now, say we have only one sighting of a tank, and that has serial number S. Then our best guess for the average serial number is S itself, as we have no additional information. Thus, with M = S, our guess would be N = 2S - 1 (that is, reorganizing the terms of the previous equation for M). Note that, for one sighting, this formula guarantees that N is larger or equal than S, which makes sense: we cannot have an estimate for N that is smaller than the serial number S itself.

What if we had not just one, but multiple sightings? Call the number of sightings K. The mean is now M = \frac{S_1 + S_2 + \ldots + S_K}{K}, for ordered serial numbers \{S_1, S_2, \ldots, S_K\}. Clearly, we can’t use the same formula, because if M is much smaller than S_K (say, because we have seen many small serial numbers, but just a handful of larger ones), N could incorrectly be estimated as less than S_K, which makes no sense. At least S_K tanks must exist.

While incorrect for K > 1, the above formula gives invaluable insight: it shows that for such uniformly distributed data, approximately half of the tanks have serial number above M, the other half below M. Extending the idea, and still under the assumption that the serial numbers are uniform, we can conclude that the number of tanks below the lowest serial number S_1 (which is S_1 - 1) must be approximately the same as the (unknown) number of tanks above the highest serial number S_K. So, a next better estimate could be to use N = S_K + S_1 - 1.

We can still do better, though. Since we have K sightings, we can estimate what is the average interval between sightings, i.e., \frac{S_K}{K}. As it is based on all K sightings, this gives a better estimate of the spacing between the serial numbers than the single sighting S_1. The result can be added to S_K. The final estimate then becomes N = S_K + \frac{S_K}{K} - 1.

To make this concrete, say we saw tanks numbered \{47, 62, 104, 155, 159\}. Then our best guess would be N \approx 190.

At the end of the war, estimates obtained using the above method proved remarkably accurate, much more so than information provided by spies and other intelligence reports.

Let’s now see a similar example that is contemporary to us. Take the current pandemic caused by a novel coronavirus. The World Health Organization stated officially, in 14th January 2020, when there were 41 cases officially reported in China, that there was no evidence for human-to-human transmission. Yet, when the first 3 cases outside China were confirmed in 16th January 2020, epidemiologists at the Imperial College London were quick to find out that the WHO statement must have not been true. Rather, the real number of cases was likely well above 1700.

How did they make that estimate? The key insight was the realisation that only a small number of people in any major city travels internationally, particularly in such a short time span like that given by the time until the onset of symptoms for this kind of respiratory disease. If one can estimate prevalence among those who travelled, that would be a good approximation to the prevalence among those who live in the city, assuming that those who travel are an unbiased sample of the population.

Following this idea, we have: \frac{C_t}{N_t} \approx \frac{C_s}{N_s}, that is, the number of cases among those who travelled (C_t) divided by the total number of people who travelled (N_t) is expected to be approximately the same as the number of cases among those who stayed (C_s) divided by the total number of people who stayed (live) in the city (N_s).

The number of people served by the international airport of Wuhan is about 19 million (the size of the Wuhan metropolitan area), and the average daily number of outbound international passengers in previous years was 3301 for that time of the year (a figure publicly known, from IATA). Unfortunately, little was known outside China about the time taken between exposure to the virus and the onset of symptoms. The researchers then resorted to a proxy: the time known for the related severe respiratory disorder known as MERS, also caused by a coronavirus, which is about 10 days. Thus, we can estimate N_t= 3301 \times 10=33010 people travelling out, and N_s = 19,000,000 staying in the city. The number of known international cases was at the time C_t = 3. Hence:

C_s \approx \frac{3\times 19,000,000}{33010}\approx 1727 cases

So, using remarkably simple maths, simpler even than in our WWII German tank example, the scientists estimated that the number of actual cases in the city of Wuhan was likely far above the official figure of 41 cases. The researchers were careful to indicate that, should the probability of travelling be higher among those exposed, the number of actual cases could be smaller. The converse is true: should travellers be wealthier (thus less likely to be exposed to a possible zoonosis as initially reported), the number of actual cases could be higher.

Importantly, it is not at all likely that 1700 people would have contracted such a zoonosis from wild animals in a dense urban area like Wuhan, hence human-to-human transmission must had been occurring. Eventually the WHO confirmed human-to-human transmission on 19th January 2020. Two days later, Chinese authorities began locking down and sealing off Wuhan, thus putting into place a plan to curb the transmission.

To find out more about the original problem of the number of tanks, and also for other methods of estimation for the same problem, a good start is this article. Also invaluable, for various estimation problems related to the fast dissemination of the novel coronavirus, are all the reports by the epidemiology team at the Imperial College London, which can be found here.

Redundancy in canonical correlation analysis

In canonical correlation analysis (CCA; Hotelling, 1936), the absolute value of a correlation is not always that helpful. For example, large canonical correlations may arise simply due to a large number of variables being investigated using a relatively small sample size; high correlations may arise simply because there are too many opportunities for finding mixtures in both sides that are highly correlated one with another.

Motivated by this perceived difficulty in the interpretation of results, Stewart and Love (1968) proposed the computation of what has been termed a redundancy index. It works as follows.

Let \mathbf{Y}_{N \times P} and \mathbf{X}_{N \times Q} be two sets of variables over which CCA is computed. We find canonical coefficients \mathbf{A}_{P \times K} and \mathbf{B}_{Q \times K}, K=\min(P,Q) such that the canonical variables \mathbf{U}_{N \times K} and \mathbf{V}_{N \times K} have maximal, diagonal correlation structure; this diagonal contains the ordered canonical correlations r_k.

Now that CCA has been computed, we can find the correlations between the original variables and the canonical coefficients. Let \mathbf{\tilde{A}}_{P \times K}=\text{corr}(\mathbf{Y},\mathbf{U}) and \mathbf{\tilde{B}}_{Q \times K}=\text{corr}(\mathbf{X},\mathbf{V}) be such correlations, which are termed canonical loadings or structure coefficients. Now compute the mean square for each of the columns of \mathbf{\tilde{A}} and \mathbf{\tilde{B}}. These represent the variance extracted by the corresponding canonical variable. That is:

  • Variance extracted by canonical variable \mathbf{u}_{k}: \upsilon_k = \frac{1}{P}\sum_{p=1}^{P}\tilde{a}_{pk}^{2}
  • Variance extracted by canonical variable \mathbf{v}_{k}: \nu_k = \frac{1}{Q}\sum_{q=1}^{Q}\tilde{b}_{qk}^{2}

These quantities represent the mean variance extracted from the original variables by each of the canonical variables (in each side).

Compute now the proportion of variance of one canonical variable (say, \mathbf{u}_{k}) explained by the corresponding canonical variable in the other side (say, \mathbf{v}_{k}). This is given simply by r_k^2, the usual coefficient of determination.

The redundancy index for each canonical variable is then the product of \upsilon_k and r_k^2 for the left side of CCA, and the product of \nu_k and r_k^2 for the right side. That is, the index is not symmetric. It measures the proportion of variance in one of the two set of variables explained by the correlation between the k-th pair of canonical variables.

The sum of the redundancies for all K canonical variables in one side or another forms a global redundancy metric, which indicates the proportion of the variance in a given side explained by the variance in the other.

This global redundancy can be scaled to unity, such that the redundancies for each of the canonical variables in a give side can be interpreted as the proportion of total redundancy.

If you follow the original paper by Stewart and Love (1968), \upsilon_k and \nu_k are column III of Table 2, the redundancy of each canonical variable for each side corresponds to column IV, and the proportion of total redundancy is in column V.

Another reference on the same topic that is worth looking is Miller (1981). In it, the author discusses that redundancy is somewhere in between CCA itself (fully symmetric) and multiple regression (fully asymmetric).



  • 28.Jun.2020: A script that computes the redundancy indices is available here: redundancy.m (work with Thomas Wassenaar, University of Oxford).

A higher Octave for PALM

PALM — Pemutation Analysis of Linear Models — uses either MATLAB or Octave behind the scenes. It can be executed from within either of these environments, or from the shell, in which case either of these is invoked, depending on how PALM was configured.

For users who do not want or cannot spend thousands of dollars in MATLAB licenses, Octave comes for free, and offers quite much the same benefits. However, for Octave, some functionalities in PALM require version 4.4.1 or higher. However, stable Linux distributions such as Red Hat Enterprise Linux and related (such as CentOS and Scientific Linux) still offer only 3.8.2 in the official repositories at the time of this writing, leaving the user with the task of finding an unofficial package or compiling from the source. The latter task, however, can be daunting: Octave is notoriously difficult to compile, with a myriad of dependencies.

A much simpler approach is to use Flatpak or Snappy. These are systems for distribution of Linux applications. Snappy is sponsored by Canonical (that maintains Ubuntu), whereas Flatpak appears to be the preferred tool for Fedora and openSUSE. Using either system is quite simple, and here the focus is on Flatpak.

To have a working installation of Octave, all that needs be done is:

1) Make sure Flatpak is installed:

On a RHEL/CentOS system, use (as root):

yum install flatpak

For openSUSE, use (as root):

zypper install flatpak

For Ubuntu and other Debian-based systems:

sudo apt install flatpak

2) Add the Flathub repository:

flatpak remote-add --if-not-exists flathub

3) Install Octave:

flatpak install flathub org.octave.Octave

4) Run it!

flatpak run org.octave.Octave

Only the installation of Flatpak needs be done as root. Once it has been installed, repositories and applications (such as Octave, among many others) can be installed at the user level. These can also be installed and made available system-wide (if run as root).

Configuring PALM

From version alpha117 onwards, the executable file ‘palm’ (not to be confused with ‘palm.m’) will include a variable named “OCTAVEBIN”, which specifies how Octave should be called. Change it from the default:


so that it invokes the version installed with Flatpak:

OCTAVEBIN="/usr/bin/flatpak run org.octave.Octave"

After making the above edits, it should be possible to run PALM directly from the system shell using the version installed via Flatpak. Alternatively, it should also be possible to invoke Octave (as in step 4 above), then use the command “addpath” to specify the location of palm.m, and then call PALM from the Octave prompt.

Octave packages

Handling of packages when Octave is installed via Flatpak is the same as usual, that is, via the command ‘pkg’ run from within Octave. More details here.

Installing NiDB

NiDB is a light, powerful, and simple to use neuroimaging database. One of its main strengths is that it was developed using a stack of stable and proven technologies: Linux, Apache, MySQL/MariaDB, PHP, and Perl. None of these technologies are new, and the fact that they have been around for so many years means that there is a lot of documentation and literature available, as well as a myriad of libraries (for PHP and Perl) that can do virtually anything. Although both PHP and Perl have received some degree of criticism (not unreasonably), and in some cases are being replaced by tools such as Node.js and Python, the volume of information about them means it is easy to find solutions when problems appear.

This article covers installation steps for either CentOS or RHEL 7, but similar steps should work with other distributions since the overall strategy is the same. By separating apart each of the steps, as opposed to doing all the configuration and installation as a single script, it becomes easier to adapt to different systems, and to identify and correct problems that may arise due to local particularities. The steps below are derived from the scripts and, that are available in the NiDB repository, but here these will be ignored. Note that the instructions below are not “official”; for the latter, consult the NiDB documentation.  The intent of this article is to facilitate the process and mitigate some frustration you may feel if trying to do it all by yourself. Also, by looking at the installation steps, you should be able to have a broad overview of the pieces that constitute database.

1) Begin with a fresh install.

If installing CentOS from the minimal DVD, choose a “Minimal Install” and leave to add the desktop in the next step.

2) Update the system.

This is a good time to install the most recent updates and patches, and reboot if the updates include a new kernel:

yum update

3) Have a graphical mode.

While not strictly necessary, having a graphical interface for a web-based application will be handy. Install your favourite desktop, and a VNC server if you intend to manage the system remotely. For a lightweight desktop, consider MATE:

First add the EPEL repository. Depending on what you already have configured, use either:

yum install epel-release


yum install


yum groupinstall "MATE Desktop"
systemctl set-default
systemctl isolate
systemctl enable lightdm
systemctl start lightdm

For VNC, there are various options available. Consider, for example, TurboVNC.

4) Define some environment variables to be used later.

These will help when entering the commands later.

# Directory where NiDB will be installed

# Directory of the webpages and PHP files:

# Linux username under which NiDB will run:

# MySQL/MariaDB root password:

# MySQL/MariaDB username that will have access to the database, and associated password:

These variables are only used during the installation, and all the steps here are done as root. Considering clearing your shell history at the end, so as not to have your passwords stored there.

5) Create an account for the user under which NiDB will run.

This is the user that will run the processes related to the database. It is not necessary that this user has administrative privileges on the system, and from a security perspective, it is better if not.

useradd -m ${NIDBUSER}
passwd ${NIDBUSER} # choose a sensible password

6) Install and configure Apache.

Add the repository for a more recent version, then install:

yum install httpd

Configure it to run as the ${NIDBUSER} user:

sed -i "s/User apache/User ${NIDBUSER}/" /etc/httpd/conf/httpd.conf
sed -i "s/Group apache/Group ${NIDBUSER}/" /etc/httpd/conf/httpd.conf

Enable it at boot, and also start it now:

systemctl enable httpd.service
systemctl start httpd.service

Open the relevant ports in the firewall, then reload the rules:

firewall-cmd --permanent --add-port=80/tcp
firewall-cmd --permanent --add-port=443/tcp
firewall-cmd --reload

7) Install and configure MySQL/MariaDB.

For MariaDB 10.2, the repository can be added to /etc/yum.repos.d/ as:

echo "[mariadb]
name = MariaDB
baseurl =
gpgkey =
gpgcheck = 1" >> /etc/yum.repos.d/MariaDB.repo

For other versions or distributions, visit this address. Then do the actual installation:

yum install MariaDB-server MariaDB-client

Enable it at boot and start now too:

systemctl enable mariadb.service
systemctl start mariadb.service

Secure the MySQL/MariaDB installation:


Pay attention to the questions on the root password and set it here to what was chosen in the ${MYSQLROOTPASS} variable. Make sure your database is secure.

8) Install and configure PHP.

First add the repositories for PHP 7.2:

yum install
yum install yum-utils
yum-config-manager --enable remi-php72
yum install php php-mysql php-gd php-process php-pear php-mcrypt php-mbstring

Install some additional PHP packages:

pear install Mail
pear install Mail_Mime
pear install Net_SMTP

Edit the PHP configuration:

sed -i 's/^short_open_tag = .*/short_open_tag = On/g' /etc/php.ini
sed -i 's/^session.gc_maxlifetime = .*/session.gc_maxlifetime = 28800/g' /etc/php.ini
sed -i 's/^memory_limit = .*/memory_limit = 5000M/g' /etc/php.ini
sed -i 's/^upload_tmp_dir = .*/upload_tmp_dir = \/${NIDBROOT}\/uploadtmp/g' /etc/php.ini
sed -i 's/^upload_max_filesize = .*/upload_max_filesize = 5000M/g' /etc/php.ini
sed -i 's/^max_file_uploads = .*/max_file_uploads = 1000/g' /etc/php.ini
sed -i 's/^max_input_time = .*/max_input_time = 600/g' /etc/php.ini
sed -i 's/^max_execution_time = .*/max_execution_time = 600/g' /etc/php.ini
sed -i 's/^post_max_size = .*/post_max_size = 5000M/g' /etc/php.ini
sed -i 's/^display_errors = .*/display_errors = On/g' /etc/php.ini
sed -i 's/^error_reporting = .*/error_reporting = E_ALL \& \~E_DEPRECATED \& \~E_STRICT \& \~E_NOTICE/' /etc/php.ini

Also, edit /etc/php.ini to make sure your timezone is correct, for example:

date.timezone = America/New_York

For a list of time zones, see here. Finally:

chown -R ${NIDBUSER}:${NIDBUSER} /var/lib/php/session

9) Install Perl and other pieces.

These are all in the main repositories already added so you should be able to simply run:

yum install perl* cpan git gcc gcc-c++ java ImageMagick vim libpng12 libmng wget iptraf* pv

Install also various Perl packages from CPAN. The first time you run cpan, various configuration questions will be asked; it is safe to accept default answers for all:

cpan File::Path
cpan Net::SMTP::TLS
cpan List::Util
cpan Date::Parse
cpan Image::ExifTool
cpan String::CRC32
cpan Date::Manip
cpan Sort::Naturally
cpan Digest::MD5
cpan Digest::MD5::File
cpan Statistics::Basic
cpan Email::Send::SMTP::Gmail
cpan Math::Derivative

Then put these into a place where NiDB can find them:

mkdir /usr/local/lib64/perl5
cp -rv /root/perl5/lib/perl5/* /usr/local/lib64/perl5/

10) (Optional) Disable SELinux.

Disabling SELinux is not strictly necessary provided that you ensure that all processes related to NiDB (webserver, database server), and all its files, belong to the same user, nidb, and that file access policies are set correctly. In any case, you may feel this is useful so as to stop receiving too many irrelevant warnings during the installation. You can enable it again later.

sed -i 's/^SELINUX=.*/SELINUX=disabled/g' /etc/selinux/config
setenforce 0

Note that enabling or disabling SELinux requires a reboot to take effect (it is not sufficient to simply restart a daemon; there is not one in fact).

11) Install FSL.

FSL functions are used by various internal scripts. After the installation, make sure the environment variable FSLDIR exists and points to the correct location (typically /usr/local/fsl, but can be different if you installed it elsewhere). This variable is used below when defining the crontab jobs.


12) Download and install the NiDB files.

The official Github repository is However, I have made a fork with a couple of changes that better adapt to the system I am working with. You can probably go with either way.

mkdir -p ${NIDBROOT}
mkdir -p archive backup dicomincoming deleted download ftp incoming problem programs/lock programs/logs uploadtmp uploaded
git clone install
cd install
cp -Rv setup/Mysql* /usr/local/lib64/perl5/
cp -Rv programs/* ${NIDBROOT}/programs/
cp -Rv web/* ${WWWROOT}/

Edit the file ${WWWROOT}/functions.php and complete two pieces of configuration. Locate these two lines:

$cfg = LoadConfig();

In the first parenthesis, (), put what you get when you run:

echo "${NIDBROOT}/programs/nidb.cfg"

whereas in the second (), put what you get when you run:

timedatectl | grep "Time zone:" | awk '{print $3}'

For example, depending on your variables and time zone, you could edit to look like this:

$cfg = LoadConfig("/nidb/programs/nidb.cfg");

13) Set up the database.

First, create the nidb user in MySQL/MariaDB. This is the only user (other than root) that will be able to do anything in the database:


Now create the NiDB database proper:

cd ${NIDBROOT}/install/setup
mysql -uroot -p${MYSQLROOTPASS} nidb < nidb.sql
mysql -uroot -p${MYSQLROOTPASS} nidb < nidb-data.sql

14) Setup cron jobs.

These jobs will take care of various automated input/output tasks.

cat <<EOC > ~/tempcron.txt
* * * * * cd ${NIDBROOT}/programs; perl > /dev/null 2>&1
* * * * * cd ${NIDBROOT}/programs; perl > /dev/null 2>&1
* * * * * cd ${NIDBROOT}/programs; perl > /dev/null 2>&1
* * * * * cd ${NIDBROOT}/programs; perl > /dev/null 2>&1
* * * * * cd ${NIDBROOT}/programs; perl > /dev/null 2>&1
* * * * * cd ${NIDBROOT}/programs; perl > /dev/null 2>&1
* * * * * cd ${NIDBROOT}/programs; perl > /dev/null 2>&1
* * * * * FSLDIR=${FSLDIR}; PATH=${FSLDIR}/bin:${PATH}; . ${FSLDIR}/etc/fslconf/; export FSLDIR PATH; cd ${NIDBROOT}/programs; perl > /dev/null 2>&1
@hourly find ${NIDBROOT}/programs/logs/*.log -mtime +4 -exec rm {} \;
@daily  /usr/bin/mysqldump nidb -u root -p${MYSQLROOTPASS} | gzip > ${NIDBROOT}/backup/db-\$(date +%Y-%m-%d).sql.gz
@hourly /bin/find /tmp/* -mmin +120 -exec rm -rf {} \;
@daily  find ${NIDBROOT}/ftp/* -mtime +7 -exec rm -rf {} \
@daily  find ${NIDBROOT}/tmp/* -mtime +7 -exec rm -rf {} \;
crontab -u ${NIDBUSER} ~/tempcron.txt && rm ~/tempcron.txt

15) Edit the main configuration.

The main configuration file, ${NIDBROOT}/programs/nidb.cfg, should be edited to reflect your paths, usernames, and passwords. It is this file that will contain the admin password for accessing NiDB. Use the ${NIDBROOT}/programs/nidb.cfg.sample as an example.

Once you have logged in as admin, you can also edit this file again in the database interface, in the menu Admin -> NiDB Settings.

16) (Optional) Install a MySQL/MariaDB frontend.

It will likely increase your productivity when doing maintenance to have a friendly frontend for MySQL/MariaDB. Two popular choices are phpMyAdmin (web-based) and Oracle MySQL Workbench.

For phpMyAdmin:

mv phpMyAdmin-*-english ${WWWROOT}/phpMyAdmin
chmod 755 ${WWWROOT}
cp ${WWWROOT}/phpMyAdmin/ ${WWWROOT}/phpMyAdmin/

For MySQL Workbench, the repositories are listed at this link:

rpm -Uvh mysql57-community-release-el7-11.noarch.rpm
yum install mysql-workbench

However, at the time of this writing, the current version (6.3.10) crashes upon start. The solution is to downgrade:

yum install yum-plugin-versionlock
yum versionlock mysql-workbench-community-6.3.8-1.el7.*
yum install mysql-workbench-community

17) That’s it!

You should by now have a working installation of NiDB, accessible from your web-browser at http://localhost. There are additional pieces you may consider configuring, such as a listener in one of your server ports to bring DICOMs from the scanner in automatically as the images are collected, and also other changes to the database schema and web interface. Now you have a starting point.


For more information on NiDB, see these two papers:


How do we measure thickness, area, and volume of the cerebral cortex?

There are various ways one could estimate morphometric parameters of the cortex, such as its thickness, area, and volume. For example, it is possible to use voxelwise partial volume effects using volume-based representations of the brain, such as in voxel-based morphometry (VBM), in which estimates per voxel become available. Volume-based representations also allow for estimates of thickness, as suggested, for example, by Hutton et al. (2004), or from a surface representation of the cortex, in which it can be measured as a form of distance between the mesh that represents the pia mater (the pial surface) and the mesh that represents the interface between gray and white matter (the white surface).

Here we focus on the surface-based representation as that offers advantages over volume-based representations (Van Essen et al., 1998). Software such as FreeSurfer uses magnetic resonance images to initially construct the white surface. Once that surface has been produced, a copy of it can be offset outwards until tissue contrast in the magnetic resonance image is maximal, which indicates the location of the pial surface. This procedure ensures that both white and pial surfaces have the same topology, with each face and each vertex of the white surface having their matching pair in the pial. This convenience facilitates the computations indicated below.

Cortical surface area

For a triangular face ABC of the surface representation, with vertex coordinates \mathbf{a}=[x_A \; y_A \; z_A]', \mathbf{b}=[x_B \; y_B \; z_B]', and \mathbf{c}=[x_C \; y_C \; z_C]', the area is |\mathbf{u} \times \mathbf{v}|/2, where \mathbf{u} = \mathbf{a}-\mathbf{c}, \mathbf{v} = \mathbf{b}-\mathbf{c}, \times represents the cross product, and the bars |\;| represent the vector norm. Even though such area per face (i.e., facewise) can be used in subsequent steps, most software packages can only deal with values assigned to each vertex (i.e., vertexwise). Conversion from facewise to vertexwise is achieved by assigning to each vertex one-third of the sum of the areas of all faces that meet at that vertex (Winkler et al., 2012).

Cortical thickness

The thickness at each vertex is computed as the average of two distances (Fischl and Dale, 2000; Greve and Fischl, 2018): the first is the distance from each white surface vertex to their corresponding closest point on the pial surface (not necessarily at a pial vertex); the second is the distance from the corresponding pial vertex to the closest point on the white surface (again, not necessarily at a vertex). Other methods are possible, however, see table below (adapted from Lerch and Evans, 2005):

Method Reference
Distance solved using the Laplace’s equation. Jones et al. (2000)
Distance between corresponding vertices. MacDonald et al. (2000)
Distance to the nearest point in the other surface. MacDonald et al. (2000)
Distance to the nearest point in the other surface, computed for both surfaces, then averaged. Fischl and Dale (2000)
Distance along the normal. MacDonald et al. (2000)
Distance along the iteratively computed normal. Lerch and Evans (2005)

Cortical volume

Product method

If the area of either of these surfaces is known, or if the area of a mid-surface, i.e., the surface running half-distance between pial and white surfaces is known, an estimate of the volume can be obtained by multiplying, at each vertex, area by thickness. This procedure is still problematic in that it underestimates the volume of tissue that is external to the convexity of the surface, and overestimates volume that is internal to it; both cases are undesirable, and cannot be solved by merely resorting to using an intermediate surface as the mid-surface.

Figure 1: A diagram in two dimensions of the problem of measuring the cortical volume. If volume is computed using the product method (a), considerable amount of tissue is left unmeasured in the gyri, or measured repeatedly in sulci. The problem is minimised, but not solved, with the use of the mid-surface. In the analytic method (b), vertex coordinates are used to compute the volume of tissue between matching faces of white and pial surfaces, leaving no tissue under- or over-represented.

Analytic method

In Winkler et al. (2018) we propose a different approach to measure volume. Instead of computing the product of thickness and area, we note that any pair of matching faces can be used to define an irregular polyhedron, of which all six coordinates are known from the surface geometry. This polyhedron is an oblique truncated triangular pyramid, which can be perfectly divided into three irregular tetrahedra, which do not overlap, nor leave gaps.

Figure 2: A 3D diagram with the proposed solution to measure the cortical volume. In the surface representation, the cortex is limited internally by the white and externally by the pial surface (a). These two surfaces have matching vertices that can be used to delineate an oblique truncated triangular pyramid (b) and (c). The six vertices of this pyramid can be used to define three tetrahedra, the volumes of which are computed analytically (d).

From the coordinates of the vertices of these tetrahedra, their volumes can be computed analytically, then added together, viz.:

  1. For a given face A_w B_w C_w in the white surface, and its corresponding face A_p B_p C_p in the pial surface, define an oblique truncated triangular pyramid.
  2. Split this truncated pyramid into three tetrahedra, defined as:
    \begin{array}{lcllllll} T_1 &=& (&A_w,&B_w,&C_w,&A_p&)\\ T_2 &=& (&A_p,&B_p,&C_p,&B_w&)\\ T_3 &=& (&A_p,&C_p,&C_w,&B_w&) \end{array}
  3. For each such tetrahedra, let \mathbf{a}, \mathbf{b}, \mathbf{c} and \mathbf{d} represent its four vertices in terms of coordinates [x\;y\;z]'. Compute the volume as |\mathbf{u}\cdot(\mathbf{v} \times \mathbf{w})|/6, where \mathbf{u} = \mathbf{a}-\mathbf{d}, \mathbf{v} = \mathbf{b}-\mathbf{d}, \mathbf{w} = \mathbf{c}-\mathbf{d}, the symbol \times represents the cross product, \cdot represents the dot product, and the bars |\;| represent the vector norm.

No error other than what is intrinsic to the placement of these surfaces is introduced. The resulting volume can be assigned to each vertex in a similar way as conversion from facewise area to vertexwise area. The above method is the default in FreeSurfer 6.0.0.

Is volume at all useful?

Given that volume of the cortex is, ultimately, determined by area and thickness, and these are known to be influenced in general by different factors (Panizzon et al, 2009; Winkler et al, 2010), why would anyone bother in even measuring volume? The answer is that not all factors that can affect the cortex will affect exclusively thickness or area. For example, an infectious process, or the development of a tumor, have potential to affect both. Volume is a way to assess the effects of such non-specific factors on the cortex. However, even in that case there are better alternatives available, namely, the non-parametric combination (NPC) of thickness and area. This use of NPC will be discussed in a future post here in the blog.


The “Group” indicator in FSL

In FSL, when we create a design using the graphical interface in FEAT, or with the command Glm, we are given the opportunity to define, at the higher-level, the “Group” to which each observation belongs. When the design is saved, the information from this setting is stored in a text file named something as “design.grp”. This file, and thus the group setting, takes different roles depending whether the analysis is used in FEAT itself, in PALM, or in randomise.

What can be confusing sometimes is that, in all three cases, the “Group” indicator does not refer to experimental or observational group of any sort. Instead, it refers to variance groups (VG) in FEAT, to exchangeability blocks (EB) in randomise, and to either VG or EB in PALM, depending on whether the file is supplied with the options -vg or -eb.

In FEAT, unless there is reason to suspect (or assume) that the variances for different observations are not equal, all subjects should belong to group “1”. If variance groups are defined, then these are taken into account when the variances are estimated. This is only possible if the design matrix is “separable”, that is, it must be such that, if the observations are sorted by group, the design can be constructed by direct sum (i.e., block-diagonal concatenation) of the design matrices for each group separately. A design is not separable if any explanatory variable (EV) present in the model crosses the group borders (see figure below). Contrasts, however, can encompass variables that are defined across multiple VGs.

The variance groups not necessarily must match the experimental observational groups that may exist in the design (for example, in a comparison of patients and controls, the variance groups may be formed based on the sex of the subjects, or another discrete variable, as opposed to the diagnostic category). Moreover, the variance groups can be defined even if all variables in the model are continuous.

In randomise, the same “Group” setting can be supplied with the option -e design.grp, thus defining exchangeability blocks. Observations within a block can only be permuted with other observations within that same block. If the option --permuteBlocks is also supplied, then the EBs must be of the same size, and the blocks as a whole are instead then permuted. Randomise does not use the concept of variance group, and all observations are always members of the same single VG.

In PALM, using -eb design.grp has the same effect that -e design.grp has in randomise. Further using the option -whole is equivalent to using --permuteBlocks in randomise. It is also possible to use together -whole and -within, meaning that the blocks as a whole are shuffled, and further, observations within block are be shuffled. In PALM the file supplied with the option -eb can have multiple columns, indicating multi-level exchangeability blocks, which are useful in designs with more complex dependence between observations. Using -vg design.grp causes PALM to use the v– or G-statistic, which are replacements for the t– and F-statistics respectively for the cases of heterogeneous variances. Although VG and EB are not the same thing, and may not always match each other, the VGs can be defined from the EBs, as exchangeability implies that some observations must have same variance, otherwise permutations are not possible. The option -vg auto defines the variance groups from the EBs, even for quite complicated cases.

In both FEAT and PALM, defining VGs will only make a difference if such variance groups are not balanced, i.e., do not have the same number of observations, since heteroscedasticity (different variances) only matter in these cases. If the groups have the same size, all subjects can be allocated to a single VG (e.g., all “1”).

Why the maximum statistic?

In brain imaging, each voxel (or vertex, or face, or edge) constitutes a single statistical test. Because thousands such voxels are present in an image, a single experiment results in thousands of statistical tests being performed. The p-value is the probability of finding a test statistic at least as large as the one observed in a given voxel, provided that no effect is present. A p-value of 0.05 indicates that, if an experiment is repeated 20 times and there are no effects, on average one of these repetitions will be considered significant. If thousands of tests are performed, the chance of obtaining a spuriously significant result in at least one voxel increases: if there are 1000 voxels, and at the same test level \alpha = 0.05, we expect, on average, to find 50 significant tests, even in the absence of any effect. This is known as the multiple testing problem. A review of the topic for brain imaging provided in Nichols and Hayasaka (2003) [see references at the end].

To take the multiple testing problem into account, either the test level (\alpha), or the p-values can be adjusted, such that instead of controlling the error rate at each individual test, the error rate is controlled for the whole set (family) of tests. Controlling such family-wise error rate (FWER) ensures that the chance of finding a significant result anywhere in the image is expected to be within a certain predefined level. For example, if there are 1000 voxels, and the FWER-adjusted test level is 0.05, we expect that, if the experiment is repeated for all the voxels 20 times, then on average in one of these repetitions there will be an error somewhere in the image. The adjustment of the p-values or of the test level is done using the distribution of the maximum statistic, something that most readers of this blog are certainly well aware of, as that permeates most of the imaging literature since the early 1990s.

Have you ever wondered why? What is so special about the distribution of the maximum that makes it useful to correct the error rate when there are multiple tests?

Definitions first

Say we have a set of V voxels in an image. For a given voxel v, v \in \{1, \ldots, V\}, with test statistic t_v, the probability that t_v is larger than some cutoff t is denoted by:

\mathsf{P}(t_v > t) = 1 - F_v(t)

where F_v(t) is the cumulative distribution function (cdf) of the test statistic. If the cutoff t is used to accept or reject a hypothesis, then we say that we have a false positive if an observed t_v is larger than t when there is no actual true effect. A false positive is also known as error type I (in this post, the only type of error discussed is of the type I).

For an image (or any other set of tests), if there is an error anywhere, we say that a family-wise error has occurred. We can therefore define a “family-wise null hypothesis” that there is no signal anywhere; to reject this hypothesis, it suffices to have a single, lonely voxel in which t_v > t. With many voxels, the chances of this happening increase, even if there no effect is present. We can, however, adjust our cuttoff t to some other value t_{\text{FWER}} so that the probability of rejecting such family-wise null hypothesis remains within a certain level, say \alpha_{\text{FWER}}.

Union-intersection tests

The “family-wise null hypothesis” is effectively a joint null hypothesis that there is no effect anywhere. That is, it is an union-intersection test (UIT; Roy, 1953). This joint hypothesis is retained if all tests have statistics that are below the significance cutoff. What is the probability of this happening? From the above we know that \mathsf{P}(t_v \leqslant t) = F_v(t). The probability of the same happening for all voxels simultaneously is, therefore, simply the product of such probabilities, assuming of course that the voxels are all independent:

\mathsf{P}(\bigwedge_v t_v \leqslant t) = \prod_v \mathsf{P}(t_v \leqslant t) = \prod_v F_v(t)

Thus, the probability that any voxel has a significant result, which would lead to the occurrence of a family-wise error, is 1-\prod_v F_v(t). If all voxels have identical distribution under the null, then the same is stated as 1- F_v(t)^V.

Distribution of the maximum

Consider the maximum of the set of V voxels, that is, M = \max{(t_v)}. The random variable M is only smaller or equal than some cutoff t if all values t_v are smaller or equal than t. If the voxels are independent, this enables us to derive the cdf of M:

\mathsf{P}(M \leqslant t) = \prod_v \mathsf{P}(t_v \leqslant t) = \prod_v F_v(t).

Thus, the probability that M is larger than some threshold t is 1-\prod_v F_v(t). If all voxels have identical distribution under the null, then the same is stated as 1- F_v(t)^V.

These results, lo and behold, are the same as those used for the UIT above, hence how the distribution of the maximum can be used to control the family-wise error rate (if the distribution of the maximum is computed via permutations, independence is not required).


The above is not the only way in which we can see why the distribution of the maximum allows the control of the family-wise error rate. The work by Marcus, Peritz and Gabriel (1976) showed that, in the context of multiple testing, the null hypothesis for a particular test v can be rejected provided that all possible joint (multivariate) tests done within the set and including v are also significant, and doing so controls the family-wise error rate. For example, if there are four tests, v \in \{1, 2, 3, 4\}, the test in v=1 is considered significant if the joint tests using (1,2,3,4), (1,2,3), (1,2,4), (1,3,4), (1,2), (1,3), (1,4) and (1) are all significant (that is, all that include v=1). Such joint test can be quite much any valid test, including Hotelling’s T^2, MANOVA/MANCOVA, or NPC (Non-Parametric Combination), all of which are based on recomputing the test statistic from the original data, or others, based on the test statistics or p-values of each of the elementary V tests, as in a meta-analysis.

Such closed testing procedure (CTP) incurs an additional problem, though: the number of joint tests that needs to be done is 2^V-1, which in imaging applications renders them unfeasible. However, there is one particular joint test that provides a direct algorithmic shortcut: using the \max(t_v) as the test statistic for the joint test. The maximum across all V tests is also the maximum for any subset of tests, such that these can be skipped altogether. This gives a vastly efficient algorithmic shortcut to a CTP, as shown by Westfall and Young (1993).

Simple intuition

One does not need to chase the original papers cited above (although doing so cannot hurt). Broadly, the same can be concluded based solely on intuition: if the distribution of some test statistic that is not the distribution of the maximum within an image were used as the reference to compute the (FWER-adjusted) p-values at a given voxel v, then the probability of finding a voxel with a test statistic larger than t_v anywhere could not be determined: there could always be some other voxel v', with an even larger statistic (i.e., t_{v'} > t_v), but the probability of such happening would not be captured by the distribution of a non-maximum. Hence the chance of finding a significant voxel anywhere in the image under the null hypothesis (the very definition of FWER) would not be controlled. Using the absolute maximum eliminates this logical leakage.


  • Marcus R, Peritz E, Gabriel KR. On closed testing pocedures with special reference to ordered analysis of variance. Biometrika. 1976 Dec;63(3):655.
  • Nichols T, Hayasaka S. Controlling the familywise error rate in functional neuroimaging: a comparative review. Stat Methods Med Res. 2003 Oct;12(5):419–46.
  • Roy SN. On a heuristic method of test construction and its use in multivariate analysis. Ann Math Stat. 1953 Jun;24(2):220–38.
  • Westfall PH, Young SS. Resampling-based multiple testing: examples and methods for p-value adjustment. New York, Wiley, 1993.

Better statistics, faster

Faster permutation inference

Permutation tests are more robust and help to make scientific results more reproducible by depending on fewer assumptions. However, they are computationally intensive as recomputing a model thousands of times can be slow. The purpose of this post is to briefly list some options available for speeding up permutation.

Firstly, no speed-ups may be needed: for small sample sizes, or low resolutions, or small regions of interest, a permutation test can run in a matter of minutes. For larger data, however, accelerations may be of use. One option is acceleration through parallel processing or GPUs (for example applications of the latter, see Eklund et al., 2012, Eklund et al., 2013 and Hernández et al., 2013; references below), though this does require specialised implementation. Another option is to reduce the computational burden by exploiting the properties of the statistics and their distributions. A menu of options includes:

  • Do few permutations (shorthand name: fewperms). The results remain valid on average, although the p-values will have higher variability.
  • Keep permuting until a fixed number of permutations with statistic larger than the unpermuted is found (a.k.a., negative binomial; shorthand name: negbin).
  • Do a few permutations, then approximate the tail of the permutation distribution by fitting a generalised Pareto distribution to its tail (shorthand name: tail).
  • Approximate the permutation distribution with a gamma distribution, using simple properties of the test statistic itself, amazingly not requiring any permutations at all (shorthand name: noperm).
  • Do a few permutations, then approximate the full permutation distribution by fitting a gamma distribution (shorthand name: gamma).
  • Run permutations on only a few voxels, then fill the missing ones using low-rank matrix completion theory (shorthand name: lowrank).

These strategies allow accelerations >100x, yielding nearly identical results as in the non-accelerated case. Some, such as tail approximation, are generic enough to be used nearly all the most common scenarios, including univariate and multivariate tests, spatial statistics, and for correction for multiple testing.

In addition to accelerating permutation tests, some of these strategies, such as tail and noperm, allow continuous p-values to be found, and refine the p-values far into the tail of the distribution, thus avoiding the usual discreteness of p-values, which can be a problem in some applications if too few permutations are done.

These methods are available in the tool PALM — Permutation Analysis of Linear Models — and the complete description, evaluation, and application to the re-analysis of a voxel-based morphometry study (Douaud et al., 2007) have been just published in Winkler et al., 2016 (for the Supplementary Material, click here). The paper includes a flow chart prescribing these various approaches for each case, reproduced below.

Faster permutation inference

The hope is that these accelerations will facilitate the use of permutation tests and, if used in combination with hardware and/or software improvements, can further expedite computation leaving little reason not to use these tests.


Contributed to this post: Tom Nichols, Ged Ridgway.