Why the maximum statistic?

In brain imaging, each voxel (or vertex, or face, or edge) constitutes a single statistical test. Because thousands of such voxels are present in an image, a single experiment results in thousands of statistical tests being performed. The p-value is the probability of finding a test statistic at least as large as the one observed in a given voxel, provided that no effect is present. A p-value of 0.05 indicates that, if an experiment is repeated 20 times and there are no effects, on average one of these repetitions will be considered significant. If thousands of tests are performed, the chance of obtaining a spuriously significant result in at least one voxel increases: if there are 1000 voxels, all at the same test level \alpha = 0.05, we expect, on average, to find 50 significant tests, even in the absence of any effect. This is known as the multiple testing problem. A review of the topic for brain imaging is provided in Nichols and Hayasaka (2003) [see references at the end].
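As a quick numerical illustration of the last point (a minimal Octave/MATLAB sketch; the numbers are those of the paragraph above, and independence between voxels is assumed):

% Expected number of false positives and chance of at least one
% false positive across V independent tests at level alpha.
alpha = 0.05;                    % per-voxel test level
V     = 1000;                    % number of voxels (tests)
expected_fp = alpha * V          % expected false positives: 50
fwer        = 1 - (1 - alpha)^V  % P(at least one false positive): ~1

The second quantity is precisely the family-wise error rate discussed next.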

To take the multiple testing problem into account, either the test level (\alpha) or the p-values can be adjusted, such that instead of controlling the error rate at each individual test, the error rate is controlled for the whole set (family) of tests. Controlling such a family-wise error rate (FWER) ensures that the chance of finding a significant result anywhere in the image stays within a certain predefined level. For example, if there are 1000 voxels, and the FWER-adjusted test level is 0.05, we expect that, if the experiment is repeated for all the voxels 20 times, then on average in one of these repetitions there will be an error somewhere in the image. The adjustment of the p-values or of the test level is done using the distribution of the maximum statistic, something that most readers of this blog are certainly well aware of, as it has permeated most of the imaging literature since the early 1990s.

Have you ever wondered why? What is so special about the distribution of the maximum that makes it useful to correct the error rate when there are multiple tests?

Definitions first

Say we have a set of V voxels in an image. For a given voxel v, v \in \{1, \ldots, V\}, with test statistic t_v, the probability that t_v is larger than some cutoff t is denoted by:

\mathsf{P}(t_v > t) = 1 - F_v(t)

where F_v(t) is the cumulative distribution function (cdf) of the test statistic. If the cutoff t is used to accept or reject a hypothesis, then we say that we have a false positive if an observed t_v is larger than t when there is no actual true effect. A false positive is also known as a type I error (in this post, the only type of error discussed is type I).

For an image (or any other set of tests), if there is an error anywhere, we say that a family-wise error has occurred. We can therefore define a “family-wise null hypothesis” that there is no signal anywhere; to reject this hypothesis, it suffices to have a single, lonely voxel in which t_v > t. With many voxels, the chances of this happening increase, even if no effect is present. We can, however, adjust our cutoff t to some other value t_{\text{FWER}} so that the probability of rejecting such a family-wise null hypothesis remains within a certain level, say \alpha_{\text{FWER}}.

Union-intersection tests

The “family-wise null hypothesis” is effectively a joint null hypothesis that there is no effect anywhere. That is, it is a union-intersection test (UIT; Roy, 1953). This joint hypothesis is retained if all tests have statistics that are below the significance cutoff. What is the probability of this happening? From the above we know that \mathsf{P}(t_v \leqslant t) = F_v(t). The probability of the same happening for all voxels simultaneously is, therefore, simply the product of such probabilities, assuming of course that the voxels are all independent:

\mathsf{P}(\bigwedge_v t_v \leqslant t) = \prod_v \mathsf{P}(t_v \leqslant t) = \prod_v F_v(t)

Thus, the probability that any voxel has a significant result, which would lead to the occurrence of a family-wise error, is 1-\prod_v F_v(t). If all voxels have identical distribution under the null, then the same is stated as 1- F_v(t)^V.

Distribution of the maximum

Consider the maximum of the set of V voxels, that is, M = \max{(t_v)}. The random variable M is smaller than or equal to some cutoff t only if all values t_v are smaller than or equal to t. If the voxels are independent, this enables us to derive the cdf of M:

\mathsf{P}(M \leqslant t) = \prod_v \mathsf{P}(t_v \leqslant t) = \prod_v F_v(t).

Thus, the probability that M is larger than some threshold t is 1-\prod_v F_v(t). If all voxels have identical distribution under the null, then the same is stated as 1- F_v(t)^V.

These results, lo and behold, are the same as those used for the UIT above, hence the distribution of the maximum can be used to control the family-wise error rate (if the distribution of the maximum is computed via permutations, independence is not required).
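This equivalence can be verified numerically. Below is a minimal Octave/MATLAB sketch (assuming, for illustration only, V independent standard normal statistics under the null) comparing the empirical distribution of the maximum with F(t)^V:

% Monte Carlo check that P(M <= t) = F(t)^V under independence.
V = 1000;                            % number of voxels
R = 10000;                           % number of simulated null "experiments"
t = 3.5;                             % an example cutoff
T = randn(R, V);                     % null statistics (standard normal assumed)
M = max(T, [], 2);                   % maximum statistic of each experiment
empirical   = mean(M <= t)           % empirical P(M <= t)
F           = 0.5*erfc(-t/sqrt(2));  % standard normal cdf at t
theoretical = F^V                    % F(t)^V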

Closure

The above is not the only way in which we can see why the distribution of the maximum allows control of the family-wise error rate. The work by Marcus, Peritz and Gabriel (1976) showed that, in the context of multiple testing, the null hypothesis for a particular test v can be rejected provided that all possible joint (multivariate) tests done within the set and including v are also significant, and doing so controls the family-wise error rate. For example, if there are four tests, v \in \{1, 2, 3, 4\}, the test in v=1 is considered significant if the joint tests using (1,2,3,4), (1,2,3), (1,2,4), (1,3,4), (1,2), (1,3), (1,4) and (1) are all significant (that is, all that include v=1). Such a joint test can be pretty much any valid test, including Hotelling’s T^2, MANOVA/MANCOVA, or NPC (Non-Parametric Combination), all of which are based on recomputing the test statistic from the original data, or others, based on the test statistics or p-values of each of the elementary V tests, as in a meta-analysis.

Such a closed testing procedure (CTP) incurs an additional problem, though: the number of joint tests that needs to be done is 2^V-1, which in imaging applications renders it infeasible. However, there is one particular joint test that provides a direct algorithmic shortcut: using \max(t_v) as the test statistic for the joint test. The maximum across all V tests is also the maximum for any subset of tests, such that these subsets can be skipped altogether. This gives a vastly more efficient algorithmic shortcut to a CTP, as shown by Westfall and Young (1993).
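In practice, the shortcut amounts to comparing each observed statistic against the permutation distribution of the image-wise maximum. A minimal Octave/MATLAB sketch is below, in the spirit of the single-step procedure of Westfall and Young (1993); the variable names are illustrative, and it assumes a matrix of permuted statistics has already been computed, with the unpermuted case in the first row:

% Single-step FWER-adjusted p-values via the maximum statistic (sketch).
% Tperm: P x V matrix of statistics, one row per permutation; the first
%        row holds the statistics for the unpermuted (observed) data.
function pfwer = maxt_adjust(Tperm)
  Tobs    = Tperm(1, :);          % observed statistic at each voxel
  maxdist = max(Tperm, [], 2);    % image-wise maximum for each permutation
  P       = size(Tperm, 1);       % number of permutations (incl. unpermuted)
  pfwer   = zeros(size(Tobs));
  for v = 1:numel(Tobs)
    % fraction of permutations in which the maximum anywhere in the
    % image reaches or exceeds the observed statistic at voxel v
    pfwer(v) = sum(maxdist >= Tobs(v)) / P;
  end
end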

Simple intuition

One does not need to chase the original papers cited above (although doing so cannot hurt). Broadly, the same can be concluded based solely on intuition: if the distribution of some test statistic that is not the distribution of the maximum within an image were used as the reference to compute the (FWER-adjusted) p-values at a given voxel v, then the probability of finding a voxel with a test statistic larger than t_v anywhere could not be determined: there could always be some other voxel v', with an even larger statistic (i.e., t_{v'} > t_v), but the probability of that happening would not be captured by the distribution of a non-maximum. Hence the chance of finding a significant voxel anywhere in the image under the null hypothesis (the very definition of FWER) would not be controlled. Using the absolute maximum eliminates this logical leakage.

References

  • Marcus R, Peritz E, Gabriel KR. On closed testing procedures with special reference to ordered analysis of variance. Biometrika. 1976 Dec;63(3):655.
  • Nichols T, Hayasaka S. Controlling the familywise error rate in functional neuroimaging: a comparative review. Stat Methods Med Res. 2003 Oct;12(5):419–46.
  • Roy SN. On a heuristic method of test construction and its use in multivariate analysis. Ann Math Stat. 1953 Jun;24(2):220–38.
  • Westfall PH, Young SS. Resampling-based multiple testing: examples and methods for p-value adjustment. New York, Wiley, 1993.

Better statistics, faster

Faster permutation inference

Permutation tests are more robust and help to make scientific results more reproducible by depending on fewer assumptions. However, they are computationally intensive, as recomputing a model thousands of times can be slow. The purpose of this post is to briefly list some options available for speeding up permutation tests.

Firstly, no speed-ups may be needed: for small sample sizes, or low resolutions, or small regions of interest, a permutation test can run in a matter of minutes. For larger data, however, accelerations may be of use. One option is acceleration through parallel processing or GPUs (for example applications of the latter, see Eklund et al., 2012, Eklund et al., 2013 and Hernández et al., 2013; references below), though this does require specialised implementation. Another option is to reduce the computational burden by exploiting the properties of the statistics and their distributions. A menu of options includes:

  • Do few permutations (shorthand name: fewperms). The results remain valid on average, although the p-values will have higher variability.
  • Keep permuting until a fixed number of permutations with a statistic larger than the unpermuted one is found (a.k.a. negative binomial; shorthand name: negbin).
  • Do a few permutations, then approximate the tail of the permutation distribution by fitting a generalised Pareto distribution to its tail (shorthand name: tail).
  • Approximate the permutation distribution with a gamma distribution, using simple properties of the test statistic itself, amazingly not requiring any permutations at all (shorthand name: noperm).
  • Do a few permutations, then approximate the full permutation distribution by fitting a gamma distribution (shorthand name: gamma).
  • Run permutations on only a few voxels, then fill the missing ones using low-rank matrix completion theory (shorthand name: lowrank).

These strategies allow accelerations of >100x, yielding results nearly identical to those of the non-accelerated case. Some, such as the tail approximation, are generic enough to be used in nearly all the most common scenarios, including univariate and multivariate tests, spatial statistics, and correction for multiple testing.
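As a rough illustration of how one of these strategies works, the sketch below (Octave/MATLAB) implements the stopping rule behind negbin for a single test: permute until a predefined number of exceedances has been seen, then stop. The function permfun and the simple ratio used as the p-value estimate are assumptions of this sketch; see the paper for the exact estimator and for the other methods.

% Negative binomial ("negbin") stopping rule for a single test (sketch).
% permfun(): user-supplied function returning the statistic for one
%            random permutation of the data (assumed to exist).
% Tobs:      statistic computed from the unpermuted data.
% r:         number of exceedances to wait for; maxperms: safety limit.
function p = negbin_pvalue(permfun, Tobs, r, maxperms)
  exceed = 0;
  n      = 0;
  while exceed < r && n < maxperms
    n = n + 1;
    if permfun() >= Tobs
      exceed = exceed + 1;
    end
  end
  % simple estimate used here for illustration; clearly non-significant
  % tests stop early, so few permutations are wasted on them
  p = (exceed + 1) / (n + 1);
end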

In addition to accelerating permutation tests, some of these strategies, such as tail and noperm, allow continuous p-values to be found, and refine the p-values far into the tail of the distribution, thus avoiding the usual discreteness of p-values, which can be a problem in some applications if too few permutations are done.

These methods are available in the tool PALM — Permutation Analysis of Linear Models — and the complete description, evaluation, and application to the re-analysis of a voxel-based morphometry study (Douaud et al., 2007) have just been published in Winkler et al., 2016 (for the Supplementary Material, click here). The paper includes a flow chart prescribing these various approaches for each case, reproduced below.

Flow chart prescribing the various acceleration approaches for each case (from Winkler et al., 2016).

The hope is that these accelerations will facilitate the use of permutation tests and, if used in combination with hardware and/or software improvements, can further expedite computation, leaving little reason not to use these tests.

References

Contributed to this post: Tom Nichols, Ged Ridgway.

Three HCP utilities

If you are working with data from the Human Connectome Project (HCP), perhaps these three small Octave/MATLAB utilities may be of some use:

  • hcp2blocks.m: Takes the restricted file with information about kinship and zygosity and produces a multi-level exchangeability blocks file that can be used with PALM for permutation inference. It is fully described here.
  • hcp2solar.m: Takes restricted and unrestricted files to produce a pedigree file that can be used with SOLAR for heritability and genome-wide association analyses.
  • picktraits.m: Takes either restricted or unrestricted files, a list of traits and a list of subject IDs to produce tables with selected traits for the selected subjects. These can be used to, e.g., produce design matrices for subsequent analysis.

These functions need to parse relatively large CSV files, which is somewhat inefficient in MATLAB and Octave. Still, since these commands usually have to be executed only once for a particular analysis, a 1-2 minute wait seems acceptable.

If downloaded directly from the above links, remember also to download the prerequisites: strcsvread.m and strcsvwrite.m. Alternatively, clone the full repository from GitHub. The link is this. Other tools may be added in the future.

Downsampling (decimating) a brain surface

Downsampled average cortical surfaces at different iterations (n), with the respective number of vertices (V), edges (E) and faces (F).

In the previous post, a method to display brain surfaces interactively in PDF documents was presented. While the method is already much more efficient than it was when it first appeared some years ago, the display of highly resolved meshes can be computationally intensive, and may make even the most enthusiastic readers give up on even opening the file.

If the data being shown have low spatial frequency, an alternative that generally preserves the amount of information is to decimate the mesh, downsampling it to a lower resolution. Although in theory this can be done in the native (subject-level) geometry through retessellation (i.e., interpolation of coordinates), the interest in downsampling usually is at the group level, in which case the subjects have all been interpolated to a common grid, which in general is a geodesic sphere produced by recursively subdividing an icosahedron (see this earlier post). If, at each iteration, the vertices and faces are added in a certain order (such as in FreeSurfer‘s fsaverage or in the one generated with the platonic command), downsampling is relatively straightforward, whatever the type of data.

Vertexwise data

For vertexwise data, downsampling can be based on the fact that vertices are added (appended) in a certain order as the icosahedron is constructed:

  • Vertices 1-12 correspond to n = 0, i.e., no subdivision, or ico0.
  • Vertices 13-42 correspond to the vertices that, once added to the ico0, make it ico1 (first iteration of subdivision, n = 1).
  • Vertices 43-162 correspond to the vertices that, once added to ico1, make it ico2 (second iteration, n = 2).
  • Vertices 163-642, likewise, make ico3.
  • Vertices 643-2562 make ico4.
  • Vertices 2563-10242 make ico5.
  • Vertices 10243-40962 make ico6, etc.

Thus, if the data are vertexwise (also known as curvature data, such as cortical thickness or curvature indices proper), the above information is sufficient to downsample them: to reduce down to an ico3, for instance, all one needs to do is pick vertices 1 through 642, ignoring 643 onwards.
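In code, this selection is a one-liner. A minimal Octave/MATLAB sketch, assuming the vertexwise data are stored as a simple vector ordered as in fsaverage or as produced by the platonic command:

% Downsample vertexwise data to ico(n) by keeping the first 10*4^n + 2
% vertices (these are exactly the vertices of the lower-order mesh).
function datadown = downsample_vertexwise(data, n)
  nV       = 10 * 4^n + 2;    % e.g., 642 for n = 3 (ico3)
  datadown = data(1:nV);
end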

Facewise data

Data stored at each face (triangle) generally correspond to areal quantities, which require mass conservation. For both fsaverage and platonic icosahedrons, the faces are added in a particular order such that, at each iteration of the subdivision, a given face index is replaced in situ by four other faces: one can simply collapse (via sum or average) the data of every four such faces into a new one, as sketched below.
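A sketch of this collapse in Octave/MATLAB is below. It assumes that the four child faces of each parent face are stored as a contiguous group of four; the actual ordering used by fsaverage and the platonic command should be confirmed in icodown.m (see the script section further down), and the reshape adjusted if the children are instead interleaved.

% Collapse facewise (areal) data from ico(n) to ico(n-1) (sketch).
% Uses 'sum' so that mass (e.g., surface area) is conserved.
function facedown = downsample_facewise(facedata)
  facedata = facedata(:);
  grouped  = reshape(facedata, 4, numel(facedata)/4); % one column per parent face (assumed ordering)
  facedown = sum(grouped, 1)';
end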

Surface geometry

If the objective is to decimate the surface geometry, i.e., the mesh itself, as opposed to quantities assigned to vertices or faces, one can use similar steps:

  1. Select the vertices from the first up to the last vertex of the icosahedron in the level needed.
  2. Iteratively downsample the face indices by selecting first those that are formed by three vertices that were appended for the current iteration, then for those that have two vertices appended in the current iteration, then connecting the remaining three vertices to form a new, larger face.

Applications

Using downsampled data is useful not only to display meshes in PDF documents; some analyses may not require as high a resolution as the default mesh (ico7), particularly for processes that vary smoothly across the cortex, such as cortical thickness. Using a lower resolution mesh can be just as informative, while operating at a fraction of the computational cost.

A script

A script that does the tasks above using Matlab/Octave is here: icodown.m. It is also available as part of the areal package described here, which also includes all its dependencies. Input and output formats are described here.

Interactive 3D brains in PDF documents

A screenshot from Acrobat Reader. The example file is here.

Would it not be helpful to be able to navigate through tri-dimensional, surface-based representations of the brain when reading a paper, without having to download separate datasets, or use external software? Since 2004, with the release of version 1.6 of the Portable Document Format (PDF), this has been possible. However, the means to generate the files were not easily available until about 2008, when Intel released a set of libraries and tools. This still did not do much to improve popularity, as in-document rendering of complex 3D models requires a lot of memory and processing, making its use difficult in practice at the time. The fact that Acrobat Reader was rather bloated did not help much either.

Now, almost eight years later, things have become easier for users who want to open these documents. Newer versions of Acrobat are much lighter, and the capabilities of ordinary computers have increased. Yet, it seems the interest in this kind of visualisation has faded. The objective of this post is to show that it is remarkably simple to have interactive 3D objects in PDF documents, which can be used in any document published online, including theses, presentations, and papers: journals such as PNAS and the Journal of Neuroscience are at the forefront in accepting interactive manuscripts.

Requirements

  • U3D Tools: Make sure you have the IDTFConverter utility, from the U3D tools, available on SourceForge as part of the MathGL library. A direct link to version 1.4.4 is here; an alternative link, of a repackaged version of the same, is here. Compiling instructions for Linux and Mac are in the “readme” file. There are some dependencies that must be satisfied, and are described in the documentation. If you decide not to install the U3D tools, but only compile them, make sure the path of the executable is both in the $PATH and in the $LD_LIBRARY_PATH. This can be done with:
cd /path/to/the/directory/of/IDTFConverter
export PATH=${PATH}:$(pwd)
export LD_LIBRARY_PATH=${LD_LIBRARY_PATH}:$(pwd)
  • The ply2idtf function: Make sure you have the latest version of the areal package, which contains the MATLAB/Octave function ply2idtf.m used below.
  • Certain LaTeX packages: The packages movie15 or media9, which allow embedding the 3D object into the PDF using LaTeX. Either will work. Below, the older movie15 package is assumed.

Step 1: Generate the PLY maps

Once you have a map of vertexwise cortical data that needs to be shown, follow the instructions from this earlier blog post that explains how to generate Stanford PLY files to display colour-coded vertexwise data. These PLY files will be used below.

Step 2: Convert the PLY to IDTF files

IDTF stands for Intermediate Data Text Format. As the name implies, it is a text, intermediate file, used as a step before the creation of the U3D files, the latter being what is embedded into the PDF. Use the function ply2idtf for this:

ply2idtf(...
   {'lh.pial.thickness.avg.ply','LEFT', eye(4);...
    'rh.pial.thickness.avg.ply','RIGHT',eye(4)},...
    'thickness.idtf');

The first argument is a cell array with 3 columns, and as many rows as PLY files being added to the IDTF file. The first column contains the file name, the second the label (or node name) for that file, and the third an affine matrix that maps the coordinates from the PLY file to the world coordinate system of the (to be created) U3D. The second (last) argument to the command is the name of the output file.

Step 3: Convert the IDTF to U3D files

From a terminal window (not MATLAB or Octave), run:

IDTFConverter -input thickness.idtf -output thickness.u3d

Step 4: Configure default views

Here we use the older movie15 LaTeX package; the same can be accomplished with the newer media9 package. Various viewing options are configurable, all of which are described in the documentation. These options can be saved in a text file with the extension .vws, and later supplied in the LaTeX document. An example is below.

VIEW=Both Hemispheres
  COO=0 -14 0,
  C2C=-0.75 0.20 0.65
  ROO=325
  AAC=30
  ROLL=-0.03
  BGCOLOR=.5 .5 .5
  RENDERMODE=Solid
  LIGHTS=CAD
  PART=LEFT
    VISIBLE=true
  END
  PART=RIGHT
    VISIBLE=true
  END
END
VIEW=Left Hemisphere
  COO=0 -14 0,
  C2C=-1 0 0
  ROO=325
  AAC=30
  ROLL=-0.03
  BGCOLOR=.5 .5 .5
  RENDERMODE=Solid
  LIGHTS=CAD
  PART=LEFT
    VISIBLE=true
  END
  PART=RIGHT
    VISIBLE=false
  END
END
VIEW=Right Hemisphere
  COO=0 -14 0,
  C2C=1 0 0
  ROO=325
  AAC=30
  ROLL=0.03
  BGCOLOR=.5 .5 .5
  RENDERMODE=Solid
  LIGHTS=CAD
  PART=LEFT
    VISIBLE=false
  END
  PART=RIGHT
    VISIBLE=true
  END
END

Step 5: Add the U3D to the LaTeX source

Interactive, 3D viewing is unfortunately not supported by most PDF readers. However, it is supported by the official Adobe Acrobat Reader since version 7.0, including the recent version DC. Thus, it is important to let the users/readers of the document know that they must open the file using a recent version of Acrobat. This can be done in the document itself, using a message placed with the option text of the \includemovie command of the movie15 package. A minimalistic LaTeX source is shown below (it can be downloaded here).

\documentclass{article}

% Relevant package:
\usepackage[3D]{movie15}

% pdfLaTeX and color links setup:
\usepackage{color}
\usepackage[pdftex]{hyperref}
\definecolor{colorlink}{rgb}{0, 0, .6}  % dark blue
\hypersetup{colorlinks=true,citecolor=colorlink,filecolor=colorlink,linkcolor=colorlink,urlcolor=colorlink}

\begin{document}
\title{Interactive 3D brains in PDF documents}
\author{}
\date{}
\maketitle

\begin{figure}[!h]
\begin{center}
\includemovie[
text=\fbox{\parbox[c][9cm][c]{9cm}{\centering {\footnotesize (Use \href{http://get.adobe.com/reader/}{Adobe Acrobat Reader 7.0+} \\to view the interactive content.)}}},
poster,label=Average,3Dviews2=pial.vws]{10cm}{10cm}{thickness.u3d}
\caption{An average 3D brain, showing colour-coded average thickness (for simplicity, colour scale not shown). Click to rotate. Right-click for a menu with various options. Details at \href{http://brainder.org}{http://brainder.org}.}
\end{center}
\end{figure}

\end{document}

Step 6: Generate the PDF

For LaTeX, use pdfLaTeX as usual:

pdflatex document.tex

What you get

After generating the PDF, the result of this example is shown here (a screenshot is at the top). It is possible to rotate in any direction, zoom, pan, change views to predefined modes, and alternate between orthogonal and perspective projections. It’s also possible to change rendering modes (including transparency), and experiment with various lighting options.

In Acrobat Reader, by right-clicking, a menu with various useful options is presented. A toolbar (as shown in the top image) can also be enabled through the menu.

The same strategy works also with the Beamer class, such that interactive slides can be created and used in talks, and with XeTeX, allowing the use of a richer variety of text fonts.

See also

  • Wikipedia has an article on U3D files.
  • Alexandre Gramfort has developed a set of tools that cover pretty much the same as above. They are freely available in Matlab FileExchange.
  • To display molecules interactively (including proteins), the steps are similar. Instructions for Jmol and Pymol are available.
  • Commercial products offering features that build on these resources are also available.

Extreme value notes

Extreme values are useful to quantify the risk of catastrophic floods, and much more.

This is a brief set of notes with an introduction to extreme value theory. For reviews, see Leadbetter et al (1983) and Davison and Huser (2015) [references at the end]. Also of some (historical) interest might be the classical book by Gumbel (1958). Let X_1, \dots, X_n be a sequence of independent and identically distributed variables with cumulative distribution function (cdf) F(x), and let M_n =\max(X_1,\dots,X_n) denote the maximum.

If F(x) is known, the distribution of the maximum is:

\begin{array}{lll} P(M_n \leqslant x) &=&P(X_1 \leqslant x, \dots, X_n \leqslant x) \\ &=& P(X_1 \leqslant x) \cdots P(X_n \leqslant x) = F^n(x). \end{array}

The distribution function F(x) might, however, not be known. If data are available, it can be estimated, although small errors in the estimation of F(x) can lead to large errors concerning the extreme values. Instead, an asymptotic result is given by the extremal types theorem, also known as the Fisher-Tippett-Gnedenko theorem, First Theorem of Extreme Values, or extreme value trinity theorem (the last name used by Pickands III, 1975).

But before that, let’s make a small variable change. Working with M_n directly is problematic because as n \rightarrow \infty, F^n(x) \rightarrow 0. Redefining the problem as a function of M_n^* = \frac{M_n-b_n}{a_n} renders treatment simpler. The theorem can be stated then as: If there exist sequences of constants a_n \in \mathbb{R}_{+} and b_n \in \mathbb{R} such that, as n \rightarrow \infty:

P\left(M_{n}^{*} \leqslant x \right) \rightarrow G(x)

then G(x) belongs to one of three “domains of attraction”:

  • Type I (Gumbel law): \Lambda(x) = e^{-e^{-\frac{x-b}{a}}}, for x \in \mathbb{R} indicating that the distribution of M_n has an exponential tail.
  • Type II (Fréchet law): \Phi(x) = \begin{cases} 0 & x\leqslant b \\ e^{-\left(\frac{x-b}{a}\right)^{-\alpha}} & x\; \textgreater\; b \end{cases} indicating that the distribution of M_n has a heavy tail (including polynomial decay).
  • Type III (Weibull law): \Psi(x) = \begin{cases} e^{-\left( -\frac{x-b}{a}\right)^\alpha} & x\;\textless\; b \\ 1 & x\geqslant b \end{cases} indicating that the distribution of M_n has a light tail with finite upper bound.

Note that in the above formulation, the Weibull is reversed so that the distribution has an upper bound, as opposed to a lower one as in the Weibull distribution. Also, the parameterisation is slightly different than the one usually adopted for the Weibull distribution.

These three families have parameters a\; \textgreater\; 0, b and, for families II and III, \alpha\; \textgreater\; 0. To which of the three a particular F(x) is attracted is determined by the behaviour of the tail of the distribution for large x. Thus, we can infer about the asymptotic properties of the maximum while having only limited knowledge of the properties of F(x).

These three limiting cases are collectively termed extreme value distributions. Types II and III were identified by Fréchet (1927), whereas type I was found by Fisher and Tippett (1928). In his work, Fréchet used M_n^* = \frac{M_n}{a_n}, whereas Fisher and Tippett used M_n^* = \frac{M_n-b_n}{a_n}. Von Mises (1936) identified various sufficient conditions for convergence to each of these forms, and Gnedenko (1943) established a complete generalisation.

Generalised extreme value distribution

As shown above, the rescaled maxima converge in distribution to one of three families. However, all are cases of a single family that can be represented as:

G(x) = e^{-\left(1-\xi\left(\frac{x-\mu}{\sigma}\right)\right)^{\frac{1}{\xi}}}

defined on the set \left\{x:1-\xi\left(\frac{x-\mu}{\sigma}\right)\;\textgreater\;0\right\}, with parameters -\infty \;\textless \;\mu\;\textless\; \infty (location), \sigma\;\textgreater\;0 (scale), and -\infty\;\textless\;\xi\;\textless\;\infty (shape). This is the generalised extreme value (GEV) family of distributions. If \xi \rightarrow 0, it converges to the Gumbel (type I), whereas if \xi \;\textless\; 0 it corresponds to the Fréchet (type II), and if \xi\;\textgreater\;0 it corresponds to the Weibull (type III). Inference on \xi allows choice of a particular family for a given problem.
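To see why the Gumbel arises as \xi \rightarrow 0, let z = \frac{x-\mu}{\sigma} and note that \left(1-\xi z\right)^{\frac{1}{\xi}} \rightarrow e^{-z}, so that G(x) \rightarrow e^{-e^{-z}}, which is the type I form above with a=\sigma and b=\mu.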

Generalised Pareto distribution

For u\rightarrow\infty, the limiting distribution of a random variable Y=X-u, conditional on X \;\textgreater\; u, is:

H(y) = 1-\left(1-\frac{\xi y}{\tilde{\sigma}}\right)^{\frac{1}{\xi}}

defined for y \;\textgreater\; 0 and \left(1-\frac{\xi y}{\tilde{\sigma}}\right) \;\textgreater\; 0. The two parameters are the \xi (shape) and \tilde{\sigma} (scale). The shape corresponds to the same parameter \xi of the GEV, whereas the scale relates to the scale of the former as \tilde{\sigma}=\sigma-\xi(u-\mu).

The above is sometimes called the Pickands-Balkema-de Haan theorem or the Second Theorem of Extreme Values, and it defines another family of distributions, known as the generalised Pareto distribution (GPD). It generalises an exponential distribution with parameter \frac{1}{\tilde{\sigma}} as \xi \rightarrow 0, a uniform distribution in the interval \left[0, \tilde{\sigma}\right] when \xi = 1, and a Pareto distribution when \xi \;\textless\; 0.

Parameter estimation

By restricting attention to the most common case of -\frac{1}{2}<\xi<\frac{1}{2}, which represents distributions that are approximately exponential, parameters for the GPD can be estimated using at least three methods: maximum likelihood, moments, and probability-weighted moments. These are described in Hosking and Wallis (1987). For \xi outside this interval, methods have been discussed elsewhere (Oliveira, 1984). The method of moments is probably the simplest and fastest and, according to Hosking and Wallis (1987) and Knijnenburg et al (2009), has good performance for the typical cases of -\frac{1}{2}<\xi<\frac{1}{2}.

For a set of extreme observations, let \bar{x} and s^2 be respectively the sample mean and variance. The moment estimators of \tilde{\sigma} and \xi are \hat{\tilde{\sigma}} = \frac{\bar{x}}{2}\left(\frac{\bar{x}^2}{s^2}+1\right) and \hat{\xi}=\frac{1}{2}\left(\frac{\bar{x}^2}{s^2}-1\right).
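In Octave/MATLAB these estimators amount to a couple of lines (a sketch; y is assumed to hold the exceedances, i.e., the observed values minus the threshold u):

% Method-of-moments estimators for the GPD parameters described above.
% y: vector of exceedances, y = x - u for the observations x > u.
function [sigma_t, xi] = gpd_moments(y)
  xbar    = mean(y);
  s2      = var(y);
  sigma_t = 0.5 * xbar * (xbar^2/s2 + 1);  % scale (sigma tilde)
  xi      = 0.5 * (xbar^2/s2 - 1);         % shape
  % tail probability under the fitted GPD (parameterisation as above):
  % P(Y > y0) = (1 - xi*y0/sigma_t)^(1/xi)
end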

The accuracy of these estimates can be tested with, e.g., the Anderson-Darling goodness-of-fit test (Anderson and Darling, 1952; Choulakian and Stephens, 2001), based on the fact that, if the modelling is accurate, the p-values for the distribution should be uniformly distributed.

Availability

Statistics of extremes are used in PALM as a way to accelerate permutation tests. More details to follow soon.

References

The figure at the top (flood) is in public domain.

Non-Parametric Combination (NPC) for brain imaging

Have you ever had an analysis in which there was a large set of contrasts, all of interest, and you were worried about multiple testing? A possible effect would be missed by a simple Bonferroni correction, but you did not know what else to do? Or did you have a set of different studies and wish to obtain a meta-analytic style of result, indicating whether there would be evidence across all of them, without requiring the studies to all be consistently significant?

The Non-Parametric Combination (NPC) solves these issues. It is a way of performing joint inference on multiple data collected on the same experimental units (e.g., the same subjects), all with minimal assumptions. The method was proposed originally by Pesarin (1990, 1992) [see references below], independently by Blair and Karniski (1993), and described extensively by Pesarin and Salmaso (2010). In this blog entry, the NPC is presented in brief, with emphasis on the modifications we introduce to render it feasible for brain imaging. The complete details are in our paper that has just been published in the journal Human Brain Mapping.

NPC in a nutshell

The NPC consists of, in a first phase, testing each hypothesis separately using permutations that are performed synchronously across datasets; these tests are termed partial tests. The resulting statistics for each and every permutation are recorded, allowing an estimate of the complete empirical null distribution to be constructed for each one. In a second phase, the empirical p-values for each statistic are combined, for each permutation, into a joint statistic. As such a combined joint statistic is produced from the previous permutations, an estimate of its empirical distribution function is immediately known, and so is the p-value of the joint test. A flowchart of the original algorithm is shown below; click to see it side-by-side with the modified one (described below).

A host of combining functions

The null hypothesis of the NPC is that the null hypotheses for all partial tests are true, and the alternative hypothesis is that any is false, which is the same null as that of a union-intersection test (UIT; Roy, 1953). The rejection region depends on how the combined statistic is produced. Various combining functions, which produce such combined statistics, can be considered, and some of the most well known are listed below:

  • Tippett: statistic T = \min \left(p_{k}\right); p-value 1-\left(1-T\right)^{K}
  • Fisher: statistic T = -2 \sum_{k=1}^{K} \ln\left(p_{k}\right); p-value 1-\chi^{2}\left(T;\;\nu=2K\right)
  • Stouffer: statistic T = \frac{1}{\sqrt{K}} \sum_{k=1}^{K} \Phi^{-1}\left(1-p_{k}\right); p-value 1-\Phi\left(T;\;\mu=0,\;\sigma^2=1\right)
  • Mudholkar–George: statistic T = \frac{1}{\pi}\sqrt{\frac{3(5K+4)}{K(5K+2)}}\sum_{k=1}^{K} \ln\left(\frac{1-p_{k}}{p_{k}}\right); p-value 1-t_{\text{cdf}}(T;\;\nu=5K+4)

In the above, K is the number of partial tests, and the remaining variables follow the usual notation (see Table 1 in the paper for the complete description). Many of these combining functions were proposed over the years for applications such as meta-analyses; many of them assume independence between the tests being combined, and will give incorrect p-values if that assumption is not met. In the NPC, however, dependence is not a problem, even if these same functions are used: the synchronised permutations ensure that any dependence, if present, is taken into account, and this is done implicitly, with no need for explicit modelling.
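As a small illustration, the sketch below (Octave/MATLAB, using only base functions; this is not PALM's implementation) computes the Fisher and Stouffer combined statistics and their asymptotic p-values for a vector p of K partial-test p-values. In NPC proper, the combined statistic is instead referred to its permutation distribution.

% Combine a vector p of partial-test p-values (Fisher and Stouffer).
% gammainc gives the chi-squared cdf; erfc/erfcinv give the normal
% cdf and its inverse, so no statistics toolbox is needed.
p  = [0.01 0.20 0.04];                  % example partial-test p-values
K  = numel(p);
Tf = -2 * sum(log(p));                  % Fisher statistic
pf = 1 - gammainc(Tf/2, K);             % 1 - chi2cdf(Tf, 2K)
z  = -sqrt(2) * erfcinv(2*(1 - p));     % Phi^{-1}(1 - p_k)
Ts = sum(z) / sqrt(K);                  % Stouffer statistic
ps = 0.5 * erfc(Ts/sqrt(2));            % 1 - Phi(Ts)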

The different combining functions lead to different rejection regions for the null hypothesis. For the four combining functions in the table above, the respective rejection regions are in the figure below.

The combining functions can be modified to allow combination of tests so as to favour hypotheses with concordant directions, or be modified for bi-directional tests. Click on the figure above for examples of these cases (again, see the paper for the complete details).

Two problems, one solution

The multiple testing problem is well known in brain imaging: as an image comprises thousands of voxels/vertices/faces, correction is necessary. Bonferroni is in general too conservative, and various other approaches have been proposed, such as the random field theory. Permutation tests provide control over the familywise error rate (FWER) for the multiple tests across space, requiring only the assumption of exchangeability. This is all well known; see Nichols and Hayasaka (2003) and Winkler et al. (2014) for details.

However, another type of multiple testing is also common: analyses that test multiple hypotheses using the same model, multiple pairwise group comparisons, multiple and distinct models, studies using multiple modalities, that mix imaging and non-imaging data, that consider multiple processing pipelines, and even multiple multivariate analyses. All these common cases also need multiple testing correction. We call this multiple testing problem MTP-II, to distinguish it from the well known multiple testing problem across space, described above, which we term MTP-I.

One of the many combining functions possible with NPC, the one proposed by Tippett (1931), has a further property that makes it remarkably interesting. The Tippett function uses the smallest p-value across partial tests as its test statistic. Alternatively, if all statistics are comparable, it can be formulated in terms of the maximum statistic. It turns out that the distribution of the maximum statistic across a set of tests is also the distribution that can be used in a closed testing procedure (Marcus et al., 1976) to correct for the familywise error rate (FWER) using resampling methods, such as permutation. In the context of joint inference, FWER-correction can also be seen as a UIT. Thus, NPC offers a link between combination of multiple tests and correction for multiple tests, in both cases regardless of any dependence between such tests.

This means that the MTP-II, for which correction in the parametric realm is either non-existing or fiendishly difficult, can be accommodated easily. It requires no explicit modelling of the dependence between the tests, and the resulting error rates are controlled exactly at the test level, adding rigour to what otherwise could lead to an excess of false positives without correction, or be overly conservative if a naïve correction such as Bonferroni were attempted.

Modifying for imaging applications

As originally proposed, in practice NPC cannot be used in brain imaging. As the statistics for all partial tests for all permutations need to be recorded, an enormous amount of space for data storage is necessary. Even if storage space were not a problem, the discreteness of the p-values for the partial tests is problematic when correcting for multiple testing, because with thousands of tests in an image, ties are likely to occur, further causing ties among the combined statistics. If too many tests across an image share the same most extreme statistic, correction for the MTP-I, while still valid, becomes less powerful (Westfall and Young, 1993; Pantazis et al., 2005). The most obvious workaround — run an ever larger number of permutations to break the ties — may not be possible for small sample sizes, or when possible, requires correspondingly larger data storage.

The solution is loosely based on the direct combination of the test statistics, by converting the test statistics of the partial tests to values that behave as p-values, using the asymptotic distribution of the statistics for the partial tests. We call these “u-values”, in order to emphasise that they are not meant to be read or interpreted as p-values, but rather as transitional values that allow combinations that otherwise would not be possible.

For spatial statistics, the asymptotic distribution of the combined statistic is used, this time to produce a z-score, which can be subjected to the computation of cluster extent, cluster mass, and/or threshold-free cluster enhancement (TFCE; Smith and Nichols, 2009). A flow chart of the modified algorithm is shown below. Click to see it side-by-side with the original.

More power, fewer assumptions

One of the most remarkable features of NPC is that the synchronised permutations implicitly account for the dependence structure among the partial tests. This means that even combining methods originally derived under the assumption of independence can be used when such independence is untenable. As the p-values are assessed via permutations, distributional restrictions are likewise not necessary, liberating NPC from most of the assumptions that thwart parametric methods in general. This renders NPC a good alternative to classical multivariate tests, such as MANOVA, MANCOVA, and Hotelling’s T^2 tests: each of the response variables can be seen as a univariate partial test in the context of the combination, but without the assumptions that are embodied in these old multivariate tests.

As if all the above were not already sufficient, NPC is also more powerful than such classical multivariate tests. This refers to its finite sample consistency property, that is, even with fixed sample size, as the number of modalities being combined increases, the power of the test also increases. The power of classical multivariate tests, however, increases up to a certain point, then begins to decrease, eventually reaching zero when the number of variables being combined matches the sample size.

The figure below summarises the analysis of a subset of the subjects of a published FMRI study (Brooks et al, 2005) in which painful stimulation was applied to the face, hand, and foot of 12 subjects. Using permutation tests separately, no results could be identified for any of the three types of stimulation. A simple multivariate test, the Hotelling’s T2 test, even assessed using permutations, did not reveal any effect of stimulation either. The NPC results, however, suggest involvement of large portions of the anterior insula and secondary somatosensory cortex. The Fisher, Stouffer and Mudholkar–George combining functions were particularly successful in recovering a small area of activity in the midbrain and periaqueductal gray area, which would be expected from previous studies on pain, but that could not be located from the original, non-combined data.


Detailed assessment of power, using variable number of modalities, and of modalities containing signal, is shown in the paper.

Combinations or conjunctions?

Combination, as done via NPC, is different from conjunctions (Nichols et al., 2005) in the following: in the combination, one seeks aggregate significance across partial tests, without the need for any individual study to be significant on its own. In the conjunction, it is necessary that all of them, with no exception, are significant. As indicated above, the NPC forms a union-intersection test (UIT; Roy, 1953), whereas conjunctions form an intersection-union test (IUT; Berger, 1982). The former can be said to be significant if any (or an aggregate) of the partial tests is significant, whereas the latter is significant only if all the partial tests are.

Availability

The NPC, with the modifications for brain imaging, is available in the tool PALM — Permutation Analysis of Linear Models. It runs in either Matlab or Octave, and is free (GPL).

References

Contributed to this post: Tom Nichols.

FSL on the Raspberry Pi


How about processing brain imaging data on a Raspberry Pi? The different versions of this little device have performed exceptionally well for education, entertainment, and a variety of do-it-yourself projects, with many examples listed on websites such as Instructables and Adafruit. Most of these applications are not as computationally intensive. Yet, the small size, low power consumption, improved hardware in recent models, and low price may make this feasible.

The Pi 2

Released earlier this year, the Raspberry Pi 2 (Model B) features a quad-core 900 MHz ARM processor, 1 GB of RAM, GPU, 4 USB ports, 10/100 Mbps Ethernet, HDMI and audio outputs, camera and display ports, as well as a low level general purpose interface (GPIO), all in a portable board of 85.6 mm × 56.5 mm (the same size as a credit card). It is powered by a 5 V, 800 mA DC (4 W) source, and sold for £30 or less.

Differently from earlier models, which had a CPU based on the ARMv6, the Pi 2 uses an ARM Cortex-A7 processor, which in turn is based on the ARMv7 architecture. Although there are Linux distributions that can run on the earlier models (such as ports of Debian, openSUSE, and Fedora), this change widens potential applications, not only because there are more ports available for ARMv7 (e.g., openSUSE, Debian, CentOS, among others), but also because the higher performance suggests that somewhat heavier data processing can be considered.

It is also possible to assemble multiple Pis in a cluster, using distributed computing engines such as SLURM, TORQUE or SGE. The Pi has the core requisites: it runs on Linux and comes with a decent 10/100 Mbps Ethernet port, such that creating a system is a matter of assembling the pieces and configuring.

Neuroimaging with a small footprint

With this relatively high amount of computing power in such a physically small size and at an affordable price, the question is immediate: is it feasible to do neuroimaging on the Pi? The availability of Linux distributions for ARM platforms suggests that yes. However, the binaries for imaging software distributed for popular platforms such as x86 (i386) and x86-64 (amd64) cannot work directly. Rather, the applications would need to be compiled from source.

For the FMRIB Software Library (FSL), the source code can be downloaded and compiled. Much simpler than that, however, is to take a different route: FSL has been included in NeuroDebian. This alone does not seem helpful, as the packages in that repository are only for 32-bit and 64-bit PCs (the i386 and amd64 ports), and SPARC. However, these packages have made it into upstream Debian, which means they are available for all platforms to which Debian itself has been ported. This includes the ARMv7 that powers the Pi, for which the armhf port (for chips that use a hardware floating point unit) can be used.

The steps to have a working installation of FSL on the Pi 2 are described below. Other interesting software, such as FreeSurfer, would need to be compiled from the source. For SPM, there is no Matlab port at the moment, but Octave runs without problems, such that most functionalities are expected to work. Applications based on Java, such as Mango, work without problems.

Requirements

The photo at the top shows the hardware assembly used for this article. The following is required:

  • A Raspberry Pi 2 (Model B).
  • Power source (can be the USB port of another computer).
  • Micro SD card with at least 8 GB (below it is assumed 32 GB).
  • Ethernet cable and a network that provides internet access.
  • Optional: HDMI display and cable, USB keyboard, and possibly a USB mouse if a graphical system will be installed. Alternatively, a headless system also works, with access via SSH. Below it is assumed a display is connected.

The procedure

Step 1: Download the system image kindly prepared by Sjoerd Simons, and uncompress it:

wget https://images.collabora.co.uk/rpi2/jessie-rpi2-20150705.img.gz
gunzip jessie-rpi2-20150705.img.gz

This image contains only a minimal set of Debian Jessie packages. It uses the kernel 3.18.5, and received a few firmware and boot tweaks that are specific to the Pi 2.

Step 2: Use your favourite utility to transfer the image to the micro SD card. For example, using Linux, run the following, replacing /dev/sdX for the letter corresponding to your SD card (warning: this will erase all data stored in the card):

dd bs=1M if=jessie-rpi2-20150705.img of=/dev/sdX

In some systems, the card may be at /dev/mmcblk0 instead of /dev/sdX. If a Linux machine is not available, but instead a Mac or even Windows, the instructions to install Raspbian also apply.

Step 3: Insert the card in the Pi and boot the system. In this image, the default root password is debian (you can change it for something sensible).

Step 4: The main partition in this disk image (mounted as /) has only 2.6 GB, which is not enough. Also, often more than 1 GB of memory is needed, so swap space for virtual memory is necessary. Use fdisk (as root) to expand the main partition and to create a new partition for swap. Usually this is done interactively. If the card has exactly 32 GB, the line below can be used directly, bypassing the interactive mode. It will define a main partition with 24 GB, and the remaining, about 5 GB, will be left for swap. For cards of different sizes, run fdisk manually, or change the line below accordingly.

printf "d\n2\nn\np\n\n\n+24G\nn\np\n\n\n\nt\n3\n82\nw\n" | fdisk -uc /dev/mmcblk0

Note that, when seen from the Pi, the SD card is at /dev/mmcblk0, not /dev/sdX.

Reboot (shutdown -r now), then after logging in again, run:

resize2fs /dev/mmcblk0p2
mkswap /dev/mmcblk0p3
swapon /dev/mmcblk0p3

The swapon command enables the swap partition for immediate use. To make the change permanent for the next reboot, edit the file /etc/fstab adding:

/dev/mmcblk0p3 swap swap defaults 0 0

Step 5: Edit the /etc/apt/sources.list so as to include the official Debian packages (you can replace the server for your favourite/closer mirror):

deb http://ftp.uk.debian.org/debian/ jessie main contrib non-free

Step 6: Refresh the cached list of packages, then install FSL:

apt-get update
apt-get install fsl

Step 7: The installation is almost ready. The downloaded packages do not have the “data” directory of FSL, which contains the atlases and standard space images. To obtain these, do one of the following:

  • From a separate FSL installation (e.g., from a different computer), copy the contents of the ${FSLDIR}/data to the /usr/share/fsl/data of the newly installed system on the Raspberry Pi. This can be done over the network, via ssh, or after plugging in and mounting (with the correct privileges) the card in a different Linux system.
  • If another computer with FSL installed is not available, download FSL for CentOS or Mac (at the end of the downloads page, under “Advanced Users”), then uncompress the downloaded file, and copy the whole contents of the data directory to /usr/share/fsl/data of the Pi via ssh
  • If another computer is not at all accessible for this step, these files can be obtained using the Pi itself, from the command line. Logged in as root in the newly installed system, run:
cd /usr/share
wget -O- http://fsl.fmrib.ox.ac.uk/fsldownloads/fsl-5.0.9-centos6_64.tar.gz | tar xzfv - fsl/data
  • A last option is to skip this step, go to Step 9 below, then download and copy using a graphical web-browser from an installed desktop environment.

Step 8: Add this line to the file ~/.profile:

. /etc/fsl/fsl.sh

That’s it. All that is needed to run FSL from the command line has been done.

Step 9 (optional): The installation up to this step does not include a graphical user interface. To have one, install X and a desktop environment. For lightweight options, LXDE or XFCE can be considered. A screenshot of LXDE with FSLview and two terminal windows showing some system information is below (usually one would not run as root, but create a user account).

Using Raspbian

The official operating system for the Raspberry Pi, Raspbian, is a customised version of Debian, thus capable of running FSL directly. However, FSL is not in the official rpi repository. It can still be installed following similar steps as above, remembering to use sudo with commands that require root privileges (the default account is pi and the password is raspberry), and with care in the repartitioning, as the official disk image uses a different scheme. In Step 5, include the same Debian package source in the /etc/apt/sources.list file.

Overclocking

The Pi can be overclocked. Conservative, stable settings, that do not void the warranty, consist of increasing the CPU frequency to 1000 MHz (from the default 900), the GPU and SDRAM frequencies to 500 MHz (the defaults are 250 and 450 respectively), and the CPU/GPU voltage by 2 steps, i.e., by 2 × 25 mV, from 1.20 to 1.25 V. The overclock settings are adjusted in the file /boot/firmware/config.txt if using Debian (following the steps above), or in /boot/config.txt if using Raspbian:

arm_freq=1000
gpu_freq=500
sdram_freq=500
over_voltage=2

These settings cause the bogomips to jump from 38.40 to 64.00. The temperature of the onboard chips can increase, however, and a suggestion is to use heatsinks or fans, which are inexpensive and can be purchased online (fans would be powered by GPIO pins).

Performance

With the system up and running, it is time for some benchmarks. Although the assembly is exciting and the general system speed respectable, processing using FEEDS unfortunately suggests poor performance. The table below compares the timings of the Pi 2 with default versus overclocked settings, relative to a minimal install of Debian Jessie on a notebook with an Intel Core i5 processor and 8 GB of RAM.

Test (time relative to the Core i5)  Default settings  Overclocked settings
PRELUDE & FUGUE  6.0  4.8
SUSAN  15.1  11.9
SIENAX  13.3  10.4
BET2  12.3  9.4
FEAT  12.1  9.6
MELODIC  15.9  12.2
FIRST  14.0  11.1
FDT  7.4  5.9
FNIRT  26.6  19.3
Total time  12.2  9.5

Running the whole FEEDS took 22 minutes in the Intel Core i5, whereas in the Pi 2 it took 4h29min with the default settings, and 3h30min after overclocking. It should be noted, however, that the 1 GB of RAM is not sufficient to run the test without using virtual memory (swapping). This needs to be taken into account when evaluating the table above. The SD card used for the tests is a Class 10, which is not as fast as actual RAM (faster cards would have their performance curtailed by hardware limits).

The performance of Debian and Raspbian on the Pi 2 is nearly identical. Running in graphical mode (at least with LXDE) or in a console-only system does not seem to impact the results, at least as long as only one instance of FEEDS is running.

Conclusion

It is possible to run FSL on the Raspberry Pi 2, and the procedure is not too different than doing the same in an ordinary computer. The performance, however, suggests that the current model, being about ten times slower, may not be a competitive choice for brain imaging.

Permutation tests in the Human Connectome Project

Permutation tests are known to be superior to parametric tests: they are based on only a few assumptions, essentially that the data are exchangeable, and they allow correction for the multiplicity of tests and the use of various non-standard statistics. The exchangeability assumption allows data to be permuted whenever their joint distribution remains unaltered. Usually this means that each observation needs to be independent from the others.

In many studies, however, there are repeated measurements on the same subjects, which violates exchangeability: clearly, the various measurements obtained from a given subject are not independent from each other. In the Human Connectome Project (HCP) (Van Essen et al, 2012; 2013; see references at the end), subjects are sampled along with their siblings (most of them are twins), such that independence cannot be guaranteed either.

In Winkler et al. (2014), certain structured types of non-independence in brain imaging were addressed through the definition of exchangeability blocks (EBs). Observations within an EB can be shuffled freely or, alternatively, the EBs themselves can be shuffled as a whole. This allows various designs that otherwise could not be assessed through permutations.

The same idea can be generalised for blocks that are nested within other blocks, in a multi-level fashion. In the paper Multi-level Block Permutation (Winkler et al., 2015) we presented a method that allows blocks to be shuffled as a whole and, inside them, sub-blocks to be further shuffled, in a recursive process. The method is flexible enough to accommodate permutations, sign-flippings (sometimes also called “wild bootstrap”), and permutations together with sign-flippings.

In particular, this permutation scheme allows the data of the HCP to be analysed via permutations: subjects are allowed to be shuffled with their siblings while the joint distribution within each sibship is maintained. Then each sibship is allowed to be shuffled with others of the same type.

In the paper we show that the type I error is controlled at the nominal level, and that the power is just marginally smaller than what would be obtained if free permutation were allowed. The more complex the block structure, the larger the reductions in power, although with large sample sizes the difference is barely noticeable.

Importantly, simply ignoring the family structure in designs such as this causes the error rates not to be controlled, with excess false positives and invalid results. We show in the paper examples of false positives that can arise, even after correction for multiple testing, when testing associations between cortical thickness, cortical area, and measures of body size such as height, weight, and body-mass index, all of them highly heritable. Such false positives can be avoided with permutation tests that respect the family structure.

The figure at the top shows how the subjects of the HCP (terminal dots, shown in white colour) can be shuffled or not, while respecting the family structure. Blue dots indicate branches that can be permuted, whereas red dots indicate branches that cannot (see the main paper for details). This diagram includes 232 subjects of an early public release of HCP data. The tree on the left considers dizygotic twins as a category on their own, i.e., that cannot be shuffled with ordinary siblings, whereas the tree on the right considers dizygotic twins as ordinary siblings.

The first applied study using our strategy has just appeared. The method is implemented in the freely available package PALM — Permutation Analysis of Linear Models, and a set of practical steps to use it with actual HCP data is available here.

References

Another look into Pillai’s trace

In a previous post, the univariate and multivariate test statistics commonly used with the general linear model (GLM) were presented. Here, an alternative formulation of one of these statistics, Pillai’s trace (Pillai, 1955; references at the end), commonly used in MANOVA and MANCOVA, is presented.

We begin with a multivariate general linear model expressed as:

\mathbf{Y} = \mathbf{M} \boldsymbol{\psi} + \boldsymbol{\epsilon}

where \mathbf{Y} is the N \times K full-rank matrix of observed data, with N observations of K distinct (possibly non-independent) variables, \mathbf{M} is the full-rank N \times R design matrix that includes the explanatory variables (i.e., effects of interest and possibly nuisance effects), \boldsymbol{\psi} is the R \times K matrix of regression coefficients, and \boldsymbol{\epsilon} is the N \times K matrix of random errors. Estimates for the regression coefficients can be computed as \boldsymbol{\hat{\psi}} = \mathbf{M}^{+}\mathbf{Y}, where the superscript (^{+}) denotes a pseudo-inverse.
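
As a concrete illustration of these definitions, here is a minimal numpy sketch with arbitrary sizes (all variable names are ours):

    import numpy as np

    # Toy sizes only; all names here are arbitrary.
    rng = np.random.default_rng(1)
    N, R, K = 20, 3, 4                               # observations, regressors, response variables
    M = np.column_stack([np.ones(N), rng.standard_normal((N, R - 1))])  # design matrix M
    psi = rng.standard_normal((R, K))                # true coefficients \psi
    Y = M @ psi + 0.5 * rng.standard_normal((N, K))  # observed data Y

    psi_hat = np.linalg.pinv(M) @ Y                  # \hat{\psi} = M^{+} Y
    eps_hat = Y - M @ psi_hat                        # residuals \hat{\epsilon}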

The null hypothesis, and a simplification

One is generally interested in testing the null hypothesis that a contrast of regression coefficients is equal to zero, i.e., \mathcal{H}_{0} : \mathbf{C}'\boldsymbol{\psi}\mathbf{D} = \boldsymbol{0}, where \mathbf{C} is an R \times S full-rank matrix of S contrasts of coefficients on the regressors encoded in \mathbf{M}, 1 \leqslant S \leqslant R, and \mathbf{D} is a K \times Q full-rank matrix of Q contrasts of coefficients on the dependent, response variables in \mathbf{Y}, 1 \leqslant Q \leqslant K; if K=1 or Q=1, the model is univariate. Once the hypothesis has been established, \mathbf{Y} can be equivalently redefined as \mathbf{Y}\mathbf{D}, such that the contrast \mathbf{D} can be omitted for simplicity, and the null hypothesis stated as \mathcal{H}_{0} : \mathbf{C}'\boldsymbol{\psi} = \boldsymbol{0}.

Model partitioning

It is useful to consider a transformation of the model into a partitioned one:

\mathbf{Y} = \mathbf{X}\boldsymbol{\beta} + \mathbf{Z}\boldsymbol{\gamma} + \boldsymbol{\epsilon}

where \mathbf{X} is the matrix with the regressors of interest, \mathbf{Z} is the matrix with the nuisance regressors, and \boldsymbol{\beta} and \boldsymbol{\gamma} are the respective matrices of regression coefficients. From this model we can also define the projection (hat) matrices \mathbf{H}_{\mathbf{X}}=\mathbf{X}\mathbf{X}^{+} and \mathbf{H}_{\mathbf{Z}}=\mathbf{Z}\mathbf{Z}^{+} due to the regressors of interest and nuisance, respectively, and the residual-forming matrices \mathbf{R}_{\mathbf{X}}=\mathbf{I}-\mathbf{H}_{\mathbf{X}} and \mathbf{R}_{\mathbf{Z}}=\mathbf{I}-\mathbf{H}_{\mathbf{Z}}.

Such partitioning is not unique, and schemes can be as simple as separating the columns of \mathbf{M} into \left[ \mathbf{X} \; \mathbf{Z} \right], with \boldsymbol{\psi} = \left[ \boldsymbol{\beta}' \; \boldsymbol{\gamma}' \right]'. More involved strategies can, however, be devised to obtain some practical benefits. One such partitioning is to define \mathbf{X} = \mathbf{M} \mathbf{Q} \mathbf{C} \left(\mathbf{C}'\mathbf{Q}\mathbf{C}\right)^{-1} and \mathbf{Z} = \mathbf{M} \mathbf{Q} \mathbf{C}_v \left(\mathbf{C}_v'\mathbf{Q}\mathbf{C}_v\right)^{-1}, where \mathbf{Q}=(\mathbf{M}'\mathbf{M})^{-1}, \mathbf{C}_v=\mathbf{C}_u-\mathbf{C}(\mathbf{C}'\mathbf{Q}\mathbf{C})^{-1}\mathbf{C}'\mathbf{Q}\mathbf{C}_u, and \mathbf{C}_u has R-\mathsf{rank}\left(\mathbf{C}\right) columns that span the null space of \mathbf{C}', such that [\mathbf{C} \; \mathbf{C}_u] is an R \times R invertible, full-rank matrix (Smith et al., 2007). This partitioning has a number of useful features: \boldsymbol{\hat{\beta}} = \mathbf{C}'\boldsymbol{\hat{\psi}} and \widehat{\mathsf{Cov}}(\boldsymbol{\hat{\beta}}) = \mathbf{C}'\widehat{\mathsf{Cov}}(\boldsymbol{\hat{\psi}})\mathbf{C}, i.e., estimates and variances of \boldsymbol{\beta} for inference on the partitioned model correspond exactly to those for the same inference on the original model; \mathbf{X} is orthogonal to \mathbf{Z}; and \mathsf{span}([\mathbf{X} \; \mathbf{Z}]) = \mathsf{span}(\mathbf{M}), i.e., the partitioned model spans the same space as the original.
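
A minimal numpy/scipy sketch of this partitioning, following the formulas above (the function name is ours, and \mathbf{C}_u is obtained here from an orthonormal basis of the null space of \mathbf{C}'):

    import numpy as np
    from scipy.linalg import null_space

    def partition_smith(M, C):
        # Partition the design M into effects of interest X and nuisance Z
        # for the contrast C, following the formulas above (Smith et al., 2007).
        Q = np.linalg.inv(M.T @ M)                     # Q = (M'M)^{-1}; M assumed full rank
        Cu = null_space(C.T)                           # R - rank(C) columns spanning null(C')
        Cv = Cu - C @ np.linalg.solve(C.T @ Q @ C, C.T @ Q @ Cu)
        X = M @ Q @ C @ np.linalg.inv(C.T @ Q @ C)
        Z = M @ Q @ Cv @ np.linalg.inv(Cv.T @ Q @ Cv)
        return X, Z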

Another partitioning scheme, derived by Ridgway (2009), defines \mathbf{X}=\mathbf{M}(\mathbf{C}^+)' and \mathbf{Z}=\mathbf{M}-\mathbf{M}\mathbf{C}\mathbf{C}^{+}. As with the previous strategy, the parameters of interest in the partitioned model are equal to the contrast of the original parameters. A full column rank nuisance partition can be obtained from the singular value decomposition (SVD) of \mathbf{Z}, which will also provide orthonormal columns for the nuisance partition. Orthogonality between regressors of interest and nuisance can be obtained by redefining the regressors of interest as \mathbf{R}_{\mathbf{Z}}\mathbf{X}.
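
A corresponding sketch for this second scheme (again, the function name is ours, and the tolerance used to discard null singular values is arbitrary):

    import numpy as np

    def partition_ridgway(M, C):
        # Alternative partitioning: X = M*pinv(C)', Z = M - M*C*pinv(C); the nuisance is then
        # reduced to a full column rank, orthonormal basis via SVD, and X is orthogonalised
        # with respect to Z.
        X = M @ np.linalg.pinv(C).T
        Z = M - M @ C @ np.linalg.pinv(C)
        U, s, _ = np.linalg.svd(Z, full_matrices=False)
        Z = U[:, s > s.max() * max(Z.shape) * np.finfo(float).eps]  # orthonormal nuisance basis
        Rz = np.eye(M.shape[0]) - Z @ Z.T              # residual-forming matrix (Z orthonormal)
        X = Rz @ X                                     # regressors of interest, orthogonal to Z
        return X, Z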

The usual multivariate statistics

For the multivariate statistics, define generically:

\mathbf{H}=(\mathbf{C}'\boldsymbol{\hat{\psi}}\mathbf{D})' \left(\mathbf{C}'(\mathbf{M}'\mathbf{M})^{-1}\mathbf{C} \right)^{-1} (\mathbf{C}'\boldsymbol{\hat{\psi}}\mathbf{D})

as the sums of products explained by the model (hypothesis) and:

\mathbf{E} = (\boldsymbol{\hat{\epsilon}}\mathbf{D})'(\boldsymbol{\hat{\epsilon}}\mathbf{D})

as the sums of products of the residuals, i.e., the part that remains unexplained. With the simplification to the original model that redefined \mathbf{Y} as \mathbf{Y}\mathbf{D}, the \mathbf{D} can be dropped, so that we have \mathbf{H}=(\mathbf{C}'\boldsymbol{\hat{\psi}})' \left(\mathbf{C}'(\mathbf{M}'\mathbf{M})^{-1}\mathbf{C} \right)^{-1} (\mathbf{C}'\boldsymbol{\hat{\psi}}) and \mathbf{E} = \boldsymbol{\hat{\epsilon}}'\boldsymbol{\hat{\epsilon}}. The various well-known multivariate statistics (see this earlier blog entry) can be written as functions of \mathbf{H} and \mathbf{E}. Pillai’s trace is:

T=\mathsf{trace}\left(\mathbf{H}(\mathbf{H}+\mathbf{E})^{-1}\right)
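
In code, this classical form can be computed directly from the definitions above (a sketch; the function name is ours, and \mathbf{Y} is assumed to have already been post-multiplied by \mathbf{D} where applicable):

    import numpy as np

    def pillai_classical(Y, M, C):
        # Pillai's trace from the H and E matrices defined above.
        psi_hat = np.linalg.pinv(M) @ Y                # \hat{\psi} = M^{+} Y
        eps_hat = Y - M @ psi_hat                      # residuals \hat{\epsilon}
        Cpsi = C.T @ psi_hat                           # C'\hat{\psi}
        H = Cpsi.T @ np.linalg.solve(C.T @ np.linalg.inv(M.T @ M) @ C, Cpsi)
        E = eps_hat.T @ eps_hat
        return np.trace(H @ np.linalg.inv(H + E))      # T = trace(H (H+E)^{-1})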

More simplifications

With the partitioning, other simplifications are possible:

\mathbf{H}=\boldsymbol{\hat{\beta}}' (\mathbf{X}'\mathbf{X})\boldsymbol{\hat{\beta}} = ( \mathbf{X}\boldsymbol{\hat{\beta}})'(\mathbf{X}\boldsymbol{\hat{\beta}})

Recalling that \mathbf{X}'\mathbf{Z}=\mathbf{0}, and defining \tilde{\mathbf{Y}}=\mathbf{R}_{\mathbf{Z}}\mathbf{Y}, we have:

\mathbf{H} = (\mathbf{H}_{\mathbf{X}}\tilde{\mathbf{Y}})'(\mathbf{H}_{\mathbf{X}}\tilde{\mathbf{Y}}) = \tilde{\mathbf{Y}}'\mathbf{H}_{\mathbf{X}}\tilde{\mathbf{Y}}

The unexplained sums of products can be written in a similar manner:

\mathbf{E} = (\mathbf{R}_{\mathbf{X}}\tilde{\mathbf{Y}})'(\mathbf{R}_{\mathbf{X}}\tilde{\mathbf{Y}}) = \tilde{\mathbf{Y}}'\mathbf{R}_{\mathbf{X}}\tilde{\mathbf{Y}}

The term \mathbf{H}+\mathbf{E} in the Pillai’s trace can therefore be rewritten as:

\mathbf{H}+\mathbf{E}= \tilde{\mathbf{Y}}'(\mathbf{H}_{\mathbf{X}} + \mathbf{R}_{\mathbf{X}})\tilde{\mathbf{Y}} = \tilde{\mathbf{Y}}'\tilde{\mathbf{Y}}

Using the property that the trace of a product is invariant to a circular permutation of the factors, Pillai’s statistic can then be written as:

\begin{array}{ccl} T&=&\mathsf{trace}\left(\tilde{\mathbf{Y}}'\mathbf{H}_{\mathbf{X}}\tilde{\mathbf{Y}}\left(\tilde{\mathbf{Y}}'\tilde{\mathbf{Y}}\right)^{+}\right)\\ {}&=&\mathsf{trace}\left(\mathbf{H}_{\mathbf{X}}\tilde{\mathbf{Y}}\left(\tilde{\mathbf{Y}}'\tilde{\mathbf{Y}}\right)^{+}\tilde{\mathbf{Y}}'\right)\\ {}&=&\mathsf{trace}\left(\mathbf{H}_{\mathbf{X}}\tilde{\mathbf{Y}}\tilde{\mathbf{Y}}^{+}\right)\\ \end{array}

The final, alternative form

Using the singular value decomposition we have \tilde{\mathbf{Y}} = \mathbf{U}\mathbf{S}\mathbf{V}' and \tilde{\mathbf{Y}}^{+} = \mathbf{V}\mathbf{S}^{+}\mathbf{U}', where \mathbf{U} contains only the columns that correspond to non-zero singular values. Thus, the above can be rewritten as:

\begin{array}{ccl} T&=&\mathsf{trace}\left(\mathbf{H}_{\mathbf{X}} \mathbf{U}\mathbf{S}\mathbf{V}' \mathbf{V}\mathbf{S}^{+}\mathbf{U}' \right)\\ {}&=&\mathsf{trace}\left(\mathbf{H}_{\mathbf{X}} \mathbf{U}\mathbf{U}' \right)\\ \end{array}

The SVD form is useful in languages or libraries that offer a fast SVD implementation. Otherwise, using a pseudoinverse yields the same result, perhaps only slightly more slowly; in that case, T=\mathsf{trace}\left(\mathbf{H}_{\mathbf{X}}\tilde{\mathbf{Y}}\tilde{\mathbf{Y}}^{+}\right).
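
As a sanity check, the sketch below computes the classical form and the two simplified forms on toy data; the three values should agree up to numerical precision. The partitioning of Ridgway (2009) is used here only for brevity, and all names and sizes are ours.

    import numpy as np

    # Toy data (arbitrary sizes).
    rng = np.random.default_rng(42)
    N, R, K, S = 30, 4, 3, 2
    M = np.column_stack([np.ones(N), rng.standard_normal((N, R - 1))])  # design matrix
    Y = rng.standard_normal((N, K))                                     # responses
    C = np.vstack([np.eye(S), np.zeros((R - S, S))])                    # contrast: first S regressors

    # Classical form, T = trace(H (H+E)^{-1}), as in the pillai_classical sketch above.
    psi_hat = np.linalg.pinv(M) @ Y
    eps_hat = Y - M @ psi_hat
    H = (C.T @ psi_hat).T @ np.linalg.solve(C.T @ np.linalg.inv(M.T @ M) @ C, C.T @ psi_hat)
    E = eps_hat.T @ eps_hat
    T1 = np.trace(H @ np.linalg.inv(H + E))

    # Simplified forms, using the partitioning of Ridgway (2009).
    Z = M - M @ C @ np.linalg.pinv(C)                  # nuisance partition
    Rz = np.eye(N) - Z @ np.linalg.pinv(Z)             # residual-forming matrix R_Z
    X = Rz @ (M @ np.linalg.pinv(C).T)                 # regressors of interest, orthogonal to Z
    Hx = X @ np.linalg.pinv(X)                         # hat matrix H_X
    Yt = Rz @ Y                                        # \tilde{Y} = R_Z Y

    T2 = np.trace(Hx @ Yt @ np.linalg.pinv(Yt))        # trace(H_X \tilde{Y} \tilde{Y}^{+})
    U, s, _ = np.linalg.svd(Yt, full_matrices=False)
    U = U[:, s > 1e-10 * s.max()]                      # keep columns with non-zero singular values
    T3 = np.trace(Hx @ U @ U.T)                        # trace(H_X U U')

    print(T1, T2, T3)                                  # the three values should coincide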

Importance

If we define \mathbf{A}\equiv\mathbf{H}_{\mathbf{X}} and \mathbf{W}\equiv\mathbf{U}\mathbf{U}' (or \mathbf{W}\equiv\tilde{\mathbf{Y}}\tilde{\mathbf{Y}}^{+}), then T=\mathsf{trace}\left(\mathbf{A}\mathbf{W}\right). The first three moments of the permutation distribution of statistics that can be written in this form can be computed analytically once \mathbf{A} and \mathbf{W} are known. With these three moments, a gamma distribution (Pearson type III) can be fitted, allowing p-values to be computed without resorting to the usual beta approximation to Pillai’s trace, and without actually running permutations, yet with results that do not depend on the assumption of normality (Mardia, 1971; Kazi-Aoual et al., 1995; Minas and Montana, 2014).
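
The moment formulas themselves are given in the references above and are not reproduced here; the sketch below illustrates only the final step, i.e., fitting a Pearson type III (shifted gamma) distribution to a given set of first three moments and computing an upper-tail p-value. It assumes positive skewness, and the function name is ours.

    import numpy as np
    from scipy import stats

    def gamma_tail_p(t_obs, mean, var, skew):
        # Given the first three moments of the null distribution of T = trace(AW)
        # (computed, e.g., with the formulas in the references cited above), fit a
        # Pearson type III (shifted gamma) and return the upper-tail p-value at t_obs.
        if skew <= 0:
            raise ValueError("this simple sketch assumes positive skewness")
        shape = 4.0 / skew**2                  # matches the skewness, 2/sqrt(shape)
        scale = np.sqrt(var) * skew / 2.0      # matches the variance, shape*scale^2
        loc = mean - shape * scale             # matches the mean, loc + shape*scale
        return stats.gamma.sf(t_obs, a=shape, loc=loc, scale=scale)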

Availability

This simplification is available in PALM, for use with imaging and non-imaging data, using Pillai’s trace itself, or a modification that allows inference on univariate statistics. As of today, this option is not yet documented, but should become openly available soon.

References

Update: 20.Jan.2016: A slight simplification was applied to the formulas above so as to make them more elegant and remove some redundancy. The result is the same.