Simplifying Freedman-Lane

Doing a permutation test with the general linear model (GLM) in the presence of nuisance variables can be challenging. Let the model be:

\mathbf{Y} = \mathbf{X}\boldsymbol{\beta} + \mathbf{Z}\boldsymbol{\gamma} + \boldsymbol{\epsilon}

where \mathbf{Y} is a matrix of observed variables, \mathbf{X} is a matrix of predictors of interest, \mathbf{Z} is a matrix of covariates (of no interest), and \boldsymbol{\epsilon} is a matrix of the same size as \mathbf{Y} with the residuals.

Because the interest is in testing the relationship between \mathbf{Y} and \mathbf{X}, in principle it is these that would need to be permuted, but doing so also breaks the relationship with \mathbf{Z}, which would be undesirable. Over the years, many methods have been proposed. A review can be found in Winkler et al. (2014); earlier work includes the papers by Anderson and Legendre (1999) and Anderson and Robinson (2001).

One of these various methods is the one published in Freedman and Lane (1983), which consists of permuting data that have been residualised with respect to the covariates, adding the estimated covariate effects back, and then fitting the full model again. The procedure can be performed through the following steps:

  1. Regress \mathbf{Y} against the full model that contains both the effects of interest and the nuisance variables, i.e., \mathbf{Y} = \mathbf{X}\boldsymbol{\beta} + \mathbf{Z}\boldsymbol{\gamma} + \boldsymbol{\epsilon}. Use the estimated parameters \boldsymbol{\hat{\beta}} to compute the statistic of interest, and call this statistic T_{0}.
  2. Regress \mathbf{Y} against a reduced model that contains only the covariates, i.e. \mathbf{Y} = \mathbf{Z}\boldsymbol{\gamma} + \boldsymbol{\epsilon}_{\mathbf{Z}}, obtaining estimated parameters \boldsymbol{\hat{\gamma}} and estimated residuals \boldsymbol{\hat{\epsilon}}_{\mathbf{Z}}.
  3. Compute a set of permuted data \mathbf{Y}^{*}_{j}. This is done by pre-multiplying the residuals from the reduced model produced in the previous step, \boldsymbol{\hat{\epsilon}}_{\mathbf{Z}}, by a permutation matrix, \mathbf{P}_{j}, then adding back the estimated nuisance effects, i.e. \mathbf{Y}^{*}_{j} = \mathbf{P}_{j}\boldsymbol{\hat{\epsilon}}_{\mathbf{Z}} + \mathbf{Z}\boldsymbol{\hat{\gamma}}.
  4. Regress the permuted data \mathbf{Y}^{*}_{j} against the full model, i.e. \mathbf{Y}^{*}_{j} = \mathbf{X}\boldsymbol{\beta} + \mathbf{Z}\boldsymbol{\gamma} + \boldsymbol{\epsilon}.
  5. Use the estimated \boldsymbol{\hat{\beta}}^{*}_{j} to compute the statistic of interest. Call this statistic T^{*}_{j}.
  6. Repeat Steps 3-5 many times to build the reference distribution of T^{*} under the null hypothesis of no association between \mathbf{Y} and \mathbf{X}.
  7. Count how many times T^{*}_{j} was found to be equal to or larger than T_{0}, and divide the count by the number of permutations; the result is the p-value.
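
For concreteness, the steps above can be sketched as follows in Python with NumPy; the simulated data, the single regressor of interest, and the use of a t statistic are illustrative assumptions, not part of the method itself.

```python
import numpy as np

rng = np.random.default_rng(0)
N, J = 60, 1000                                          # observations and permutations
X = rng.normal(size=(N, 1))                              # regressor of interest
Z = np.column_stack([np.ones(N), rng.normal(size=N)])    # intercept + covariate
Y = Z @ np.array([2.0, 0.5]) + rng.normal(size=N)        # data with no true effect of X
M = np.column_stack([X, Z])                              # full model [X Z]
c = np.array([1.0, 0.0, 0.0])                            # contrast: effect of X only

def tstat(y, M, c):
    """t statistic for contrast c in the GLM y = M*psi + epsilon."""
    psi = np.linalg.pinv(M) @ y
    res = y - M @ psi
    df  = len(y) - np.linalg.matrix_rank(M)
    return (c @ psi) / np.sqrt((res @ res) / df * (c @ np.linalg.pinv(M.T @ M) @ c))

# Step 1: fit the full model to the observed data
T0 = tstat(Y, M, c)

# Step 2: fit the reduced model, which contains only the covariates
Hz = Z @ np.linalg.pinv(Z)                               # hat matrix of Z
Rz = np.eye(N) - Hz                                      # residual-forming matrix of Z
gammahat = np.linalg.pinv(Z) @ Y                         # estimated nuisance effects
ez = Rz @ Y                                              # residuals of the reduced model

# Steps 3-5, repeated many times (Step 6)
Tstar = np.empty(J)
for j in range(J):
    Ystar = ez[rng.permutation(N)] + Z @ gammahat        # permute residuals, add nuisance back
    Tstar[j] = tstat(Ystar, M, c)

# Step 7: p-value
print(T0, np.mean(Tstar >= T0))
```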

Steps 1-4 can be written concisely as:

\left(\mathbf{P}_{j}\mathbf{R}_{\mathbf{Z}}+\mathbf{H}_{\mathbf{Z}}\right) \mathbf{Y} = \mathbf{X}\boldsymbol{\beta} + \mathbf{Z}\boldsymbol{\gamma}+\boldsymbol{\epsilon}

where \mathbf{P}_{j} is a permutation matrix (for the j-th permutation), \mathbf{H}_{\mathbf{Z}}=\mathbf{Z}\mathbf{Z}^{+} is the hat matrix due to the covariates, and \mathbf{R}_{\mathbf{Z}} = \mathbf{I} - \mathbf{H}_{\mathbf{Z}} is the residual-forming matrix; the superscript symbol ^{+} represents a matrix pseudo-inverse.

On page 385 of Winkler et al. (2014), my colleagues and I state that:

[…] add the nuisance variables back in Step 3 is not strictly necessary, and the model can be expressed simply as \mathbf{P}_{j}\mathbf{R}_{\mathbf{Z}}\mathbf{Y}=\mathbf{X}\boldsymbol{\beta}+\mathbf{Z}\boldsymbol{\gamma}+\boldsymbol{\epsilon}, implying that the permutations can actually be performed just by permuting the rows of the residual-forming matrix \mathbf{R}_{\mathbf{Z}}.

However, in the paper we do not offer any proof of this important result, which allows for algorithmic acceleration. Here we remedy that. Let’s start with two brief lemmata:

Lemma 1: The product of a hat matrix and its corresponding residual-forming matrix is zero, that is, \mathbf{R}_{\mathbf{Z}}\mathbf{H}_{\mathbf{Z}} = \mathbf{H}_{\mathbf{Z}}\mathbf{R}_{\mathbf{Z}} = \mathbf{0}.

This is because \mathbf{R}_{\mathbf{Z}} = \mathbf{I} - \mathbf{H}_{\mathbf{Z}}, hence \mathbf{H}_{\mathbf{Z}} = \mathbf{I} - \mathbf{R}_{\mathbf{Z}}, and so \mathbf{R}_{\mathbf{Z}}\mathbf{H}_{\mathbf{Z}} = \mathbf{R}_{\mathbf{Z}}(\mathbf{I} - \mathbf{R}_{\mathbf{Z}}) = \mathbf{R}_{\mathbf{Z}} - \mathbf{R}_{\mathbf{Z}}\mathbf{R}_{\mathbf{Z}} = \mathbf{R}_{\mathbf{Z}} - \mathbf{R}_{\mathbf{Z}} = \mathbf{0}, since \mathbf{R}_{\mathbf{Z}} is idempotent; the same argument holds for \mathbf{H}_{\mathbf{Z}}\mathbf{R}_{\mathbf{Z}}.

Lemma 2 (Frisch–Waugh–Lovell theorem): Given a GLM expressed as \mathbf{Y} = \mathbf{X}\boldsymbol{\beta} + \mathbf{Z}\boldsymbol{\gamma} + \boldsymbol{\epsilon}, we can estimate \boldsymbol{\beta} from an equivalent GLM written as \mathbf{R}_{\mathbf{Z}}\mathbf{Y} = \mathbf{R}_{\mathbf{Z}}\mathbf{X}\boldsymbol{\beta} + \mathbf{R}_{\mathbf{Z}}\boldsymbol{\epsilon}.

To see why, remember that multiplying both sides of an equation by the same factor does not change the equality; moreover, the least-squares estimate of \boldsymbol{\beta} is unaffected by this particular transformation, which is the content of the theorem. Let’s start from:

\mathbf{R}_{\mathbf{Z}}\mathbf{Y} = \mathbf{R}_{\mathbf{Z}}(\mathbf{X}\boldsymbol{\beta} + \mathbf{Z}\boldsymbol{\gamma} + \boldsymbol{\epsilon})

Then remove the parentheses:

\mathbf{R}_{\mathbf{Z}}\mathbf{Y} = \mathbf{R}_{\mathbf{Z}}\mathbf{X}\boldsymbol{\beta} + \mathbf{R}_{\mathbf{Z}}\mathbf{Z}\boldsymbol{\gamma} + \mathbf{R}_{\mathbf{Z}}\boldsymbol{\epsilon}

Since \mathbf{R}_{\mathbf{Z}} = \mathbf{I} - \mathbf{H}_{\mathbf{Z}}:

\mathbf{R}_{\mathbf{Z}}\mathbf{Y} = \mathbf{R}_{\mathbf{Z}}\mathbf{X}\boldsymbol{\beta} + (\mathbf{I}-\mathbf{H}_{\mathbf{Z}})\mathbf{Z}\boldsymbol{\gamma} + \mathbf{R}_{\mathbf{Z}}\boldsymbol{\epsilon}

and that \mathbf{H}_{\mathbf{Z}} = \mathbf{Z}\mathbf{Z}^{+}:

\mathbf{R}_{\mathbf{Z}}\mathbf{Y} = \mathbf{R}_{\mathbf{Z}}\mathbf{X}\boldsymbol{\beta} + (\mathbf{Z}-\mathbf{Z}\mathbf{Z}^{+}\mathbf{Z})\boldsymbol{\gamma} + \mathbf{R}_{\mathbf{Z}}\boldsymbol{\epsilon}

Since \mathbf{Z}\mathbf{Z}^{+}\mathbf{Z}=\mathbf{Z} (a defining property of the pseudo-inverse), the term in parentheses vanishes:

\mathbf{R}_{\mathbf{Z}}\mathbf{Y} = \mathbf{R}_{\mathbf{Z}}\mathbf{X}\boldsymbol{\beta} + \mathbf{0}\boldsymbol{\gamma} + \mathbf{R}_{\mathbf{Z}}\boldsymbol{\epsilon}

\mathbf{R}_{\mathbf{Z}}\mathbf{Y} = \mathbf{R}_{\mathbf{Z}}\mathbf{X}\boldsymbol{\beta} + \boldsymbol{\epsilon}_{\mathbf{Z}}

where \boldsymbol{\epsilon}_{\mathbf{Z}}=  \mathbf{R}_{\mathbf{Z}}\boldsymbol{\epsilon}.

Main result

Now we are ready for the main result. The Freedman-Lane model is:

\left(\mathbf{P}_{j}\mathbf{R}_{\mathbf{Z}}+\mathbf{H}_{\mathbf{Z}}\right) \mathbf{Y} = \mathbf{X}\boldsymbol{\beta} + \mathbf{Z}\boldsymbol{\gamma}+\boldsymbol{\epsilon}

Per Lemma 2, it can be rewritten as:

\mathbf{R}_{\mathbf{Z}}\left(\mathbf{P}_{j}\mathbf{R}_{\mathbf{Z}}+\mathbf{H}_{\mathbf{Z}}\right) \mathbf{Y} = \mathbf{R}_{\mathbf{Z}}\mathbf{X}\boldsymbol{\beta}+\boldsymbol{\epsilon}_{\mathbf{Z}}

Dropping the parentheses:

\mathbf{R}_{\mathbf{Z}}\mathbf{P}_{j}\mathbf{R}_{\mathbf{Z}}\mathbf{Y} +\mathbf{R}_{\mathbf{Z}}\mathbf{H}_{\mathbf{Z}} \mathbf{Y} = \mathbf{R}_{\mathbf{Z}}\mathbf{X}\boldsymbol{\beta}+\boldsymbol{\epsilon}_{\mathbf{Z}}

Per Lemma 1:

\mathbf{R}_{\mathbf{Z}}\mathbf{P}_{j}\mathbf{R}_{\mathbf{Z}}\mathbf{Y} + \mathbf{0}\mathbf{Y} = \mathbf{R}_{\mathbf{Z}}\mathbf{X}\boldsymbol{\beta}+\boldsymbol{\epsilon}_{\mathbf{Z}}

\mathbf{R}_{\mathbf{Z}}\mathbf{P}_{j}\mathbf{R}_{\mathbf{Z}}\mathbf{Y}= \mathbf{R}_{\mathbf{Z}}\mathbf{X}\boldsymbol{\beta}+\boldsymbol{\epsilon}_{\mathbf{Z}}

What is left has the same form as the result of Lemma 2. Thus, reversing it, we obtain the final result:

\mathbf{P}_{j}\mathbf{R}_{\mathbf{Z}} \mathbf{Y} = \mathbf{X}\boldsymbol{\beta} + \mathbf{Z}\boldsymbol{\gamma} +\boldsymbol{\epsilon}

Hence, the hat matrix \mathbf{H}_{\mathbf{Z}} cancels out, meaning that adding the estimated nuisance effects back is not necessary: the results are the same either way.
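
A quick numerical check of this result, in Python with NumPy (simulated data and a t statistic are, again, illustrative assumptions): the statistic is identical whether or not the estimated nuisance effects are added back before refitting.

```python
import numpy as np

rng = np.random.default_rng(1)
N = 50
X = rng.normal(size=(N, 1))
Z = np.column_stack([np.ones(N), rng.normal(size=N)])
Y = rng.normal(size=N)
M = np.column_stack([X, Z])
c = np.array([1.0, 0.0, 0.0])

Hz = Z @ np.linalg.pinv(Z)                    # hat matrix of Z
Rz = np.eye(N) - Hz                           # residual-forming matrix of Z
P  = np.eye(N)[rng.permutation(N)]            # a permutation matrix

def tstat(y, M, c):
    psi = np.linalg.pinv(M) @ y
    res = y - M @ psi
    df  = len(y) - np.linalg.matrix_rank(M)
    return (c @ psi) / np.sqrt((res @ res) / df * (c @ np.linalg.pinv(M.T @ M) @ c))

T_with    = tstat((P @ Rz + Hz) @ Y, M, c)    # Freedman-Lane, nuisance added back
T_without = tstat(P @ Rz @ Y, M, c)           # nuisance not added back
print(np.isclose(T_with, T_without))          # True
```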

References

Updates

Non-Parametric Combination (NPC) for brain imaging

Have you ever had an analysis in which there was a large set of contrasts, all of interest, and you were worried about multiple testing? A real effect might be missed by a simple Bonferroni correction, but you did not know what else to do? Or did you have a set of different studies and wish to obtain a meta-analytic style of result, indicating whether there is evidence across all of them, without requiring the studies to be all consistently significant?

The Non-Parametric Combination (NPC) solves these issues. It is a way of performing joint inference on multiple data collected on the same experimental units (e.g., same subjects), all with minimal assumptions. The method was proposed originally by Pesarin (1990, 1992) [see references below], independently by Blair and Karniski (1993), and described extensively by Pesarin and Salmaso (2010). In this blog entry, the NPC is presented in brief, with emphasis on the modifications we introduce to render it feasible for brain imaging. The complete details are in our paper that has just been published in the journal Human Brain Mapping.

NPC in a nutshell

The NPC consists of, in a first phase, testing each hypothesis separately using permutations that are performed synchronously across datasets; these tests are termed partial tests. The resulting statistics for each and every permutation are recorded, allowing an estimate of the complete empirical null distribution to be constructed for each test. In a second phase, the empirical p-values for each statistic are combined, for each permutation, into a joint statistic. Because such a combined joint statistic is produced for every permutation, an estimate of its empirical distribution function is immediately available, and so is the p-value of the joint test. A flowchart of the original algorithm is shown below, and can be compared side by side with the modified one (described further down).
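
A minimal sketch of these two phases, in Python with NumPy, is shown below; the partial tests here are simple correlations between a shared predictor and K response variables, combined with the Fisher combining function (see the next section), and the simulated data and variable names are purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
N, K, J = 40, 4, 1000                        # subjects, partial tests, permutations
x = rng.normal(size=N)                       # predictor, shared by all partial tests
Y = np.outer(x, 0.4 * np.ones(K)) + rng.normal(size=(N, K))   # K responses, weak common effect

def partial_stats(x, Y):
    """Partial test statistics: correlation of x with each column of Y."""
    xc = (x - x.mean()) / x.std()
    Yc = (Y - Y.mean(0)) / Y.std(0)
    return xc @ Yc / len(x)

# Phase 1: synchronised permutations; record the statistics of all partial tests
T = np.empty((J, K))
T[0] = partial_stats(x, Y)                   # the first "permutation" is no permutation
for j in range(1, J):
    T[j] = partial_stats(x[rng.permutation(N)], Y)

# Empirical p-values of each partial test, for each permutation
U = np.mean(T[None, :, :] >= T[:, None, :], axis=1)

# Phase 2: combine the p-values (Fisher) into a joint statistic, for each permutation
S = -2 * np.log(U).sum(axis=1)

# p-value of the joint (NPC) test
print(np.mean(S >= S[0]))
```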

A host of combining functions

The null hypothesis of the NPC is that the null hypotheses of all partial tests are true, and the alternative hypothesis is that at least one of them is false, which is the same null as that of a union-intersection test (UIT; Roy, 1953). The rejection region depends on how the combined statistic is produced. Various combining functions, which produce such combined statistics, can be considered; some of the best known are listed below:

  • Tippett: statistic T = \min \left(p_{k}\right); p-value 1-\left(1-T\right)^{K}.
  • Fisher: statistic T = -2 \sum_{k=1}^{K} \ln\left(p_{k}\right); p-value 1-\chi^{2}\left(T;\;\nu=2K\right).
  • Stouffer: statistic T = \frac{1}{\sqrt{K}} \sum_{k=1}^{K} \Phi^{-1}\left(1-p_{k}\right); p-value 1-\Phi\left(T;\;\mu=0,\;\sigma^2=1\right).
  • Mudholkar–George: statistic T = \frac{1}{\pi}\sqrt{\frac{3(5K+4)}{K(5K+2)}}\sum_{k=1}^{K} \ln\left(\frac{1-p_{k}}{p_{k}}\right); p-value 1-t_{\text{cdf}}\left(T;\;\nu=5K+4\right).

Above, K is the number of partial tests, and the remaining variables follow the usual notation (see Table 1 in the paper for the complete description). Many of these combining functions were proposed over the years for applications such as meta-analyses; many of them assume independence between the tests being combined, and will give incorrect p-values if that assumption is not met. In NPC, however, lack of independence is not a problem, even when these same functions are used: the synchronised permutations ensure that any dependence, if present, is taken into account, and this is done implicitly, with no need for explicit modelling.
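
For reference, a sketch of these four combining functions and of their parametric p-values, in Python with SciPy, is below; within NPC the p-value of the combined statistic is obtained from the synchronised permutations, so these closed-form p-values are not strictly needed.

```python
import numpy as np
from scipy import stats

def tippett(p):
    K = len(p); T = np.min(p)
    return T, 1 - (1 - T)**K

def fisher(p):
    K = len(p); T = -2 * np.sum(np.log(p))
    return T, 1 - stats.chi2.cdf(T, df=2 * K)

def stouffer(p):
    K = len(p); T = np.sum(stats.norm.ppf(1 - p)) / np.sqrt(K)
    return T, 1 - stats.norm.cdf(T)

def mudholkar_george(p):
    K = len(p)
    T = np.sqrt(3 * (5 * K + 4) / (K * (5 * K + 2))) / np.pi * np.sum(np.log((1 - p) / p))
    return T, 1 - stats.t.cdf(T, df=5 * K + 4)

p = np.array([0.04, 0.10, 0.02, 0.30])       # p-values of K=4 partial tests (illustrative)
for f in (tippett, fisher, stouffer, mudholkar_george):
    print(f.__name__, f(p))
```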

The different combining functions lead to different rejection regions for the null hypothesis. For the four combining functions listed above, the respective rejection regions are shown in the figure below.

The combining functions can be modified to allow combination of tests so as to favour hypotheses with concordant directions, or modified for bi-directional tests; examples of these cases are shown in the figure above (again, see the paper for the complete details).

Two problems, one solution

The multiple testing problem is well known in brain imaging: as an image comprises thousands of voxels/vertices/faces, correction is necessary. Bonferroni is in general too conservative, and various other approaches have been proposed, such as the random field theory. Permutation tests provide control over the familywise error rate (FWER) for the multiple tests across space, requiring only the assumption of exchangeability. This is all well known; see Nichols and Hayasaka (2003) and Winkler et al. (2014) for details.

However, another type of multiple testing is also common: analyses that test multiple hypotheses using the same model, multiple pairwise group comparisons, multiple and distinct models, studies using multiple modalities, mixing imaging and non-imaging data, considering multiple processing pipelines, and even multiple multivariate analyses. All these common cases also need multiple testing correction. We call this multiple testing problem MTP-II, to distinguish it from the well known multiple testing problem across space, described above, which we term MTP-I.

One of the many combining functions possible with NPC, the one proposed by Tippett (1931), has a further property that makes it remarkably interesting. The Tippett function uses the smallest p-value across partial tests as its test statistic. Alternatively, if all statistics are comparable, it can be formulated in terms of the maximum statistic. It turns out that the distribution of the maximum statistic across a set of tests is also the distribution that can be used in a closed testing procedure (Marcus et al., 1976) to correct for the familywise error rate (FWER) using resampling methods, such as permutation. In the context of joint inference, FWER-correction can also be seen as a UIT. Thus, NPC offers a link between combination of multiple tests and correction for multiple tests, in both cases regardless of any dependence between such tests.

This means that the MTP-II, for which correction in the parametric realm is either non-existent or fiendishly difficult, can be accommodated easily. It requires no explicit modelling of the dependence between the tests, and the resulting error rates are controlled exactly at the test level, adding rigour to what otherwise could lead to an excess of false positives without correction, or be overly conservative if a naïve correction such as Bonferroni were attempted.
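
As an illustration of the link with the maximum statistic, FWER-corrected p-values can be obtained from the permutation distribution of the maximum across tests, as in the minimal sketch below (Python with NumPy; the array of statistics, with the unpermuted case in its first row, is an assumed input).

```python
import numpy as np

def fwer_corrected(T):
    """FWER-corrected p-values using the distribution of the maximum statistic.

    T: array of shape (J, V) with the statistics of V tests for J permutations,
       the unpermuted statistics being in row 0."""
    max_dist = T.max(axis=1)                              # maximum across tests, per permutation
    return np.mean(max_dist[:, None] >= T[0][None, :], axis=0)
```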

Modifying for imaging applications

As originally proposed, in practice NPC cannot be used in brain imaging. As the statistics for all partial tests for all permutations need to be recorded, an enormous amount of space for data storage is necessary. Even if storage space were not a problem, the discreteness of the p-values for the partial tests is problematic when correcting for multiple testing, because with thousands of tests in an image, ties are likely to occur, further causing ties among the combined statistics. If too many tests across an image share the same most extreme statistic, correction for the MTP-I, while still valid, becomes less powerful (Westfall and Young, 1993; Pantazis et al., 2005). The most obvious workaround — run an ever larger number of permutations to break the ties — may not be possible for small sample sizes, or when possible, requires correspondingly larger data storage.

The solution is loosely based on the direct combination of the test statistics, by converting the test statistics of the partial tests to values that behave as p-values, using the asymptotic distribution of the statistics for the partial tests. We call these “u-values”, in order to emphasise that they are not meant to be read or interpreted as p-values, but rather as transitional values that allow combinations that otherwise would not be possible.
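
For instance, for partial tests based on Student's t statistic, u-values could be obtained from the asymptotic distribution as in the sketch below (Python with SciPy); this merely illustrates the idea, and is not the exact implementation used in PALM.

```python
import numpy as np
from scipy import stats

def t_to_uvalues(tstats, df):
    """Convert t statistics into u-values via the asymptotic (Student's t) distribution.

    The u-values behave as p-values for the purpose of combination, but are
    not meant to be read or interpreted as p-values."""
    return stats.t.sf(np.asarray(tstats), df)             # sf = 1 - cdf (upper tail)

print(t_to_uvalues([2.1, 0.3, 3.5], df=28))
```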

For spatial statistics, the asymptotic distribution of the combined statistic is used, this time to produce a z-score, which can be subjected to the computation of cluster extent, cluster mass, and/or threshold-free cluster enhancement (TFCE; Smith and Nichols, 2009). A flowchart of the modified algorithm is shown below, and can be compared side by side with the original.

More power, fewer assumptions

One of the most remarkable features of NPC is that the synchronised permutations implicitly account for the dependence structure among the partial tests. This means that even combining methods originally derived under the assumption of independence can be used when such independence is untenable. As the p-values are assessed via permutations, distributional restrictions are likewise not necessary, liberating NPC from most assumptions that thwart parametric methods in general. This renders NPC a good alternative to classical multivariate tests, such as MANOVA, MANCOVA, and Hotelling’s T2 tests: each of the response variables can be seen as a univariate partial test in the context of the combination, but without the assumptions that are embodied in these old multivariate tests.

As if all the above were not already sufficient, NPC is also more powerful than such classical multivariate tests. This refers to its finite sample consistency property, that is, even with a fixed sample size, as the number of modalities being combined increases, the power of the test also increases. The power of classical multivariate tests, however, increases up to a certain point, then begins to decrease, eventually reaching zero when the number of variables being combined matches the sample size.

The figure below summarises the analysis of a subset of the subjects of a published fMRI study (Brooks et al., 2005) in which painful stimulation was applied to the face, hand, and foot of 12 subjects. Using separate permutation tests, no results could be identified for any of the three types of stimulation. A simple multivariate test, Hotelling’s T2, even when assessed using permutations, did not reveal any effect of stimulation either. The NPC results, however, suggest involvement of large portions of the anterior insula and secondary somatosensory cortex. The Fisher, Stouffer and Mudholkar–George combining functions were particularly successful in recovering a small area of activity in the midbrain and periaqueductal gray area, which would be expected from previous studies on pain, but which could not be located from the original, non-combined data.


A detailed assessment of power, using a variable number of modalities and a variable number of modalities containing signal, is shown in the paper.

Combinations or conjunctions?

Combination, as done via NPC, differs from conjunction (Nichols et al., 2005) in the following way: in the combination, one seeks aggregate significance across partial tests, without requiring that any individual test be significant on its own. In the conjunction, it is necessary that all of them, with no exception, are significant. As indicated above, the NPC forms a union-intersection test (UIT; Roy, 1953), whereas the conjunction forms an intersection-union test (IUT; Berger, 1982). The former can be declared significant if any (or an aggregate) of the partial tests is significant, whereas the latter is significant only if all the partial tests are.

Availability

The NPC, with the modifications for brain imaging, is available in the tool PALM — Permutation Analysis of Linear Models. It runs in either Matlab or Octave, and is free (GPL).

References

Contributed to this post: Tom Nichols.

Permutation tests in the Human Connectome Project

Permutation tests are known to be superior to parametric tests: they are based on only a few assumptions, essentially that the data are exchangeable, and they allow correction for the multiplicity of tests and the use of various non-standard statistics. The exchangeability assumption requires that the joint distribution of the data remain unaltered when the data are permuted; usually this means that each observation needs to be independent from the others.

In many studies, however, there are repeated measurements on the same subjects, which violates exchangeability: clearly, the various measurements obtained from a given subject are not independent from each other. In the Human Connectome Project (HCP) (Van Essen et al, 2012; 2013; see references at the end), subjects are sampled along with their siblings (most of them are twins), such that independence cannot be guaranteed either.

In Winkler et al. (2014), certain structured types of non-independence in brain imaging were addressed through the definition of exchangeability blocks (EBs). Observations within an EB can be shuffled freely or, alternatively, the EBs themselves can be shuffled as a whole. This allows various designs that otherwise could not be assessed through permutations.

The same idea can be generalised for blocks that are nested within other blocks, in a multi-level fashion. In the paper Multi-level Block Permutation (Winkler et al., 2015) we presented a method that allows blocks to be shuffled as a whole and, within them, sub-blocks to be further shuffled, in a recursive process. The method is flexible enough to accommodate permutations, sign-flippings (sometimes also called “wild bootstrap”), and permutations together with sign-flippings.
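
The toy sketch below (Python with NumPy) illustrates the two basic modes, within-block and whole-block shuffling, for a single level of blocks; the actual method handles blocks nested within blocks recursively, as well as sign-flippings, and the block structure shown is purely illustrative.

```python
import numpy as np

def shuffle_within(blocks, rng):
    """Permute observation indices freely within each block."""
    idx = np.arange(len(blocks))
    for b in np.unique(blocks):
        members = np.where(blocks == b)[0]
        idx[members] = members[rng.permutation(len(members))]
    return idx

def shuffle_whole(blocks, rng):
    """Permute whole blocks, keeping the order of observations within each block.
    Blocks may only be exchanged with other blocks of the same size."""
    unique = np.unique(blocks)
    sizes = {b: int(np.sum(blocks == b)) for b in unique}
    idx = np.arange(len(blocks))
    for size in set(sizes.values()):
        same = [b for b in unique if sizes[b] == size]
        perm = rng.permutation(len(same))
        for b, b_new in zip(same, [same[k] for k in perm]):
            idx[np.where(blocks == b)[0]] = np.where(blocks == b_new)[0]
    return idx

# Example: four sibships (blocks), two of size 2 and two of size 3
rng = np.random.default_rng(0)
blocks = np.array([1, 1, 2, 2, 3, 3, 3, 4, 4, 4])
print(shuffle_within(blocks, rng))   # apply as data[shuffle_within(blocks, rng)]
print(shuffle_whole(blocks, rng))
```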

In particular, this permutation scheme allows the data of the HCP to be analysed via permutations: subjects are allowed to be shuffled with their siblings while the intra-sibship joint distribution is maintained, and each sibship is allowed to be shuffled with others of the same type.

In the paper we show that the type I error is controlled at the nominal level, and that the power is just marginally smaller than what would be obtained if free permutation were allowed. The more complex the block structure, the larger the reduction in power, although with large sample sizes the difference is barely noticeable.

Importantly, simply ignoring the family structure in designs such as this causes the error rates not to be controlled, with an excess of false positives and invalid results. We show in the paper examples of false positives that can arise, even after correction for multiple testing, when testing associations between cortical thickness, cortical area, and measures of body size such as height, weight, and body-mass index, all of them highly heritable. Such false positives can be avoided with permutation tests that respect the family structure.

The figure at the top shows how the subjects of the HCP (terminal dots, shown in white colour) can be shuffled or not, while respecting the family structure. Blue dots indicate branches that can be permuted, whereas red dots indicate branches that cannot (see the main paper for details). This diagram includes 232 subjects of an early public release of HCP data. The tree on the left considers dizygotic twins as a category on their own, i.e., that cannot be shuffled with ordinary siblings, whereas the tree on the right considers dizygotic twins as ordinary siblings.

The first applied study using our strategy has just appeared. The method is implemented in the freely available package PALM — Permutation Analysis of Linear Models, and a set of practical steps to use it with actual HCP data is available here.

References

Another look into Pillai’s trace

In a previous post, the univariate and multivariate test statistics most commonly used with the general linear model (GLM) were presented. Here an alternative formulation for one of these statistics, Pillai’s trace (Pillai, 1955; references at the end), commonly used in MANOVA and MANCOVA tests, is presented.

We begin with a multivariate general linear model expressed as:

\mathbf{Y} = \mathbf{M} \boldsymbol{\psi} + \boldsymbol{\epsilon}

where \mathbf{Y} is the N \times K full rank matrix of observed data, with N observations of K distinct (possibly non-independent) variables, \mathbf{M} is the full-rank N \times R design matrix that includes explanatory variables (i.e., effects of interest and possibly nuisance effects), \boldsymbol{\psi} is the R \times K matrix of regression coefficients, and \boldsymbol{\epsilon} is the N \times K matrix of random errors. Estimates for the regression coefficients can be computed as \boldsymbol{\hat{\psi}} = \mathbf{M}^{+}\mathbf{Y}, where the superscript (^{+}) denotes a pseudo-inverse.

The null hypothesis, and a simplification

One is generally interested in testing the null hypothesis that a contrast of regression coefficients is equal to zero, i.e., \mathcal{H}_{0} : \mathbf{C}'\boldsymbol{\psi}\mathbf{D} = \boldsymbol{0}, where \mathbf{C} is an R \times S full-rank matrix of S contrasts of coefficients on the regressors encoded in \mathbf{M}, 1 \leqslant S \leqslant R, and \mathbf{D} is a K \times Q full-rank matrix of Q contrasts of coefficients on the dependent, response variables in \mathbf{Y}, 1 \leqslant Q \leqslant K; if K=1 or Q=1, the model is univariate. Once the hypothesis has been established, \mathbf{Y} can be equivalently redefined as \mathbf{Y}\mathbf{D}, such that the contrast \mathbf{D} can be omitted for simplicity, and the null hypothesis stated as \mathcal{H}_{0} : \mathbf{C}'\boldsymbol{\psi} = \boldsymbol{0}.

Model partitioning

It is useful to consider a transformation of the model into a partitioned one:

\mathbf{Y} = \mathbf{X}\boldsymbol{\beta} + \mathbf{Z}\boldsymbol{\gamma} + \boldsymbol{\epsilon}

where \mathbf{X} is the matrix with regressors of interest, \mathbf{Z} is the matrix with nuisance regressors, and \boldsymbol{\beta} and \boldsymbol{\gamma} are respectively the matrices of regression coefficients. From this model we can also define the projection (hat) matrices \mathbf{H}_{\mathbf{X}}=\mathbf{X}\mathbf{X}^{+} and \mathbf{H}_{\mathbf{Z}}=\mathbf{Z}\mathbf{Z}^{+} due to the regressors of interest and nuisance, respectively, and the residual-forming matrices \mathbf{R}_{\mathbf{X}}=\mathbf{I}-\mathbf{H}_{\mathbf{X}} and \mathbf{R}_{\mathbf{Z}}=\mathbf{I}-\mathbf{H}_{\mathbf{Z}}.

Such partitioning is not unique, and schemes can be as simple as separating the columns of \mathbf{M} as \left[ \mathbf{X} \; \mathbf{Z} \right], with \boldsymbol{\psi} = \left[ \boldsymbol{\beta}' \; \boldsymbol{\gamma}' \right]'. More involved strategies can, however, be devised to obtain some practical benefits. One such partitioning is to define \mathbf{X} = \mathbf{M} \mathbf{Q} \mathbf{C} \left(\mathbf{C}'\mathbf{Q}\mathbf{C}\right)^{-1} and \mathbf{Z} = \mathbf{M} \mathbf{Q} \mathbf{C}_v \left(\mathbf{C}_v'\mathbf{Q}\mathbf{C}_v\right)^{-1}, where \mathbf{Q}=(\mathbf{M}'\mathbf{M})^{-1}, \mathbf{C}_v=\mathbf{C}_u-\mathbf{C}(\mathbf{C}'\mathbf{Q}\mathbf{C})^{-1}\mathbf{C}'\mathbf{Q}\mathbf{C}_u, and \mathbf{C}_u has R-\mathsf{rank}\left(\mathbf{C}\right) columns that span the null space of \mathbf{C}, such that [\mathbf{C} \; \mathbf{C}_u] is an R \times R invertible, full-rank matrix (Smith et al, 2007). This partitioning has a number of features: \boldsymbol{\hat{\beta}} = \mathbf{C}'\boldsymbol{\hat{\psi}} and \widehat{\mathsf{Cov}}(\boldsymbol{\hat{\beta}}) = \mathbf{C}'\widehat{\mathsf{Cov}}(\boldsymbol{\hat{\psi}})\mathbf{C}, i.e., estimates and variances of \boldsymbol{\beta} for inference on the partitioned model correspond exactly to the same inference on the original model; \mathbf{X} is orthogonal to \mathbf{Z}; and \mathsf{span}(\mathbf{X}) \cup \mathsf{span}(\mathbf{Z}) = \mathsf{span}(\mathbf{M}), i.e., the partitioned model spans the same space as the original.
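
A sketch of this partitioning in Python with NumPy/SciPy is below; it assumes that \mathbf{M}'\mathbf{M} and the contrasted terms are invertible, and the example design and contrast are illustrative.

```python
import numpy as np
from scipy.linalg import null_space

def partition(M, C):
    """Partition the design M into effects of interest X and nuisance Z,
    given the contrast C (the scheme of Smith et al., 2007)."""
    Q  = np.linalg.inv(M.T @ M)
    Cu = null_space(C.T)                                  # columns spanning the null space of C'
    Cv = Cu - C @ np.linalg.inv(C.T @ Q @ C) @ C.T @ Q @ Cu
    X  = M @ Q @ C  @ np.linalg.inv(C.T  @ Q @ C)
    Z  = M @ Q @ Cv @ np.linalg.inv(Cv.T @ Q @ Cv)
    return X, Z

rng = np.random.default_rng(0)
M = np.column_stack([rng.normal(size=(20, 2)), np.ones(20)])   # example design
C = np.array([[1.0], [0.0], [0.0]])                            # test the first regressor
X, Z = partition(M, C)
print(np.allclose(X.T @ Z, 0))                                 # True: X is orthogonal to Z
```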

Another partitioning scheme, derived by Ridgway (2009), defines \mathbf{X}=\mathbf{M}(\mathbf{C}^+)' and \mathbf{Z}=\mathbf{M}-\mathbf{M}\mathbf{C}\mathbf{C}^{+}. As with the previous strategy, the parameters of interest in the partitioned model are equal to the contrast of the original parameters. A full column rank nuisance partition can be obtained from the singular value decomposition (SVD) of \mathbf{Z}, which will also provide orthonormal columns for the nuisance partition. Orthogonality between regressors of interest and nuisance can be obtained by redefining the regressors of interest as \mathbf{R}_{\mathbf{Z}}\mathbf{X}.

The usual multivariate statistics

For the multivariate statistics, define generically:

\mathbf{H}=(\mathbf{C}'\boldsymbol{\hat{\psi}}\mathbf{D})' \left(\mathbf{C}'(\mathbf{M}'\mathbf{M})^{-1}\mathbf{C} \right)^{-1} (\mathbf{C}'\boldsymbol{\hat{\psi}}\mathbf{D})

as the sums of products explained by the model (hypothesis) and:

\mathbf{E} = (\boldsymbol{\hat{\epsilon}}\mathbf{D})'(\boldsymbol{\hat{\epsilon}}\mathbf{D})

as the sums of the products of the residuals, i.e., that remain unexplained. With the simplification to the original model that redefined \mathbf{Y} as \mathbf{Y}\mathbf{D}, the \mathbf{D} can be dropped, so that we have \mathbf{H}=(\mathbf{C}'\boldsymbol{\hat{\psi}})' \left(\mathbf{C}'(\mathbf{M}'\mathbf{M})^{-1}\mathbf{C} \right)^{-1} (\mathbf{C}'\boldsymbol{\hat{\psi}}) and \mathbf{E} = \boldsymbol{\hat{\epsilon}}'\boldsymbol{\hat{\epsilon}}. The various well-known multivariate statistics (see this earlier blog entry) can be written as a function of \mathbf{H} and \mathbf{E}. Pillai’s trace is:

T=\mathsf{trace}\left(\mathbf{H}(\mathbf{H}+\mathbf{E})^{-1}\right)

More simplifications

With the partitioning, other simplifications are possible:

\mathbf{H}=\boldsymbol{\hat{\beta}}' (\mathbf{X}'\mathbf{X})\boldsymbol{\hat{\beta}} = ( \mathbf{X}\boldsymbol{\hat{\beta}})'(\mathbf{X}\boldsymbol{\hat{\beta}})

Recalling that \mathbf{X}'\mathbf{Z}=\mathbf{0}, and defining \tilde{\mathbf{Y}}=\mathbf{R}_{\mathbf{Z}}\mathbf{Y}, we have:

\mathbf{H} = (\mathbf{H}_{\mathbf{X}}\tilde{\mathbf{Y}})'(\mathbf{H}_{\mathbf{X}}\tilde{\mathbf{Y}}) = \tilde{\mathbf{Y}}'\mathbf{H}_{\mathbf{X}}\tilde{\mathbf{Y}}

The unexplained sums of products can be written in a similar manner:

\mathbf{E} = (\mathbf{R}_{\mathbf{X}}\tilde{\mathbf{Y}})'(\mathbf{R}_{\mathbf{X}}\tilde{\mathbf{Y}}) = \tilde{\mathbf{Y}}'\mathbf{R}_{\mathbf{X}}\tilde{\mathbf{Y}}

The term \mathbf{H}+\mathbf{E} in the Pillai’s trace can therefore be rewritten as:

\mathbf{H}+\mathbf{E}= \tilde{\mathbf{Y}}'(\mathbf{H}_{\mathbf{X}} + \mathbf{R}_{\mathbf{X}})\tilde{\mathbf{Y}} = \tilde{\mathbf{Y}}'\tilde{\mathbf{Y}}

Using the property that the trace of a product is invariant to a circular permutation of the factors, Pillai’s statistic can then be written as:

\begin{array}{ccl} T&=&\mathsf{trace}\left(\tilde{\mathbf{Y}}'\mathbf{H}_{\mathbf{X}}\tilde{\mathbf{Y}}\left(\tilde{\mathbf{Y}}'\tilde{\mathbf{Y}}\right)^{+}\right)\\ {}&=&\mathsf{trace}\left(\mathbf{H}_{\mathbf{X}}\tilde{\mathbf{Y}}\left(\tilde{\mathbf{Y}}'\tilde{\mathbf{Y}}\right)^{+}\tilde{\mathbf{Y}}'\right)\\ {}&=&\mathsf{trace}\left(\mathbf{H}_{\mathbf{X}}\tilde{\mathbf{Y}}\tilde{\mathbf{Y}}^{+}\right)\\ \end{array}

The final, alternative form

Using the singular value decomposition we have \tilde{\mathbf{Y}} = \mathbf{U}\mathbf{S}\mathbf{V}' and \tilde{\mathbf{Y}}^{+} = \mathbf{V}\mathbf{S}^{+}\mathbf{U}', where \mathbf{U} contains only the columns that correspond to non-zero singular values. Thus, the above can be rewritten as:

\begin{array}{ccl} T&=&\mathsf{trace}\left(\mathbf{H}_{\mathbf{X}} \mathbf{U}\mathbf{S}\mathbf{V}' \mathbf{V}\mathbf{S}^{+}\mathbf{U}' \right)\\ {}&=&\mathsf{trace}\left(\mathbf{H}_{\mathbf{X}} \mathbf{U}\mathbf{U}' \right)\\ \end{array}

The SVD transformation is useful for languages or libraries that offer a fast implementation. Otherwise, using a pseudoinverse yields the same result, perhaps only slightly slower. In this case, T=\mathsf{trace}\left(\mathbf{H}_{\mathbf{X}}\tilde{\mathbf{Y}}\tilde{\mathbf{Y}}^{+}\right).
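
The equivalence can be verified numerically, as in the sketch below (Python with NumPy, simulated data, and a partitioned design in which \mathbf{X} has already been orthogonalised with respect to \mathbf{Z}).

```python
import numpy as np

rng = np.random.default_rng(0)
N, K = 30, 3                                              # observations, response variables
Z  = np.column_stack([np.ones(N), rng.normal(size=N)])    # nuisance (including intercept)
Rz = np.eye(N) - Z @ np.linalg.pinv(Z)                    # residual-forming matrix of Z
X  = Rz @ rng.normal(size=(N, 2))                         # regressors of interest, orthogonal to Z
M  = np.column_stack([X, Z])
Y  = rng.normal(size=(N, K))

# Classic form, from H and E
psi = np.linalg.pinv(M) @ Y
res = Y - M @ psi
C   = np.vstack([np.eye(2), np.zeros((2, 2))])            # contrast selecting the columns of X
H   = (C.T @ psi).T @ np.linalg.inv(C.T @ np.linalg.inv(M.T @ M) @ C) @ (C.T @ psi)
E   = res.T @ res
T_classic = np.trace(H @ np.linalg.inv(H + E))

# Simplified form
Hx = X @ np.linalg.pinv(X)                                # hat matrix of X
Yt = Rz @ Y                                               # Y-tilde
T_simple = np.trace(Hx @ Yt @ np.linalg.pinv(Yt))

print(np.isclose(T_classic, T_simple))                    # True
```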

Importance

If we define \mathbf{A}\equiv\mathbf{H}_{\mathbf{X}} and \mathbf{W}\equiv\mathbf{U}\mathbf{U}' (or \mathbf{W}\equiv\tilde{\mathbf{Y}}\tilde{\mathbf{Y}}^{+}), then T=\mathsf{trace}\left(\mathbf{A}\mathbf{W}\right). The first three moments of the permutation distribution of statistics that can be written in this form can be computed analytically once \mathbf{A} and \mathbf{W} are known. With the first three moments, a gamma distribution (Pearson type III) can be fitted, thus allowing p-values to be computed without resorting either to the usual beta approximation to Pillai’s trace or to permutations, yet with results that are not based on the assumption of normality (Mardia, 1971; Kazi-Aoual, 1995; Minas and Montana, 2014).

Availability

This simplification is available in PALM, for use with imaging and non-imaging data, using Pillai’s trace itself, or a modification that allows inference on univariate statistics. As of today, this option is not yet documented, but should become openly available soon.

References

Update: 20.Jan.2016: A slight simplification was applied to the formulas above so as to make them more elegant and remove some redundancy. The result is the same.

All GLM formulas

We often find ourselves needing to quickly compute a statistic for a certain dataset, but finding the formulas is not always straightforward. Using statistical software is helpful, but it also often happens that the reported results are not exactly what one believes they represent. Moreover, even when using these packages, it is always good to have in mind the meaning of the statistics and how they are computed. Here the formulas for the statistics most commonly used with the general linear model (GLM) are presented, all in matrix form, such that they can be easily implemented in Octave or Matlab.

I — Model

We consider two models: one univariate, the other multivariate. The univariate model is a particular case of the multivariate one, but for univariate problems it is simpler to use the smaller, particular case.

Univariate model

The univariate GLM can be written as:

\mathbf{y} = \mathbf{M}\boldsymbol{\psi} + \boldsymbol{\epsilon}

where \mathbf{y} is the N \times 1 vector of observations, \mathbf{M} is the N \times s matrix of explanatory variables, \boldsymbol{\psi} is the s \times 1 vector of regression coefficients, and \boldsymbol{\epsilon} is the N \times 1 vector of residuals.

The null hypothesis can be stated as \mathcal{H}_0 : \mathbf{C}'\boldsymbol{\psi} = 0, where \mathbf{C} is an s \times s' matrix that defines a contrast of regression coefficients, satisfying \mathsf{rank}(\mathbf{C}) = s' and 1 \leqslant s' \leqslant s.

Multivariate model

The multivariate GLM can be written as:

\mathbf{Y} = \mathbf{M}\boldsymbol{\Psi} + \boldsymbol{\epsilon}

where \mathbf{Y} is the N \times q matrix of observations, \mathbf{M} is the N \times s matrix of explanatory variables, \boldsymbol{\Psi} is the s \times q matrix of regression coefficients, and \boldsymbol{\epsilon} is the N \times q matrix of residuals.

The null hypothesis can be stated as \mathcal{H}_0 : \mathbf{C}'\boldsymbol{\Psi}\mathbf{D} = 0, where \mathbf{C} is defined as above, and \mathbf{D} is a q \times q' matrix that defines a contrast of observed variables, satisfying \mathsf{rank}(\mathbf{D}) = q' and 1 \leqslant q' \leqslant q.

II — Estimation of parameters

In the model, the unknowns of interest are the values arranged in \boldsymbol{\Psi}. These can be estimated as:

\boldsymbol{\hat{\Psi}} = (\mathbf{M}'\mathbf{M})^{+}(\mathbf{M}'\mathbf{Y})

where the ^{+} represents a pseudo-inverse. The residuals can be computed as:

\boldsymbol{\hat{\epsilon}} = \mathbf{Y} - \mathbf{M}\boldsymbol{\hat{\Psi}}

The above also applies to the univariate case (\mathbf{y} is a particular case of \mathbf{Y}, and \boldsymbol{\psi} of \boldsymbol{\Psi}).

III – Univariate statistics

Coefficient of determination, R2

The following is the fraction of the variance explained by the part of the model determined by the contrast. It applies to mean-centered data and design, i.e., \tilde{\mathbf{y}}=\mathbf{y}-\bar{y} and \tilde{\mathbf{M}}=\mathbf{M}-\bar{\mathbf{m}}.

R^2 = \dfrac{\boldsymbol{\hat{\psi}}'\mathbf{C} \left(\mathbf{C}'(\tilde{\mathbf{M}}'\tilde{\mathbf{M}})^{-1}\mathbf{C} \right)^{-1} \mathbf{C}'\boldsymbol{\hat{\psi}}}{\tilde{\mathbf{y}}'\tilde{\mathbf{y}}}

Note that the portion of the variance explained by nuisance variables (if any) remains in the denominator. To have these taken into account, consider the squared partial correlation coefficient, or Pillai’s trace with univariate data, both described further down.

Pearson’s correlation coefficient

When \mathsf{rank}\left(\mathbf{C}\right) = 1, the multiple correlation coefficient can be computed from the R^2 statistic as:

R = \mathsf{sign}\left(\mathbf{C}'\boldsymbol{\hat{\psi}}\right)\sqrt{R^2}

This value should not be confused, even in the presence of nuisance, with the partial correlation coefficient (see below).

Student’s t statistic

If \mathsf{rank}\left(\mathbf{C}\right) = 1, the Student’s t statistic can be computed as:

t = \boldsymbol{\hat{\psi}}'\mathbf{C} \left(\mathbf{C}'(\mathbf{M}'\mathbf{M})^{-1}\mathbf{C} \right)^{-\frac{1}{2}} \left/ \sqrt{\dfrac{\boldsymbol{\hat{\epsilon}}'\boldsymbol{\hat{\epsilon}}}{N-\mathsf{rank}\left(\mathbf{M}\right)}} \right.

F statistic

The F statistic can be computed as:

F = \left.\dfrac{\boldsymbol{\hat{\psi}}'\mathbf{C} \left( \mathbf{C}'(\mathbf{M}'\mathbf{M})^{-1}\mathbf{C} \right)^{-1} \mathbf{C}'\boldsymbol{\hat{\psi}}}{\mathsf{rank}\left(\mathbf{C} \right)} \right/ \dfrac{\boldsymbol{\hat{\epsilon}}'\boldsymbol{\hat{\epsilon}}}{N-\mathsf{rank}\left(\mathbf{M}\right)}
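
A sketch of both statistics in Python with NumPy is below; the data, design and contrast are illustrative, and the contrast \mathbf{C} is passed as a column matrix.

```python
import numpy as np

def glm_stats(y, M, C):
    """t (if rank(C) = 1) and F statistics for contrast C in the GLM y = M*psi + epsilon."""
    psi  = np.linalg.pinv(M) @ y
    res  = y - M @ psi
    df   = len(y) - np.linalg.matrix_rank(M)
    sig2 = (res @ res) / df                               # residual variance estimate
    Cpsi = C.T @ psi
    CMC  = C.T @ np.linalg.pinv(M.T @ M) @ C
    F    = (Cpsi @ np.linalg.inv(CMC) @ Cpsi) / np.linalg.matrix_rank(C) / sig2
    t    = Cpsi.item() / np.sqrt(CMC.item() * sig2) if C.shape[1] == 1 else None
    return t, F

rng = np.random.default_rng(0)
N = 25
M = np.column_stack([rng.normal(size=N), np.ones(N)])     # one regressor + intercept
y = 0.5 * M[:, 0] + rng.normal(size=N)
C = np.array([[1.0], [0.0]])
print(glm_stats(y, M, C))                                 # note that t**2 equals F when rank(C) = 1
```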

Aspin–Welch v

If homoscedastic variances cannot be assumed, and \mathsf{rank}\left(\mathbf{C}\right) = 1, this is equivalent to the Behrens–Fisher problem, and the Aspin–Welch v statistic can be computed as:

v = \boldsymbol{\hat{\psi}}'\mathbf{C} \left(\mathbf{C}'(\mathbf{M}'\mathbf{W}\mathbf{M})^{-1}\mathbf{C} \right)^{-\frac{1}{2}}

where \mathbf{W} is a diagonal matrix that has elements:

W_{nn}=\dfrac{\sum_{n' \in g_{n}}R_{n'n'}}{\boldsymbol{\hat{\epsilon}}_{g_{n}}'\boldsymbol{\hat{\epsilon}}_{g_{n}}}

and where R_{n'n'} is the n'-th diagonal element of the residual-forming matrix \mathbf{R} = \mathbf{I}-\mathbf{M}\mathbf{M}^{+}, and g_{n} is the variance group to which the n-th observation belongs.

Generalised G statistic

If variances cannot be assumed to be the same across all observations, a generalisation of the F statistic can be computed as:

G = \dfrac{\boldsymbol{\hat{\psi}}'\mathbf{C} \left(\mathbf{C}'(\mathbf{M}'\mathbf{W}\mathbf{M})^{-1}\mathbf{C} \right)^{-1} \mathbf{C}'\boldsymbol{\hat{\psi}}}{\Lambda \cdot \mathsf{rank}\left(\mathbf{C}\right)}

where \mathbf{W} is defined as above, and the remaining denominator term, \Lambda, is given by:

\Lambda = 1+\frac{2(s-1)}{s(s+2)}\sum_{g} \frac{1}{\sum_{n \in g}R_{nn}} \left(1-\frac{\sum_{n \in g}W_{nn}}{\mathsf{trace}\left(\mathbf{W}\right)}\right)^2

There is another post on the G-statistic (here).

Partial correlation coefficient

When \mathsf{rank}\left(\mathbf{C}\right) = 1, the partial correlation can be computed from the Student’s t statistic as:

r = \mathsf{sign}\left(t\right)\sqrt{\dfrac{t^2}{N-\mathsf{rank}\left(\mathbf{M}\right)+t^2}}

The square of the partial correlation corresponds to Pillai’s trace applied to a univariate model, and it can also be computed from the F-statistic as:

r^2 = \dfrac{F}{\frac{N-\mathsf{rank}\left(\mathbf{M}\right)}{\mathsf{rank}\left(\mathbf{C}\right)}+F}

Likewise, if r is known, the formula can be solved for t:

t = \dfrac{r\sqrt{N-\mathsf{rank}\left(\mathbf{M}\right)}}{\sqrt{1-r^2}}

or for F:

F = \dfrac{r^2}{1-r^2}\times\dfrac{N-\mathsf{rank}\left(\mathbf{M}\right)}{\mathsf{rank}\left(\mathbf{C}\right)}

The partial correlation can also be computed at once for all variables vs. all other variables as follows. Let \mathbf{A} = \left[\mathbf{y}\; \mathbf{M}\right], and \mathsf{r}\left(\mathbf{A}\right) be the inverse of the correlation matrix of the columns of \mathbf{A}, and \mathsf{d}\left(\cdot\right) the diagonal operator, that returns a column vector with the diagonal entries of a square matrix. Then the matrix with the partial correlations is:

\mathbf{r} = -\mathsf{r}\left(\mathbf{A}\right) \odot \left(\mathsf{d}\left(\mathsf{r}\left(\mathbf{A}\right)\right)\mathsf{d}\left(\mathsf{r}\left(\mathbf{A}\right)\right)'\right)^{-\frac{1}{2}}

where \odot is the Hadamard product, and the power “-\frac{1}{2}” is taken elementwise (i.e., not matrix power).
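
A sketch of this computation in Python with NumPy is below; the diagonal of the resulting matrix equals -1 and is not meaningful, whereas the off-diagonal entries are the partial correlations (the example data are illustrative).

```python
import numpy as np

def partial_correlations(A):
    """Partial correlation of each column of A vs. each other column,
    controlling for all the remaining columns."""
    rinv = np.linalg.inv(np.corrcoef(A, rowvar=False))    # inverse of the correlation matrix
    d    = np.sqrt(np.diag(rinv))
    return -rinv / np.outer(d, d)                         # elementwise division

rng = np.random.default_rng(0)
M = rng.normal(size=(100, 3))
y = M @ np.array([0.5, 0.0, 0.2]) + rng.normal(size=100)
A = np.column_stack([y, M])                               # A = [y M]
print(partial_correlations(A))
```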

IV – Multivariate statistics

For the multivariate statistics, define generically \mathbf{E} = (\boldsymbol{\hat{\epsilon}}\mathbf{D})'(\boldsymbol{\hat{\epsilon}}\mathbf{D}) as the sums of the products of the residuals, and \mathbf{H}=(\mathbf{C}'\boldsymbol{\hat{\Psi}}\mathbf{D})' \left(\mathbf{C}'(\mathbf{M}'\mathbf{M})^{-1}\mathbf{C} \right)^{-1} (\mathbf{C}'\boldsymbol{\hat{\Psi}}\mathbf{D}) as the sums of products of the hypothesis. In fact, the original model can be modified as \tilde{\mathbf{Y}} = \mathbf{M}\tilde{\boldsymbol{\Psi}} + \tilde{\boldsymbol{\epsilon}}, where \tilde{\mathbf{Y}}=\mathbf{Y}\mathbf{D}, \tilde{\boldsymbol{\Psi}} = \boldsymbol{\Psi}\mathbf{D} and \tilde{\boldsymbol{\epsilon}}=\boldsymbol{\epsilon}\mathbf{D}.

If \mathsf{rank}\left(\mathbf{D}\right)=1, this is a univariate model; otherwise it remains multivariate. Either way, \mathbf{D} can be omitted from the formulas. From now on this simplification is adopted, so that \mathbf{E} = \boldsymbol{\hat{\epsilon}}'\boldsymbol{\hat{\epsilon}} and \mathbf{H}=\boldsymbol{\hat{\Psi}}'\mathbf{C} \left(\mathbf{C}'(\mathbf{M}'\mathbf{M})^{-1}\mathbf{C} \right)^{-1} \mathbf{C}'\boldsymbol{\hat{\Psi}}.

Hotelling T2

If \mathsf{rank}\left(\mathbf{C}\right) = 1, the Hotelling’s T^2 statistic can be computed as:

T^2 = \mathbf{C}'\boldsymbol{\hat{\Psi}}\boldsymbol{\Sigma}^{-\frac{1}{2}}\left(\mathbf{C}'(\mathbf{M}'\mathbf{M})^{-1}\mathbf{C} \right)^{-1}\boldsymbol{\Sigma}^{-\frac{1}{2}} \boldsymbol{\hat{\Psi}}'\mathbf{C}

where \boldsymbol{\Sigma} = \mathbf{E}/\left(N-\mathsf{rank}\left(\mathbf{M}\right)\right)

Multivariate alternatives to the F statistic

Classical MANOVA/MANCOVA statistics can be based on the ratio between the sums of products of the hypothesis and the sums of products of the residuals, or the ratio between the sums of products of the hypothesis and the total sums of products. In other words, define:

\begin{array}{ccl} \mathbf{A} &=& \mathbf{H}\mathbf{E}^{-1} = \boldsymbol{\hat{\Psi}}'\mathbf{C} \left(\mathbf{C}'(\mathbf{M}'\mathbf{M})^{-1}\mathbf{C} \right)^{-1} \mathbf{C}'\boldsymbol{\hat{\Psi}}\mathbf{E}^{-1}\\ \mathbf{B} &=& \mathbf{H}\left(\mathbf{E}+\mathbf{H}\right)^{-1} \end{array}

Let \lambda_i be the eigenvalues of \mathbf{A}, and \theta_i the eigenvalues of \mathbf{B}. Then:

  • Wilks’ \Lambda = \prod_{i} \dfrac{1}{1+\lambda_i} = \dfrac{|\mathbf{E}|}{|\mathbf{E}+\mathbf{H}|}.
  • Lawley–Hotelling’s trace: \sum_i \lambda_i = \mathsf{trace}\left(\mathbf{A}\right).
  • Pillai’s trace: \sum_i \dfrac{\lambda_i}{1+\lambda_i} = \sum_i \theta_i = \mathsf{trace}\left(\mathbf{B}\right).
  • Roy’s largest root (ii): \lambda_1 = \mathsf{max}_i\left(\lambda_i\right) = \dfrac{\theta_1}{1-\theta_1} (analogous to F).
  • Roy’s largest root (iii): \theta_1 = \mathsf{max}_i\left(\theta_i\right) = \dfrac{\lambda_1}{1+\lambda_1} (analogous to R^2).

When \mathsf{rank}\left(\mathbf{C}\right) = 1, or when \mathbf{Y} is univariate, or both, Lawley–Hotelling’s trace is equal to Roy’s (ii) largest root, Pillai’s trace is equal to Roy’s (iii) largest root, and Wilks’ \Lambda added to Pillai’s trace equals unity. The value \rho_i=\sqrt{\theta_i} is the i-th canonical correlation.
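
A sketch that computes these statistics from \mathbf{H} and \mathbf{E}, in Python with NumPy, is below; the example model, data and contrast are illustrative.

```python
import numpy as np

def multivariate_stats(H, E):
    """Classical MANOVA statistics from the hypothesis (H) and residual (E)
    sums-of-products matrices."""
    lam = np.linalg.eigvals(H @ np.linalg.inv(E)).real    # eigenvalues of A (real up to round-off)
    return {
        "Wilks"            : np.prod(1 / (1 + lam)),
        "Lawley-Hotelling" : np.sum(lam),
        "Pillai"           : np.sum(lam / (1 + lam)),
        "Roy (ii)"         : np.max(lam),
        "Roy (iii)"        : np.max(lam) / (1 + np.max(lam)),
    }

rng = np.random.default_rng(0)
N, K = 30, 3
M = np.column_stack([rng.normal(size=(N, 2)), np.ones(N)])
Y = rng.normal(size=(N, K))
psi = np.linalg.pinv(M) @ Y
res = Y - M @ psi
C = np.array([[1.0], [0.0], [0.0]])
H = (C.T @ psi).T @ np.linalg.inv(C.T @ np.linalg.inv(M.T @ M) @ C) @ (C.T @ psi)
E = res.T @ res
print(multivariate_stats(H, E))
```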

References