# The German tank problem and the novel coronavirus

We often think of statistics as a way to summarize large amounts of data. For example, we can collect data from thousands of subjects, and extract a single number that tells something about these subjects. The well known German tank problem shows that, in a certain way, statistics can also be used for the opposite: using incomplete data and a few reasonable assumptions (or real knowledge), statistics provides way to estimate information that offer a panoramic view of all the data. Historical problems are interesting on their own. Yet, it is not always that we see so clearly consequential historical events at the time they happen — like now.

In the Second World War, as in any other war, information could be more valuable than anything else. Intelligence reports (such as from spies) would feed the Allies with information about the industrial capacity of Nazi Germany, including details about things such as the number of tanks produced. This kind of information can have far reaching impact and not only determine the outcome of a battle, but also if a battle would even even happen or with what preparations, as the prospect of finding a militarily superior opponent is often a great deterrent.

Sometimes, German tanks, as the well known Panzer, could be captured and carefully inspected. Among the details noted were the serial number printed in various pieces, such as chassis, gearboxes, and the serial numbers of the moulds used to produce the wheels. With the serial number of even a single chassis, for example one can estimate the total number of tanks produced; knowing the serial number of a single wheel mould allows the estimation of the total number of moulds, and thus, how many wheels can be produced in a certain amount of time. But how?

If serial numbers are indeed serial, e.g., $\{1, 2, 3, \ldots, N\}$, growing uniformly and without gaps, and we see a tank that has a serial number $S$, then clearly at least $S$ tanks must have been produced. But could we have a better guess?

Let’s start by reversing the problem: suppose we knew $N$. In that case, what would be the average value of the serial numbers of all $N$ tanks? The average for uniformly distributed data like this would be $M = \frac{1 + N}{2}$, that is, the average of the first and last serial numbers.

Now, say we have only one sighting of a tank, and that has serial number $S$. Then our best guess for the average serial number is $S$ itself, as we have no additional information. Thus, with $M = S$, our guess would be $N = 2S - 1$ (that is, reorganizing the terms of the previous equation for $M$). Note that, for one sighting, this formula guarantees that $N$ is larger or equal than $S$, which makes sense: we cannot have an estimate for $N$ that is smaller than the serial number $S$ itself.

What if we had not just one, but multiple sightings? Call the number of sightings $K$. The mean is now $M = \frac{S_1 + S_2 + \ldots + S_K}{K}$, for ordered serial numbers $\{S_1, S_2, \ldots, S_K\}$. Clearly, we can’t use the same formula, because if $M$ is much smaller than $S_K$ (say, because we have seen many small serial numbers, but just a handful of larger ones), $N$ could incorrectly be estimated as less than $S_K$, which makes no sense. At least $S_K$ tanks must exist.

While incorrect for $K > 1$, the above formula gives invaluable insight: it shows that for such uniformly distributed data, approximately half of the tanks have serial number above $M$, the other half below $M$. Extending the idea, and still under the assumption that the serial numbers are uniform, we can conclude that the number of tanks below the lowest serial number $S_1$ (which is $S_1 - 1$) must be approximately the same as the (unknown) number of tanks above the highest serial number $S_K$. So, a next better estimate could be to use $N = S_K + S_1 - 1$.

We can still do better, though. Since we have $K$ sightings, we can estimate what is the average interval between sightings, i.e., $\frac{S_K}{K}$. As it is based on all $K$ sightings, this gives a better estimate of the spacing between the serial numbers than the single sighting $S_1$. The result can be added to $S_K$. The final estimate then becomes $N = S_K + \frac{S_K}{K} - 1$.

To make this concrete, say we saw tanks numbered $\{47, 62, 104, 155, 159\}$. Then our best guess would be $N \approx 190$.

At the end of the war, estimates obtained using the above method proved remarkably accurate, much more so than information provided by spies and other intelligence reports.

Let’s now see a similar example that is contemporary to us. Take the current pandemic caused by a novel coronavirus. The World Health Organization stated officially, in 14th January 2020, when there were 41 cases officially reported in China, that there was no evidence for human-to-human transmission. Yet, when the first 3 cases outside China were confirmed in 16th January 2020, epidemiologists at the Imperial College London were quick to find out that the WHO statement must have not been true. Rather, the real number of cases was likely well above 1700.

How did they make that estimate? The key insight was the realisation that only a small number of people in any major city travels internationally, particularly in such a short time span like that given by the time until the onset of symptoms for this kind of respiratory disease. If one can estimate prevalence among those who travelled, that would be a good approximation to the prevalence among those who live in the city, assuming that those who travel are an unbiased sample of the population.

Following this idea, we have: $\frac{C_t}{N_t} \approx \frac{C_s}{N_s}$, that is, the number of cases among those who travelled ($C_t$) divided by the total number of people who travelled ($N_t$) is expected to be approximately the same as the number of cases among those who stayed ($C_s$) divided by the total number of people who stayed (live) in the city ($N_s$).

The number of people served by the international airport of Wuhan is about 19 million (the size of the Wuhan metropolitan area), and the average daily number of outbound international passengers in previous years was 3301 for that time of the year (a figure publicly known, from IATA). Unfortunately, little was known outside China about the time taken between exposure to the virus and the onset of symptoms. The researchers then resorted to a proxy: the time known for the related severe respiratory disorder known as MERS, also caused by a coronavirus, which is about 10 days. Thus, we can estimate $N_t= 3301 \times 10=33010$ people travelling out, and $N_s = 19,000,000$ staying in the city. The number of known international cases was at the time $C_t = 3$. Hence:

$C_s \approx \frac{3\times 19,000,000}{33010}\approx 1727$ cases

So, using remarkably simple maths, simpler even than in our WWII German tank example, the scientists estimated that the number of actual cases in the city of Wuhan was likely far above the official figure of 41 cases. The researchers were careful to indicate that, should the probability of travelling be higher among those exposed, the number of actual cases could be smaller. The converse is true: should travellers be wealthier (thus less likely to be exposed to a possible zoonosis as initially reported), the number of actual cases could be higher.

Importantly, it is not at all likely that 1700 people would have contracted such a zoonosis from wild animals in a dense urban area like Wuhan, hence human-to-human transmission must had been occurring. Eventually the WHO confirmed human-to-human transmission on 19th January 2020. Two days later, Chinese authorities began locking down and sealing off Wuhan, thus putting into place a plan to curb the transmission.

To find out more about the original problem of the number of tanks, and also for other methods of estimation for the same problem, a good start is this article. Also invaluable, for various estimation problems related to the fast dissemination of the novel coronavirus, are all the reports by the epidemiology team at the Imperial College London, which can be found here.

# Redundancy in canonical correlation analysis

In canonical correlation analysis (CCA; Hotelling, 1936), the absolute value of a correlation is not always that helpful. For example, large canonical correlations may arise simply due to a large number of variables being investigated using a relatively small sample size; high correlations may arise simply because there are too many opportunities for finding mixtures in both sides that are highly correlated one with another.

Motivated by this perceived difficulty in the interpretation of results, Stewart and Love (1968) proposed the computation of what has been termed a redundancy index. It works as follows.

Let $\mathbf{Y}_{N \times P}$ and $\mathbf{X}_{N \times Q}$ be two sets of variables over which CCA is computed. We find canonical coefficients $\mathbf{A}_{P \times K}$ and $\mathbf{B}_{Q \times K}$, $K=\min(P,Q)$ such that the canonical variables $\mathbf{U}_{N \times K}$ and $\mathbf{V}_{N \times K}$ have maximal, diagonal correlation structure; this diagonal contains the ordered canonical correlations $r_k$.

Now that CCA has been computed, we can find the correlations between the original variables and the canonical coefficients. Let $\mathbf{\tilde{A}}_{P \times K}=\text{corr}(\mathbf{Y},\mathbf{U})$ and $\mathbf{\tilde{B}}_{Q \times K}=\text{corr}(\mathbf{X},\mathbf{V})$ be such correlations, which are termed canonical loadings or structure coefficients. Now compute the mean square for each of the columns of $\mathbf{\tilde{A}}$ and $\mathbf{\tilde{B}}$. These represent the variance extracted by the corresponding canonical variable. That is:

• Variance extracted by canonical variable $\mathbf{u}_{k}$: $\upsilon_k = \frac{1}{P}\sum_{p=1}^{P}\tilde{a}_{pk}^{2}$
• Variance extracted by canonical variable $\mathbf{v}_{k}$: $\nu_k = \frac{1}{Q}\sum_{q=1}^{Q}\tilde{b}_{qk}^{2}$

These quantities represent the mean variance extracted from the original variables by each of the canonical variables (in each side).

Compute now the proportion of variance of one canonical variable (say, $\mathbf{u}_{k}$) explained by the corresponding canonical variable in the other side (say, $\mathbf{v}_{k}$). This is given simply by $r_k^2$, the usual coefficient of determination.

The redundancy index for each canonical variable is then the product of $\upsilon_k$ and $r_k^2$ for the left side of CCA, and the product of $\nu_k$ and $r_k^2$ for the right side. That is, the index is not symmetric. It measures the proportion of variance in one of the two set of variables explained by the correlation between the $k$-th pair of canonical variables.

The sum of the redundancies for all $K$ canonical variables in one side or another forms a global redundancy metric, which indicates the proportion of the variance in a given side explained by the variance in the other.

This global redundancy can be scaled to unity, such that the redundancies for each of the canonical variables in a give side can be interpreted as the proportion of total redundancy.

If you follow the original paper by Stewart and Love (1968), $\upsilon_k$ and $\nu_k$ are column III of Table 2, the redundancy of each canonical variable for each side corresponds to column IV, and the proportion of total redundancy is in column V.

Another reference on the same topic that is worth looking is Miller (1981). In it, the author discusses that redundancy is somewhere in between CCA itself (fully symmetric) and multiple regression (fully asymmetric).

## Reference

Update

• 28.Jun.2020: A script that computes the redundancy indices is available here: redundancy.m (work with Thomas Wassenaar, University of Oxford).

# Better statistics, faster

Permutation tests are more robust and help to make scientific results more reproducible by depending on fewer assumptions. However, they are computationally intensive as recomputing a model thousands of times can be slow. The purpose of this post is to briefly list some options available for speeding up permutation.

Firstly, no speed-ups may be needed: for small sample sizes, or low resolutions, or small regions of interest, a permutation test can run in a matter of minutes. For larger data, however, accelerations may be of use. One option is acceleration through parallel processing or GPUs (for example applications of the latter, see Eklund et al., 2012, Eklund et al., 2013 and Hernández et al., 2013; references below), though this does require specialised implementation. Another option is to reduce the computational burden by exploiting the properties of the statistics and their distributions. A menu of options includes:

• Do few permutations (shorthand name: fewperms). The results remain valid on average, although the p-values will have higher variability.
• Keep permuting until a fixed number of permutations with statistic larger than the unpermuted is found (a.k.a., negative binomial; shorthand name: negbin).
• Do a few permutations, then approximate the tail of the permutation distribution by fitting a generalised Pareto distribution to its tail (shorthand name: tail).
• Approximate the permutation distribution with a gamma distribution, using simple properties of the test statistic itself, amazingly not requiring any permutations at all (shorthand name: noperm).
• Do a few permutations, then approximate the full permutation distribution by fitting a gamma distribution (shorthand name: gamma).
• Run permutations on only a few voxels, then fill the missing ones using low-rank matrix completion theory (shorthand name: lowrank).

These strategies allow accelerations >100x, yielding nearly identical results as in the non-accelerated case. Some, such as tail approximation, are generic enough to be used nearly all the most common scenarios, including univariate and multivariate tests, spatial statistics, and for correction for multiple testing.

In addition to accelerating permutation tests, some of these strategies, such as tail and noperm, allow continuous p-values to be found, and refine the p-values far into the tail of the distribution, thus avoiding the usual discreteness of p-values, which can be a problem in some applications if too few permutations are done.

These methods are available in the tool PALM — Permutation Analysis of Linear Models — and the complete description, evaluation, and application to the re-analysis of a voxel-based morphometry study (Douaud et al., 2007) have been just published in Winkler et al., 2016 (for the Supplementary Material, click here). The paper includes a flow chart prescribing these various approaches for each case, reproduced below.

The hope is that these accelerations will facilitate the use of permutation tests and, if used in combination with hardware and/or software improvements, can further expedite computation leaving little reason not to use these tests.

## References

Contributed to this post: Tom Nichols, Ged Ridgway.

# Extreme value notes

Extreme values are useful to quantify the risk of catastrophic floods, and much more.

This is a brief set of notes with an introduction to extreme value theory. For reviews, see Leadbetter et al (1983) and David and Huser (2015) [references at the end]. Also of some (historical) interest might be the classical book by Gumbel (1958). Let $X_1, \dots, X_n$ be a sequence of independent and identically distributed variables with cumulative distribution function (cdf) $F(x)$ and let $M_n =\max(X_1,\dots,X_n)$ denote the maximum.

If $F(x)$ is known, the distribution of the maximum is:

$\begin{array}{lll} P(M_n \leqslant x) &=&P(X_1 \leqslant x, \dots, X_n \leqslant x) \\ &=& P(X_1 \leqslant x) \cdots P(X_n \leqslant x) = F^n(x). \end{array}$

The distribution function $F(x)$ might, however, not be known. If data are available, it can be estimated, although small errors on the estimation of $F(x)$ can lead to large errors concerning the extreme values. Instead, an asymptotic result is given by the extremal types theorem, also known as Fisher-Tippett-Gnedenko Theorem, First Theorem of Extreme Values, or extreme value trinity theorem (called under the last name by Picklands III, 1975).

But before that, let’s make a small variable change. Working with $M_n$ directly is problematic because as $n \rightarrow \infty$, $F^n(x) \rightarrow 0$. Redefining the problem as a function of $M_n^* = \frac{M_n-b_n}{a_n}$ renders treatment simpler. The theorem can be stated then as: If there exist sequences of constants $a_n \in \mathbb{R}_{+}$ and $b_n \in \mathbb{R}$ such that, as $n \rightarrow \infty$:

$P\left(M_{n}^{*} \leqslant x \right) \rightarrow G(x)$

then $G(x)$ belongs to one of three “domains of attraction”:

• Type I (Gumbel law): $\Lambda(x) = e^{-e^{-\frac{x-b}{a}}}$, for $x \in \mathbb{R}$ indicating that the distribution of $M_n$ has an exponential tail.
• Type II (Fréchet law): $\Phi(x) = \begin{cases} 0 & x\leqslant b \\ e^{-\left(\frac{x-b}{a}\right)^{-\alpha}} & x\; \textgreater\; b \end{cases}$ indicating that the distribution of $M_n$ has a heavy tail (including polynomial decay).
• Type III (Weibull law): $\Psi(x) = \begin{cases} e^{-\left( -\frac{x-b}{a}\right)^\alpha} & x\;\textless\; b \\ 1 & x\geqslant b \end{cases}$ indicating that the distribution of $M_n$ has a light tail with finite upper bound.

Note that in the above formulation, the Weibull is reversed so that the distribution has an upper bound, as opposed to a lower one as in the Weibull distribution. Also, the parameterisation is slightly different than the one usually adopted for the Weibull distribution.

These three families have parameters $a\; \textgreater\; 0$, $b$ and, for families II and III, $\alpha\; \textgreater\; 0$. To which of the three a particular $F(x)$ is attracted is determined by the behaviour of the tail of of the distribution for large $x$. Thus, we can infer about the asymptotic properties of the maximum while having only a limited knowledge of the properties of $F(x)$.

These three limiting cases are collectively termed extreme value distributions. Types II and III were identified by Fréchet (1927), whereas type I was found by Fisher and Tippett (1928). In his work, Fréchet used $M_n^* = \frac{M_n}{a_n}$, whereas Fisher and Tippett used $M_n^* = \frac{M_n-b_n}{a_n}$. Von Mises (1936) identified various sufficient conditions for convergence to each of these forms, and Gnedenko (1943) established a complete generalisation.

## Generalised extreme value distribution

As shown above, the rescaled maxima converge in distribution to one of three families. However, all are cases of a single family that can be represented as:

$G(x) = e^{-\left(1-\xi\left(\frac{x-\mu}{\sigma}\right)\right)^{\frac{1}{\xi}}}$

defined on the set $\left\{x:1-\xi\frac{x-\mu}{\sigma}\;\textgreater\;0\right\}$, with parameters $-\infty \;\textless \;\mu\;\textless\; \infty$ (location), $\sigma\;\textgreater\;0$ (scale), and $-\infty\;\textless\;\xi\;\textless\;\infty$ (shape). This is the generalised extreme value (GEV) family of distributions. If $\xi \rightarrow 0$, it converges to Gumbel (type I), whereas if $\xi < 0$ it corresponds to Fréchet (type II), and if $\xi\;\textgreater\;0$ it corresponds to Weibull (type III). Inference on $\xi$ allows choice of a particular family for a given problem.

## Generalised Pareto distribution

For $u\rightarrow\infty$, the limiting distribution of a random variable $Y=X-u$, conditional on $X \;\textgreater\; u$, is:

$H(y) = 1-\left(1-\frac{\xi y}{\tilde{\sigma}}\right)^{\frac{1}{\xi}}$

defined for $y \;\textgreater\; 0$ and $\left(1-\frac{\xi y}{\tilde{\sigma}}\right) \;\textgreater\; 0$. The two parameters are the $\xi$ (shape) and $\tilde{\sigma}$ (scale). The shape corresponds to the same parameter $\xi$ of the GEV, whereas the scale relates to the scale of the former as $\tilde{\sigma}=\sigma-\xi(u-\mu)$.

The above is sometimes called the Picklands-Baikema-de Haan theorem or the Second Theorem of Extreme Values, and it defines another family of distributions, known as generalised Pareto distribution (GPD). It generalises an exponential distribution with parameter $\frac{1}{\tilde{\sigma}}$ as $\xi \rightarrow 0$, an uniform distribution in the interval $\left[0, \tilde{\sigma}\right]$ when $\xi = 1$, and a Pareto distribution when $\xi \;\textgreater\; 0$.

## Parameter estimation

By restricting the attention to the most common case of $-\frac{1}{2}<\xi<\frac{1}{2}$, which represent distributions approximately exponential, parametters for the GPD can be estimated using at least three methods: maximum likelihood, moments, and probability-weighted moments. These are described in Hosking and Wallis (1987). For $\xi$ outside this interval, methods have been discussed elsewhere (Oliveira, 1984). The method of moments is probably the simplest, fastest and, according to Hosking and Wallis (1987) and Knijnenburg et al (2009), has good performance for the typical cases of $-\frac{1}{2}<\xi<\frac{1}{2}$.

For a set of extreme observations, let $\bar{x}$ and $s^2$ be respectively the sample mean and variance. The moment estimators of $\tilde{\sigma}$ and $\xi$ are $\hat{\tilde{\sigma}} = \frac{\bar{x}}{2}\left(\frac{\bar{x}^2}{s^2}+1\right)$ and $\hat{\xi}=\frac{1}{2}\left(\frac{\bar{x}^2}{s^2}-1\right)$.

The accuracy of these estimates can be tested with, e.g., the Anderson-Darling goodness-of-fit test (Anderson and Darling, 1952; Choulakian and Stephens, 2001), based on the fact that, if the modelling is accurate, the p-values for the distribution should be uniformly distributed.

## Availability

Statistics of extremes are used in PALM as a way to accelerate permutation tests. More details to follow soon.

## References

The figure at the top (flood) is in public domain.

# Non-Parametric Combination (NPC) for brain imaging

Have you ever had an analysis in which there was a large set of contrasts, all of interest, and you were worried about multiple testing? An eventual effect would be missed by a simple Bonferroni correction, but you did not know what else to do? Or did you have a set of different studies and you wished to obtain a style of meta-analytic result, indicating whether there would be evidence across all of them, without requiring the studies to be all consistently significant?

The Non-Parametric Combination (NPC) solves these issues. It is a way of performing joint inference on multiple data collected on the same experimental units (e.g., same subjects), all with minimal assumptions. The method was proposed originally by Pesarin (1990, 1992) [see references below], independently by Blair and Karninski (1993), and described extensively by Pesarin and Salmaso (2010). In this blog entry, the NPC is presented in brief, with emphasis on the modifications we introduce to render it feasible for brain imaging. The complete details are in our paper that has just been published in the journal Human Brain Mapping.

## NPC in a nutshell

The NPC consists of, in a first phase, testing each hypothesis separately using permutations that are performed synchronously across datasets; these tests are termed partial tests. The resulting statistics for each and every permutation are recorded, allowing an estimate of the complete empirical null distribution to be constructed for each one. In a second phase, the empirical p-values for each statistic are combined, for each permutation, into a joint statistic. As such a combined joint statistic is produced from the previous permutations, an estimate of its empirical distribution function is immediately known, and so is the p-value of the joint test. A flowchart of the original algorithm is shown below; click to see it side-by-side with the modified one (described below).

## A host of combining functions

The null hypothesis of the NPC is that null hypotheses for all partial tests are true, and the alternative hypothesis that any is false, which is the same null of a union-intersection test (UIT; Roy, 1953). The rejection region depends on how the combined statistic is produced. Various combining functions, which produce such combined statistics, can be considered, and some of the most well known are listed in the table below:

Method Statistic p-value
Tippett $\min \left(p_{k}\right)$ $1-\left(1-T\right)^{K}$
Fisher $-2 \sum_{k=1}^{K} \ln\left(p_{k}\right)$ $1-\chi^{2}\left(T;\;\nu=2K\right)$
Stouffer $\frac{1}{\sqrt{K}} \sum_{k=1}^{K} \Phi^{-1}\left(1-p_{k}\right)$ $1-\Phi\left(T;\;\mu=0,\;\sigma^2=1\right)$
Mudholkar–George $\frac{1}{\pi}\sqrt{\frac{3(5K+4)}{K(5K+2)}}\sum_{k=1}^{K} \ln\left(\frac{1-p_{k}}{p_{k}}\right)$ $1-t_{\text{cdf}}(T;\;\nu=5K+4)$

In the table, $K$ is the number of partial tests, and the remaining of the variables follow the usual notation (see the Table 1 in the paper for the complete description). Many of these combining functions were proposed over the years for applications such as meta-analyses, and many of them assume independence between the tests being combined, and will give incorrect p-values if such assumption is not met. In the NPC, lack of dependence is not a problem, even if these same functions are used: the synchronised permutations ensure that any dependence, if existing, is taken into account, and this is done so implicitly, with no need for explicit modelling.

The different combining functions lead to different rejection regions for the null hypothesis. For the four combining functions in the table above, the respective rejection regions are in the figure below.

The combining functions can be modified to allow combination of tests so as to favour hypotheses with concordant directions, or be modified for bi-directional tests. Click on the figure above for examples of these cases (again, see the paper for the complete details).

## Two problems, one solution

The multiple testing problem is well known in brain imaging: as an image comprises thousands of voxels/vertices/faces, correction is necessary. Bonferroni is in general too conservative, and various other approaches have been proposed, such as the random field theory. Permutation tests provide control over the familywise error rate (FWER) for the multiple tests across space, requiring only the assumption of exchangeability. This is all well known; see Nichols and Hayasaka (2003) and Winkler et al. (2014) for details.

However, another type of multiple testing is also common: analyses that test multiple hypotheses using the same model, multiple pairwise group comparisons, multiple and distinct models, studies using multiple modalities, that mix imaging and non-imaging data, that consider multiple processing pipelines, and even multiple multivariate analyses. All these common cases also need multiple testing correction. We call this multiple testing problem MTP-II, to discern it from the well known multiple testing problem across space, described above, which we term MTP-I.

One of the many combining functions possible with NPC, the one proposed by Tippett (1931), has a further property that makes it remarkably interesting. The Tippett function uses the smallest p-value across partial tests as its test statistic. Alternatively, if all statistics are comparable, it can be formulated in terms of the maximum statistic. It turns out that the distribution of the maximum statistic across a set of tests is also the distribution that can be used in a closed testing procedure (Marcus et al., 1976) to correct for the familywise error rate (FWER) using resampling methods, such as permutation. In the context of joint inference, FWER-correction can also be seen as an UIT. Thus, NPC offers a link between combination of multiple tests, and correction for multiple tests, in both cases regardless of any dependence between such tests.

This means that the MTP-II, for which correction in the parametric realm is either non-existing or fiendishly difficult, can be accommodated easily. It requires no explicit modelling of the dependence between the tests, and the resulting error rates are controlled exactly at the test level, adding rigour to what otherwise could lead to an excess of false positives without correction, or be overly conservative if a naïve correction such as Bonferroni were attempted.

## Modifying for imaging applications

As originally proposed, in practice NPC cannot be used in brain imaging. As the statistics for all partial tests for all permutations need to be recorded, an enormous amount of space for data storage is necessary. Even if storage space were not a problem, the discreteness of the p-values for the partial tests is problematic when correcting for multiple testing, because with thousands of tests in an image, ties are likely to occur, further causing ties among the combined statistics. If too many tests across an image share the same most extreme statistic, correction for the MTP-I, while still valid, becomes less powerful (Westfall and Young, 1993; Pantazis et al., 2005). The most obvious workaround — run an ever larger number of permutations to break the ties — may not be possible for small sample sizes, or when possible, requires correspondingly larger data storage.

The solution is loosely based on the direct combination of the test statistics, by converting the test statistics of the partial tests to values that behave as p-values, using the asymptotic distribution of the statistics for the partial tests. We call these as “u-values”, in order to emphasise that they are not meant to be read or interpreted as p-values, but rather as transitional values that allow combinations that otherwise would not be possible.

For spatial statistics, the asymptotic distribution of the combined statistic is used, this time to produce a z-score, which can be subjected to the computation of cluster extent, cluster mass, and/or threshold-free cluster enhancement (TFCE; Smith and Nichols, 2009). A flow chart of the modified algorithm is shown below. Click to see it side-by-side with the original.

## More power, fewer assumptions

One of the most remarkable features of NPC is that the synchronised permutations implicitly account for the dependence structure among the partial tests. This means that even combining methods originally derived under the assumption of independence can be used when such independence is untenable. As the p-values are assessed via permutations, distributional restrictions are likewise not necessary, liberating NPC from most assumptions that thwart parametric methods in general. This renders NPC a good alternative to classical multivariate tests, such as MANOVA, MANCOVA, and Hotelling’s T2 tests: each of the response variables can be seen as an univariate partial test in the context of the combination, but without the assumptions that are embodied in these old multivariate tests.

As if all the above were not already sufficient, NPC is also more powerful than such classical multivariate tests. This refers to its finite sample consistency property, that is, even with fixed sample size, as the number of modalities being combined increases, the power of the test also increases. The power of classical multivariate tests, however, increases up to a certain point, then begins to decrease, eventually reaching zero when the number of combining variables match the sample size.

The figure below summarises the analysis of a subset of the subjects of a published FMRI study (Brooks et al, 2005) in which painful stimulation was applied to the face, hand, and foot of 12 subjects. Using permutation tests separately, no results could be identified for any of the three types of stimulation. A simple multivariate test, the Hotelling’s T2 test, even assessed using permutations, did not reveal any effect of stimulation either. The NPC results, however, suggest involvement of large portions of the anterior insula and secondary somatosensory cortex. The Fisher, Stouffer and Mudholkar–George combining functions were particularly successful in recovering a small area of activity in the midbrain and periaqueductal gray area, which would be expected from previous studies on pain, but that could not be located from the original, non-combined data.

Detailed assessment of power, using variable number of modalities, and of modalities containing signal, is shown in the paper.

## Combinations or conjunctions?

Combination, as done via NPC, is different than conjunctions (Nichols et al., 2005) in the following: in the combination, one seeks for aggregate significance across partial tests, without the need that any individual study is necessarily significant. In the conjunction, it is necessary that all of them, with no exception, is significant. As indicated above, the NPC forms an union-intersection test (UIT; Roy, 1953), whereas the conjunctions form an intersection-union test (IUT; Berger, 1982). The former can be said to be significant if any (or an aggregate) of the partial tests is significant, whereas the latter is significant if all the partial tests are.

## Availability

The NPC, with the modifications for brain imaging, is available in the tool PALM — Permutation Analysis of Linear Models. It runs in either Matlab or Octave, and is free (GPL).

## References

Contributed to this post: Tom Nichols.

# Permutation tests in the Human Connectome Project

Permutation tests are known to be superior to parametric tests: they are based on only few assumptions, essentially that the data are exchangeable, and allow the correction for the multiplicity of tests and the use of various non-standard statistics. The exchangeability assumption allows data to be permuted whenever their joint distribution remains unaltered. Usually this means that each observation needs to be independent from the others.

In many studies, however, there are repeated measurements on the same subjects, which violates exchangeability: clearly, the various measurements obtained from a given subject are not independent from each other. In the Human Connectome Project (HCP) (Van Essen et al, 2012; 2013; see references at the end), subjects are sampled along with their siblings (most of them are twins), such that independence cannot be guaranteed either.

In Winkler et al. (2014), certain structured types of non-independence in brain imaging were addressed through the definition of exchangeability blocks (EBs). Observations within EB can be shuffled freely or, alternatively, the EBs themselves can be shuffled as a whole. This allows various designs that otherwise could not be assessed through permutations.

The same idea can be generalised for blocks that are nested within other blocks, in a multi-level fashion. In the paper Multi-level Block Permutation (Winkler et al., 2015) we presented a method that allows blocks to be shuffled a whole, and inside them, sub-blocks are further allowed to be shuffled, in a recursive process. The method is flexible enough to accommodate permutations, sign-flippings (sometimes also called “wild bootstrap”), and permutations together with sign-flippings.

In particular, this permutation scheme allows the data of the HCP to be analysed via permutations: subjects are allowed to be shuffled with their siblings while keeping the joint distribution intra-sibship maintained. Then each sibship is allowed to be shuffled with others of the same type.

In the paper we show that the error type I is controlled at the nominal level, and the power is just marginally smaller than that would be obtained by permuting freely if free permutation were allowed. The more complex the block structure, the larger the reductions in power, although with large sample sizes, the difference is barely noticeable.

Importantly, simply ignoring family structure in designs as this causes the error rates not to be controlled, with excess false positives, and invalid results. We show in the paper examples of false positives that can arise, even after correction for multiple testing, when testing associations between cortical thickness, cortical area, and measures of body size as height, weight, and body-mass index, all of them highly heritable. Such false positives can be avoided with permutation tests that respect the family structure.

The figure at the top shows how the subjects of the HCP (terminal dots, shown in white colour) can be shuffled or not, while respecting the family structure. Blue dots indicate branches that can be permuted, whereas red dots indicate branches that cannot (see the main paper for details). This diagram includes 232 subjects of an early public release of HCP data. The tree on the left considers dizygotic twins as a category on their own, i.e., that cannot be shuffled with ordinary siblings, whereas the tree on the right considers dizygotic twins as ordinary siblings.

The first applied study using our strategy has just appeared. The method is implemented in the freely available package PALM — Permutation Analysis of Linear Models, and a set of practical steps to use it with actual HCP data is available here.

# Understanding the kinship matrix

Coefficients to assess the genetic resemblance between individuals were presented in the last post. Among these, the coefficient of kinship, $\phi$, is probably the most interesting. It gives a probabilistic estimate that a random gene from a given subject $i$ is identical by descent (ibd) to a gene in the same locus from a subject $j$. For $N$ subjects, these probabilities can be assembled in a $N \times N$ matrix termed kinship matrix, usually represented as $\mathbf{\Phi}$, that has elements $\phi_{ij}$, and that can be used to model the covariance between individuals in quantitative genetics.

Consider the pedigree in the figure below, consisted of 14 subjects:

The corresponding kinship matrix, already multiplied by two to indicate expected covariances between subjects (i.e., $2\cdot\mathbf{\Phi}$), is:

Note that the diagonal elements can have values above unity, given the consanguineous mating in the family (between s09 and s12, indicated by the double line in the pedigree diagram).

In the next post, details on how the kinship matrix can be used investigate heritabilities, genetic correlations, and to perform association studies will be presented.

# All GLM formulas

It’s so often that we find ourselves in the need to quickly compute a statistic for a certain dataset, but finding the formulas is not always direct. Using a statistical software is helpful, but it often also happens that the reported results are not exactly what one may believe it represents. Moreover, even if using these packages, it is always good to have in mind the meaning of statistics and how they are computed. Here the formulas for the most commonly used statistics with the general linear model (glm) are presented, all in matrix form, that can be easily implemented in Octave or Matlab.

## I — Model

We consider two models, one univariate, another multivariate. The univariate is a particular case of the multivariate, but for univariate problems, it is simpler to use the smaller, particular case.

## Univariate model

The univariate glm can be written as:

$\mathbf{y} = \mathbf{M}\boldsymbol{\psi} + \boldsymbol{\epsilon}$

where $\mathbf{y}$ is the $N \times 1$ vector of observations, $\mathbf{M}$ is the $N \times s$ matrix of explanatory variables, $\boldsymbol{\psi}$ is the $s \times 1$ vector of regression coefficients, and $\boldsymbol{\epsilon}$ is the $N \times 1$ vector of residuals.

The null hypothesis can be stated as $\mathcal{H}_0 : \mathbf{C}'\boldsymbol{\psi} = 0$, where $\mathbf{C}$ is a $s \times s'$ matrix that defines a contrast of regression coefficients, satisfying $\mathsf{rank}(\mathbf{C}) = s'$ and $1 \geqslant s' \geqslant s$.

## Multivariate model

The multivariate glm can be written as:

$\mathbf{Y} = \mathbf{M}\boldsymbol{\Psi} + \boldsymbol{\epsilon}$

where $\mathbf{Y}$ is the $N \times q$ vector of observations, $\mathbf{M}$ is the $N \times s$ matrix of explanatory variables, $\boldsymbol{\Psi}$ is the $s \times q$ vector of regression coefficients, and $\boldsymbol{\epsilon}$ is the $N \times q$ vector of residuals.

The null hypothesis can be stated as $\mathcal{H}_0 : \mathbf{C}'\boldsymbol{\Psi}\mathbf{D} = 0$, where $\mathbf{C}$ is defined as above, and $\mathbf{D}$ is a $q \times q'$ matrix that defines a contrast of observed variables, satisfying $\mathsf{rank}(\mathbf{D}) = q'$ and $1 \geqslant q' \geqslant q$.

## II — Estimation of parameters

In the model, the unknowns of interest are the values arranged in $\boldsymbol{\Psi}$. These can be estimated as:

$\boldsymbol{\hat{\Psi}} = (\mathbf{M}'\mathbf{M})^{+}(\mathbf{M}'\mathbf{Y})$

where the $^{+}$ represents a pseudo-inverse. The residuals can be computed as:

$\boldsymbol{\hat{\epsilon}} = \mathbf{Y} - \mathbf{M}\boldsymbol{\hat{\Psi}}$

The above also applies to the univariate case ($\mathbf{y}$ is a particular case of $\mathbf{Y}$, and $\boldsymbol{\psi}$ of $\boldsymbol{\Psi}$).

## Coefficient of determination, R2

The following is the fraction of the variance explained by the part of the model determined by the contrast. It applies to mean-centered data and design, i.e., $\tilde{\mathbf{y}}=\mathbf{y}-\bar{y}$ and $\tilde{\mathbf{M}}=\mathbf{M}-\bar{\mathbf{m}}$.

$R^2 = \dfrac{\boldsymbol{\hat{\psi}}'\mathbf{C} \left(\mathbf{C}'(\tilde{\mathbf{M}}'\tilde{\mathbf{M}})^{-1}\mathbf{C} \right)^{-1} \mathbf{C}'\boldsymbol{\hat{\psi}}}{\tilde{\mathbf{y}}'\tilde{\mathbf{y}}}$

Note that the portion of the variance explained by nuisance variables (if any) remains in the denominator. To have these taken into account, consider the squared partial correlation coefficient, or Pillai’s trace with univariate data, both described further down.

## Pearson’s correlation coefficient

When $\mathsf{rank}\left(\mathbf{C}\right) = 1$, the multiple correlation coefficient can be computed from the $R^2$ statistic as:

$R = \mathsf{sign}\left(\mathbf{C}'\boldsymbol{\hat{\psi}}\right)\sqrt{R^2}$

This value should not be confused, even in the presence of nuisance, with the partial correlation coefficient (see below).

## Student’s t statistic

If $\mathsf{rank}\left(\mathbf{C}\right) = 1$, the Student’s $t$ statistic can be computed as:

$t = \boldsymbol{\hat{\psi}}'\mathbf{C} \left(\mathbf{C}'(\mathbf{M}'\mathbf{M})^{-1}\mathbf{C} \right)^{-\frac{1}{2}} \left/ \sqrt{\dfrac{\boldsymbol{\hat{\epsilon}}'\boldsymbol{\hat{\epsilon}}}{N-\mathsf{rank}\left(\mathbf{M}\right)}} \right.$

## F statistic

The $F$ statistic can be computed as:

$F = \left.\dfrac{\boldsymbol{\hat{\psi}}'\mathbf{C} \left( \mathbf{C}'(\mathbf{M}'\mathbf{M})^{-1}\mathbf{C} \right)^{-1} \mathbf{C}'\boldsymbol{\hat{\psi}}}{\mathsf{rank}\left(\mathbf{C} \right)} \right/ \dfrac{\boldsymbol{\hat{\epsilon}}'\boldsymbol{\hat{\epsilon}}}{N-\mathsf{rank}\left(\mathbf{M}\right)}$

## Aspin—Welch v

If homoscedastic variances cannot be assumed, and $\mathsf{rank}\left(\mathbf{C}\right) = 1$, this is equivalent to the Behrens—Fisher problem, and the Aspin—Welch’s $t$ statistic can be computed as:

$v = \boldsymbol{\hat{\psi}}'\mathbf{C} \left(\mathbf{C}'(\mathbf{M}'\mathbf{W}\mathbf{M})^{-1}\mathbf{C} \right)^{-\frac{1}{2}}$

where $\mathbf{W}$ is a diagonal matrix that has elements:

$W_{nn}=\dfrac{\sum_{n' \in g_{n}}R_{n'n'}}{\boldsymbol{\hat{\epsilon}}_{g_{n}}'\boldsymbol{\hat{\epsilon}}_{g_{n}}}$

and where $R_{n'n'}$ are the $n'$ diagonal elements of the residual forming matrix $\mathbf{R} = \mathbf{I}-\mathbf{M}\mathbf{M}^{+}$, and $g_{n}$ is the variance group to which the $n$-th observation belongs.

## Generalised G statistic

If variances cannot be assumed to be the same across all observations, a generalisation of the $F$ statistic can be computed as:

$G = \dfrac{\boldsymbol{\hat{\psi}}'\mathbf{C} \left(\mathbf{C}'(\mathbf{M}'\mathbf{W}\mathbf{M})^{-1}\mathbf{C} \right)^{-1} \mathbf{C}'\boldsymbol{\hat{\psi}}}{\Lambda \cdot \mathsf{rank}\left(\mathbf{C}\right)}$

where $\mathbf{W}$ is defined as above, and the remaining denominator term, $\Lambda$, is given by:

$\Lambda = 1+\frac{2(s-1)}{s(s+2)}\sum_{g} \frac{1}{\sum_{n \in g}R_{nn}} \left(1-\frac{\sum_{n \in g}W_{nn}}{\mathsf{trace}\left(\mathbf{W}\right)}\right)^2$

There is another post on the G-statistic (here).

## Partial correlation coefficient

When $\mathsf{rank}\left(\mathbf{C}\right) = 1$, the partial correlation can be computed from the Student’s $t$ statistic as:

$r = \mathsf{sign}\left(t\right)\sqrt{\dfrac{t^2}{N-\mathsf{rank}\left(\mathbf{M}\right)+t^2}}$

The square of the partial correlation corresponds to Pillai’s trace applied to an univariate model, and it can also be computed from the $F$-statistic as:

$r^2 = \dfrac{F}{\frac{N-\mathsf{rank}\left(\mathbf{M}\right)}{\mathsf{rank}\left(\mathbf{C}\right)}+F}$

Likewise, if $r$ is known, the formula can be solved for $t$:

$t = \dfrac{r\sqrt{N-\mathsf{rank}\left(\mathbf{M}\right)}}{\sqrt{1-r^2}}$

or for $F$:

$F = \dfrac{r^2}{1-r^2}\times\dfrac{N-\mathsf{rank}\left(\mathbf{M}\right)}{\mathsf{rank}\left(\mathbf{C}\right)}$

The partial correlation can also be computed at once for all variables vs. all other variables as follows. Let $\mathbf{A} = \left[\mathbf{y}\; \mathbf{M}\right]$, and $\mathsf{r}\left(\mathbf{A}\right)$ be the inverse of the correlation matrix of the columns of $\mathbf{A}$, and $\mathsf{d}\left(\cdot\right)$ the diagonal operator, that returns a column vector with the diagonal entries of a square matrix. Then the matrix with the partial correlations is:

$\mathbf{r} = -\mathsf{r}\left(\mathbf{A}\right) \odot \left(\mathsf{d}\left(\mathsf{r}\left(\mathbf{A}\right)\right)\mathsf{d}\left(\mathsf{r}\left(\mathbf{A}\right)\right)'\right)^{-\frac{1}{2}}$

where $\odot$ is the Hadamard product, and the power “$-\frac{1}{2}$” is taken elementwise (i.e., not matrix power).

## IV – Multivariate statistics

For the multivariate statistics, define generically $\mathbf{E} = (\boldsymbol{\hat{\epsilon}}\mathbf{D})'(\boldsymbol{\hat{\epsilon}}\mathbf{D})$ as the sums of the products of the residuals, and $\mathbf{H}=(\mathbf{C}'\boldsymbol{\hat{\Psi}}\mathbf{D})' \left(\mathbf{C}'(\mathbf{M}'\mathbf{M})^{-1}\mathbf{C} \right)^{-1} (\mathbf{C}'\boldsymbol{\hat{\Psi}}\mathbf{D})$ as the sums of products of the hypothesis. In fact, the original model can be modified as $\tilde{\mathbf{Y}} = \mathbf{M}\tilde{\boldsymbol{\Psi}} + \tilde{\boldsymbol{\epsilon}}$, where $\tilde{\mathbf{Y}}=\mathbf{Y}\mathbf{D}$, $\tilde{\boldsymbol{\Psi}} = \boldsymbol{\Psi}\mathbf{D}$ and $\tilde{\boldsymbol{\epsilon}}=\boldsymbol{\epsilon}\mathbf{D}$.

If $\mathsf{rank}\left(\mathbf{D}\right)=1$, this is an univariate model, otherwise it remains multivariate, although $\mathbf{D}$ can be omitted from the formulas. From now on this simplification is adopted, so that $\mathbf{E} = \boldsymbol{\hat{\epsilon}}'\boldsymbol{\hat{\epsilon}}$ and $\mathbf{H}=\boldsymbol{\hat{\Psi}}'\mathbf{C} \left(\mathbf{C}'(\mathbf{M}'\mathbf{M})^{-1}\mathbf{C} \right)^{-1} \mathbf{C}'\boldsymbol{\hat{\Psi}}$.

## Hotelling T2

If $\mathsf{rank}\left(\mathbf{C}\right) = 1$, the Hotelling’s $T^2$ statistic can be computed as:

$T^2 = \mathbf{C}'\boldsymbol{\hat{\Psi}}\boldsymbol{\Sigma}^{-\frac{1}{2}}\left(\mathbf{C}'(\mathbf{M}'\mathbf{M})^{-1}\mathbf{C} \right)^{-1}\boldsymbol{\Sigma}^{-\frac{1}{2}} \boldsymbol{\hat{\Psi}}'\mathbf{C}$

where $\boldsymbol{\Sigma} = \mathbf{E}/\left(N-\mathsf{rank}\left(\mathbf{M}\right)\right)$

## Multivariate alternatives to the F statistic

Classical manova/mancova statistics can be based in the ratio between the sums of products of the hypothesis and the sums of products of the residuals, or the ratio between the sums of products of the hypothesis and the total sums of products. In other words, define:

$\begin{array}{ccl} \mathbf{A} &=& \mathbf{H}\mathbf{E}^{-1} = \mathbf{E}^{-\frac{1}{2}} \boldsymbol{\hat{\psi}}'\mathbf{C} \left(\mathbf{C}'(\mathbf{M}'\mathbf{M})^{-1}\mathbf{C} \right)^{-1} \mathbf{C}'\boldsymbol{\hat{\psi}}\mathbf{E}^{-\frac{1}{2}}\\ \mathbf{B} &=& \mathbf{H}\left(\mathbf{E}+\mathbf{H}\right)^{-1} \end{array}$

Let $\lambda_i$ be the eigenvalues of $\mathbf{A}$, and $\theta_i$ the eigenvalues of $\mathbf{B}$. Then:

• Wilks’ $\Lambda = \prod_{i} \dfrac{1}{1+\lambda_i} = \dfrac{|\mathbf{E}|}{|\mathbf{E}+\mathbf{H}|}$.
• Lawley–Hotelling’s trace: $\sum_i \lambda_i = \mathsf{trace}\left(\mathbf{A}\right)$.
• Pillai’s trace: $\sum_i \dfrac{\lambda_i}{1+\lambda_i} = \sum_i \theta_i = \mathsf{trace}\left(\mathbf{B}\right)$.
• Roy’s largest root (ii): $\lambda_1 = \mathsf{max}_i\left(\lambda_i\right) = \dfrac{\theta_1}{1-\theta_1}$ (analogous to $F$).
• Roy’s largest root (iii): $\theta_1 = \mathsf{max}_i\left(\theta_i\right) = \dfrac{\lambda_1}{1+\lambda_1}$ (analogous to $R^2$).

When $\mathsf{rank}\left(\mathbf{C}\right) = 1$, or when $\mathbf{Y}$ is univariate, or both, Lawley–Hotelling’s trace is equal to Roy’s (ii) largest root, Pillai’s trace is equal to Roy’s (iii) largest root, and Wilks’ $\Lambda$ added to Pillai’s trace equals to unity. The value $\rho_i=\sqrt{\theta_i}$ is the $i$-th canonical correlation.