# Understanding the kinship matrix

Coefficients to assess the genetic resemblance between individuals were presented in the last post. Among these, the coefficient of kinship, $\phi$, is probably the most interesting. It gives a probabilistic estimate that a random gene from a given subject $i$ is identical by descent (ibd) to a gene in the same locus from a subject $j$. For $N$ subjects, these probabilities can be assembled in a $N \times N$ matrix termed kinship matrix, usually represented as $\mathbf{\Phi}$, that has elements $\phi_{ij}$, and that can be used to model the covariance between individuals in quantitative genetics.

Consider the pedigree in the figure below, consisted of 14 subjects:

The corresponding kinship matrix, already multiplied by two to indicate expected covariances between subjects (i.e., $2\cdot\mathbf{\Phi}$), is:

Note that the diagonal elements can have values above unity, given the consanguineous mating in the family (between s09 and s12, indicated by the double line in the pedigree diagram).

In the next post, details on how the kinship matrix can be used investigate heritabilities, genetic correlations, and to perform association studies will be presented.

# Genetic resemblance between relatives

How similar?

The degree of relationship between two related individuals can be estimated by the probability that a gene in one subject is identical by descent to the corresponding gene (i.e., in the same locus) in the other. Two genes are said to be identical by descent (ibd) if both are copies of the same ancestral gene. Genes that are not ibd may still be identical through separate mutations, and be therefore identical by state (ibs), though these will not be considered in what follows.

The coefficients below were introduced by Jacquard in 1970, in a book originally published in French, and translated to English in 1974. A similar content appeared in an article by the same author in the journal Biometrics in 1972 (see the references at the end).

## Coefficients of identity

Consider a particular autosomal gene $G$. Each individual has two copies, one from paternal, another from maternal origin; these can be indicated as $G_i^P$ and $G_i^M$ for individual $i$. There are 15 exactly distinct ways (states) in which the $G$ can be identical or not identical between two individuals, as shown in the figure below.

To each of these states $S_{1, \ldots , 15}$, a respective probability $\delta_{1, \ldots , 15}$ can be assigned; these are called coefficients of identity by descent. These probabilities can be calculated at every generation following very elementary rules. For most problems, however, the distinction between paternal and maternal origin of a gene is irrelevant, and some of the above states are equivalent to others. If these are condensed, we can retain 9 distinct ways, shown in the figure below:

As before, to each of these states $\Sigma_{1, \ldots , 9}$, a respective probability $\Delta_{1, \ldots , 9}$ can be assigned; these are called condensed coefficients of identity by descent, and relate to the former as:

• $\Delta_1 = \delta_1$
• $\Delta_2 = \delta_6$
• $\Delta_3 = \delta_2 + \delta_3$
• $\Delta_4 = \delta_7$
• $\Delta_5 = \delta_4 + \delta_5$
• $\Delta_6 = \delta_8$
• $\Delta_7 = \delta_9 + \delta_{12}$
• $\Delta_8 = \delta_{10} + \delta_{11} + \delta_{13} + \delta_{14}$
• $\Delta_9 = \delta_{15}$

A similar method was proposed by Cotterman (1940), in his highly influential but only much later published doctoral thesis. The $\Delta_9$, $\Delta_8$ and $\Delta_7$ correspond to his coefficients $k_0$, $k_1$ and $k_2$.

## Coefficient of kinship

The above refer to probabilities of finding particular genes as identical among subjects. However, a different coefficient can be defined for random genes: the probability that a random gene from subject $i$ is identical with a gene at the same locus from subject $j$ is the coefficient of kinship, and can be represented as $\phi_{ij}$:

• $\phi_{ij} = \Delta_1 + \frac{1}{2}(\Delta_3 + \Delta_5 + \Delta_7) + \frac{1}{4}\Delta_8$

If $i$ and $j$ are in fact the same individual, then $\phi_{ii}$ is the kinship of a subject with himself. Two genes taken from the same individual can either be the same gene (probability $\frac{1}{2}$ of being the same) or be the genes inherited from father and mother, in which case the probability is given by the coefficient of kinship between the parents. In other words, $\phi_{ii} = \frac{1}{2} + \frac{1}{2}\phi_{\text{FM}}$. If both parents are unrelated, $\phi_{\text{FM}}=0$, such that the kinship of a subject with himself is $\phi_{ii} = \frac{1}{2}$.

The value of $\phi_{ij}$ can be determined from the number of generations up to a common ancestor $k$. A random gene from individual $i$ can be identical to a random gene from individual $j$ in the same locus if both comes from the common ancestor $k$, an event that can happen if either (1) both are copies of the gene in $k$, or (2) if they are copies of different genes in $k$, but $k$ is inbred; this has probability $\frac{1}{2}f_k$ (see below about the coefficient of inbreeding, $f$). Thus, if there are $m$ generations between $i$ and $k$, and $n$ generations between $j$ and $k$, the coefficient of kinship can be computed as $\phi_{ij} = \left(\frac{1}{2}\right)^{m+n+1}(1+f_k)$. If $i$ and $j$ can have more than one common ancestor, then there are more than one line of descent possible, and the kinship is determined by integrating over all such possible $K$ common ancestors:

• $\phi_{ij} = \sum_{k=1}^K \left(\frac{1}{2}\right)^{m_k+n_k+1}(1+f_k)$

For a set of subjects, the pairwise coefficients of kinship $\phi_{ij}$ can be arranged in a square matrix $\boldsymbol{\Phi}$, and used to model the covariance between subjects as $2\cdot\boldsymbol{\Phi}$ (see here).

## Coefficient of inbreeding

The coefficient of inbreeding $f$ of a given subject $i$ is the coefficient of kinship between their parents. While the above coefficients provide information about pairs of individuals, the coefficient of inbreeding gives information about a particular subject. Yet, $f_i$ can be computed from the coefficients of identity:

• $f_{i} = \Delta_1 + \Delta_2 + \Delta_3 + \Delta_4$
• $f_{j} = \Delta_1 + \Delta_2 + \Delta_5 + \Delta_6$

Note that all these coefficients are based on probabilities, but it is now possible to identify the actual presence of a particular gene using marker data. Also note that while the illustrations above suggest application to livestock, the same applies to studies of human populations.

## Some particular cases

The computation of the above coefficients can be done using algorithms, and are done automatically by software that allow analyses of pedigree data, such as solar. Some common particular cases are shown below:

Relationship $\Delta_1$ $\Delta_2$ $\Delta_3$ $\Delta_4$ $\Delta_5$ $\Delta_6$ $\Delta_7$ $\Delta_8$ $\Delta_9$ $\phi_{ij}$
Self $0$ $0$ $0$ $0$ $0$ $0$ $1$ $0$ $0$ $\frac{1}{2}$
Parent-offspring $0$ $0$ $0$ $0$ $0$ $0$ $0$ $1$ $0$ $\frac{1}{4}$
Half sibs $0$ $0$ $0$ $0$ $0$ $0$ $0$ $\frac{1}{2}$ $\frac{1}{2}$ $\frac{1}{8}$
Full sibs/dizygotic twins $0$ $0$ $0$ $0$ $0$ $0$ $\frac{1}{4}$ $\frac{1}{2}$ $\frac{1}{4}$ $\frac{1}{4}$
Monozygotic twins $0$ $0$ $0$ $0$ $0$ $0$ $1$ $0$ $0$ $\frac{1}{2}$
First cousins $0$ $0$ $0$ $0$ $0$ $0$ $0$ $\frac{1}{4}$ $\frac{3}{4}$ $\frac{1}{16}$
Double first cousins $0$ $0$ $0$ $0$ $0$ $0$ $\frac{1}{16}$ $\frac{6}{16}$ $\frac{9}{16}$ $\frac{1}{8}$
Second cousins $0$ $0$ $0$ $0$ $0$ $0$ $0$ $\frac{1}{16}$ $\frac{15}{16}$ $\frac{1}{64}$
Uncle-nephew $0$ $0$ $0$ $0$ $0$ $0$ $0$ $\frac{1}{2}$ $\frac{1}{2}$ $\frac{1}{8}$
Offspring of sib-matings $\frac{1}{16}$ $\frac{1}{32}$ $\frac{1}{8}$ $\frac{1}{32}$ $\frac{1}{8}$ $\frac{1}{32}$ $\frac{7}{32}$ $\frac{5}{16}$ $\frac{1}{16}$ $\frac{3}{8}$

## References

• Cotterman C. A calculus for statistico-genetics. 1940. PhD Thesis. Ohio State University.
• Jacquard, A. Structures génétiques des populations. Masson, Paris, France, 1970, later translated and republished as Jacquard, A. The genetic structure of populations. Springer, Heidelberg, 1974.
• Jacquard A. Genetic information given by a relative. Biometrics. 1972;28(4):1101-1114.

The photograph at the top (sheep) is in public domain.

# All GLM formulas

It’s so often that we find ourselves in the need to quickly compute a statistic for a certain dataset, but finding the formulas is not always direct. Using a statistical software is helpful, but it often also happens that the reported results are not exactly what one may believe it represents. Moreover, even if using these packages, it is always good to have in mind the meaning of statistics and how they are computed. Here the formulas for the most commonly used statistics with the general linear model (glm) are presented, all in matrix form, that can be easily implemented in Octave or Matlab.

## I — Model

We consider two models, one univariate, another multivariate. The univariate is a particular case of the multivariate, but for univariate problems, it is simpler to use the smaller, particular case.

## Univariate model

The univariate glm can be written as:

$\mathbf{y} = \mathbf{M}\boldsymbol{\psi} + \boldsymbol{\epsilon}$

where $\mathbf{y}$ is the $N \times 1$ vector of observations, $\mathbf{M}$ is the $N \times s$ matrix of explanatory variables, $\boldsymbol{\psi}$ is the $s \times 1$ vector of regression coefficients, and $\boldsymbol{\epsilon}$ is the $N \times 1$ vector of residuals.

The null hypothesis can be stated as $\mathcal{H}_0 : \mathbf{C}'\boldsymbol{\psi} = 0$, where $\mathbf{C}$ is a $s \times s'$ matrix that defines a contrast of regression coefficients, satisfying $\mathsf{rank}(\mathbf{C}) = s'$ and $1 \geqslant s' \geqslant s$.

## Multivariate model

The multivariate glm can be written as:

$\mathbf{Y} = \mathbf{M}\boldsymbol{\Psi} + \boldsymbol{\epsilon}$

where $\mathbf{Y}$ is the $N \times q$ vector of observations, $\mathbf{M}$ is the $N \times s$ matrix of explanatory variables, $\boldsymbol{\Psi}$ is the $s \times q$ vector of regression coefficients, and $\boldsymbol{\epsilon}$ is the $N \times q$ vector of residuals.

The null hypothesis can be stated as $\mathcal{H}_0 : \mathbf{C}'\boldsymbol{\Psi}\mathbf{D} = 0$, where $\mathbf{C}$ is defined as above, and $\mathbf{D}$ is a $q \times q'$ matrix that defines a contrast of observed variables, satisfying $\mathsf{rank}(\mathbf{D}) = q'$ and $1 \geqslant q' \geqslant q$.

## II — Estimation of parameters

In the model, the unknowns of interest are the values arranged in $\boldsymbol{\Psi}$. These can be estimated as:

$\boldsymbol{\hat{\Psi}} = (\mathbf{M}'\mathbf{M})^{+}(\mathbf{M}'\mathbf{Y})$

where the $^{+}$ represents a pseudo-inverse. The residuals can be computed as:

$\boldsymbol{\hat{\epsilon}} = \mathbf{Y} - \mathbf{M}\boldsymbol{\hat{\Psi}}$

The above also applies to the univariate case ($\mathbf{y}$ is a particular case of $\mathbf{Y}$, and $\boldsymbol{\psi}$ of $\boldsymbol{\Psi}$).

## Coefficient of determination, R2

The following is the fraction of the variance explained by the part of the model determined by the contrast. It applies to mean-centered data and design, i.e., $\tilde{\mathbf{y}}=\mathbf{y}-\bar{y}$ and $\tilde{\mathbf{M}}=\mathbf{M}-\bar{\mathbf{m}}$.

$R^2 = \dfrac{\boldsymbol{\hat{\psi}}'\mathbf{C} \left(\mathbf{C}'(\tilde{\mathbf{M}}'\tilde{\mathbf{M}})^{-1}\mathbf{C} \right)^{-1} \mathbf{C}'\boldsymbol{\hat{\psi}}}{\tilde{\mathbf{y}}'\tilde{\mathbf{y}}}$

Note that the portion of the variance explained by nuisance variables (if any) remains in the denominator. To have these taken into account, consider the squared partial correlation coefficient, or Pillai’s trace with univariate data, both described further down.

## Pearson’s correlation coefficient

When $\mathsf{rank}\left(\mathbf{C}\right) = 1$, the multiple correlation coefficient can be computed from the $R^2$ statistic as:

$R = \mathsf{sign}\left(\mathbf{C}'\boldsymbol{\hat{\psi}}\right)\sqrt{R^2}$

This value should not be confused, even in the presence of nuisance, with the partial correlation coefficient (see below).

## Student’s t statistic

If $\mathsf{rank}\left(\mathbf{C}\right) = 1$, the Student’s $t$ statistic can be computed as:

$t = \boldsymbol{\hat{\psi}}'\mathbf{C} \left(\mathbf{C}'(\mathbf{M}'\mathbf{M})^{-1}\mathbf{C} \right)^{-\frac{1}{2}} \left/ \sqrt{\dfrac{\boldsymbol{\hat{\epsilon}}'\boldsymbol{\hat{\epsilon}}}{N-\mathsf{rank}\left(\mathbf{M}\right)}} \right.$

## F statistic

The $F$ statistic can be computed as:

$F = \left.\dfrac{\boldsymbol{\hat{\psi}}'\mathbf{C} \left( \mathbf{C}'(\mathbf{M}'\mathbf{M})^{-1}\mathbf{C} \right)^{-1} \mathbf{C}'\boldsymbol{\hat{\psi}}}{\mathsf{rank}\left(\mathbf{C} \right)} \right/ \dfrac{\boldsymbol{\hat{\epsilon}}'\boldsymbol{\hat{\epsilon}}}{N-\mathsf{rank}\left(\mathbf{M}\right)}$

## Aspin—Welch v

If homoscedastic variances cannot be assumed, and $\mathsf{rank}\left(\mathbf{C}\right) = 1$, this is equivalent to the Behrens—Fisher problem, and the Aspin—Welch’s $t$ statistic can be computed as:

$v = \boldsymbol{\hat{\psi}}'\mathbf{C} \left(\mathbf{C}'(\mathbf{M}'\mathbf{W}\mathbf{M})^{-1}\mathbf{C} \right)^{-\frac{1}{2}}$

where $\mathbf{W}$ is a diagonal matrix that has elements:

$W_{nn}=\dfrac{\sum_{n' \in g_{n}}R_{n'n'}}{\boldsymbol{\hat{\epsilon}}_{g_{n}}'\boldsymbol{\hat{\epsilon}}_{g_{n}}}$

and where $R_{n'n'}$ are the $n'$ diagonal elements of the residual forming matrix $\mathbf{R} = \mathbf{I}-\mathbf{M}\mathbf{M}^{+}$, and $g_{n}$ is the variance group to which the $n$-th observation belongs.

## Generalised G statistic

If variances cannot be assumed to be the same across all observations, a generalisation of the $F$ statistic can be computed as:

$G = \dfrac{\boldsymbol{\hat{\psi}}'\mathbf{C} \left(\mathbf{C}'(\mathbf{M}'\mathbf{W}\mathbf{M})^{-1}\mathbf{C} \right)^{-1} \mathbf{C}'\boldsymbol{\hat{\psi}}}{\Lambda \cdot \mathsf{rank}\left(\mathbf{C}\right)}$

where $\mathbf{W}$ is defined as above, and the remaining denominator term, $\Lambda$, is given by:

$\Lambda = 1+\frac{2(s-1)}{s(s+2)}\sum_{g} \frac{1}{\sum_{n \in g}R_{nn}} \left(1-\frac{\sum_{n \in g}W_{nn}}{\mathsf{trace}\left(\mathbf{W}\right)}\right)^2$

There is another post on the G-statistic (here).

## Partial correlation coefficient

When $\mathsf{rank}\left(\mathbf{C}\right) = 1$, the partial correlation can be computed from the Student’s $t$ statistic as:

$r = \mathsf{sign}\left(t\right)\sqrt{\dfrac{t^2}{N-\mathsf{rank}\left(\mathbf{M}\right)+t^2}}$

The square of the partial correlation corresponds to Pillai’s trace applied to an univariate model, and it can also be computed from the $F$-statistic as:

$r^2 = \dfrac{F}{\frac{N-\mathsf{rank}\left(\mathbf{M}\right)}{\mathsf{rank}\left(\mathbf{C}\right)}+F}$.

Likewise, if $r$ is known, the formula can be solved for $t$:

$t = \dfrac{r\sqrt{N-\mathsf{rank}\left(\mathbf{M}\right)}}{\sqrt{1-r^2}}$

or for $F$:

$F = \dfrac{r^2}{1-r^2}\times\dfrac{N-\mathsf{rank}\left(\mathbf{M}\right)}{\mathsf{rank}\left(\mathbf{C}\right)}$

The partial correlation can also be computed at once for all variables vs. all other variables as follows. Let $\mathbf{A} = \left[\mathbf{y}\; \mathbf{M}\right]$, and $\mathsf{r}\left(\mathbf{A}\right)$ be the inverse of the correlation matrix of the columns of $\mathbf{A}$, and $\mathsf{d}\left(\cdot\right)$ the diagonal operator, that returns a column vector with the diagonal entries of a square matrix. Then the matrix with the partial correlations is:

$\mathbf{r} = -\mathsf{r}\left(\mathbf{A}\right) \odot \left(\mathsf{d}\left(\mathsf{r}\left(\mathbf{A}\right)\right)\mathsf{d}\left(\mathsf{r}\left(\mathbf{A}\right)\right)'\right)^{-\frac{1}{2}}$

where $\odot$ is the Hadamard product, and the power “$-\frac{1}{2}$” is taken elementwise (i.e., not matrix power).

## IV – Multivariate statistics

For the multivariate statistics, define generically $\mathbf{E} = (\boldsymbol{\hat{\epsilon}}\mathbf{D})'(\boldsymbol{\hat{\epsilon}}\mathbf{D})$ as the sums of the products of the residuals, and $\mathbf{H}=(\mathbf{C}'\boldsymbol{\hat{\Psi}}\mathbf{D})' \left(\mathbf{C}'(\mathbf{M}'\mathbf{M})^{-1}\mathbf{C} \right)^{-1} (\mathbf{C}'\boldsymbol{\hat{\Psi}}\mathbf{D})$ as the sums of products of the hypothesis. In fact, the original model can be modified as $\tilde{\mathbf{Y}} = \mathbf{M}\tilde{\boldsymbol{\Psi}} + \tilde{\boldsymbol{\epsilon}}$, where $\tilde{\mathbf{Y}}=\mathbf{Y}\mathbf{D}$, $\tilde{\boldsymbol{\Psi}} = \boldsymbol{\Psi}\mathbf{D}$ and $\tilde{\boldsymbol{\epsilon}}=\boldsymbol{\epsilon}\mathbf{D}$.

If $\mathsf{rank}\left(\mathbf{D}\right)=1$, this is an univariate model, otherwise it remains multivariate, although $\mathbf{D}$ can be omitted from the formulas. From now on this simplification is adopted, so that $\mathbf{E} = \boldsymbol{\hat{\epsilon}}'\boldsymbol{\hat{\epsilon}}$ and $\mathbf{H}=\boldsymbol{\hat{\Psi}}'\mathbf{C} \left(\mathbf{C}'(\mathbf{M}'\mathbf{M})^{-1}\mathbf{C} \right)^{-1} \mathbf{C}'\boldsymbol{\hat{\Psi}}$.

## Hotelling T2

If $\mathsf{rank}\left(\mathbf{C}\right) = 1$, the Hotelling’s $T^2$ statistic can be computed as:

$T^2 = \mathbf{C}'\boldsymbol{\hat{\Psi}}\boldsymbol{\Sigma}^{-\frac{1}{2}}\left(\mathbf{C}'(\mathbf{M}'\mathbf{M})^{-1}\mathbf{C} \right)^{-1}\boldsymbol{\Sigma}^{-\frac{1}{2}} \boldsymbol{\hat{\Psi}}'\mathbf{C}$

where $\boldsymbol{\Sigma} = \mathbf{E}/\left(N-\mathsf{rank}\left(\mathbf{M}\right)\right)$

## Multivariate alternatives to the F statistic

Classical manova/mancova statistics can be based in the ratio between the sums of products of the hypothesis and the sums of products of the residuals, or the ratio between the sums of products of the hypothesis and the total sums of products. In other words, define:

$\begin{array}{ccl} \mathbf{A} &=& \mathbf{H}\mathbf{E}^{-1} = \mathbf{E}^{-\frac{1}{2}} \boldsymbol{\hat{\psi}}'\mathbf{C} \left(\mathbf{C}'(\mathbf{M}'\mathbf{M})^{-1}\mathbf{C} \right)^{-1} \mathbf{C}'\boldsymbol{\hat{\psi}}\mathbf{E}^{-\frac{1}{2}}\\ \mathbf{B} &=& \mathbf{H}\left(\mathbf{E}+\mathbf{H}\right)^{-1} \end{array}$

Let $\lambda_i$ be the eigenvalues of $\mathbf{A}$, and $\theta_i$ the eigenvalues of $\mathbf{B}$. Then:

• Wilks’ $\Lambda = \prod_{i} \dfrac{1}{1+\lambda_i} = \dfrac{|\mathbf{E}|}{|\mathbf{E}+\mathbf{H}|}$.
• Lawley–Hotelling’s trace: $\sum_i \lambda_i = \mathsf{trace}\left(\mathbf{A}\right)$.
• Pillai’s trace: $\sum_i \dfrac{\lambda_i}{1+\lambda_i} = \sum_i \theta_i = \mathsf{trace}\left(\mathbf{B}\right)$.
• Roy’s largest root (ii): $\lambda_1 = \mathsf{max}_i\left(\lambda_i\right) = \dfrac{\theta_1}{1-\theta_1}$ (analogous to $F$).
• Roy’s largest root (iii): $\theta_1 = \mathsf{max}_i\left(\theta_i\right) = \dfrac{\lambda_1}{1+\lambda_1}$ (analogous to $R^2$).

When $\mathsf{rank}\left(\mathbf{C}\right) = 1$, or when $\mathbf{Y}$ is univariate, or both, Lawley–Hotelling’s trace is equal to Roy’s (ii) largest root, Pillai’s trace is equal to Roy’s (iii) largest root, and Wilks’ $\Lambda$ added to Pillai’s trace equals to unity. The value $\rho_i=\sqrt{\theta_i}$ is the $i$-th canonical correlation.

# The G-statistic

## Preliminaries

Consider the common analysis of a neuroimaging experiment. At each voxel, vertex, face or edge (or any other imaging unit), we have a linear model expressed as:

$\mathbf{Y} = \mathbf{M} \boldsymbol{\psi} + \boldsymbol{\epsilon}$

where $\mathbf{Y}$ contains the experimental data, $\mathbf{M}$ contains the regressors, $\boldsymbol{\psi}$ the regression coefficients, which are to be estimated, and $\boldsymbol{\epsilon}$ the residuals. For a linear null hypothesis $\mathcal{H}_0 : \mathbf{C}'\boldsymbol{\psi}=\mathbf{0}$, where $\mathbf{C}$ is a contrast. If $\mathsf{rank}\left(\mathbf{C}\right) = 1$, the Student’s t statistic can be calculated as:

$t = \boldsymbol{\hat{\psi}}'\mathbf{C} \left(\mathbf{C}'(\mathbf{M}'\mathbf{M})^{-1}\mathbf{C} \right)^{-\frac{1}{2}} \left/ \sqrt{\dfrac{\boldsymbol{\hat{\epsilon}}'\boldsymbol{\hat{\epsilon}}}{N-\mathsf{rank}\left(\mathbf{M}\right)}} \right.$

where the hat on $\boldsymbol{\hat{\psi}}$ and $\boldsymbol{\hat{\epsilon}}$ indicate that these are quantities estimated from the sample. If $\mathsf{rank}\left(\mathbf{C}\right) \geqslant 1$, the F statistic can be obtained as:

$F = \left.\dfrac{\boldsymbol{\hat{\psi}}'\mathbf{C} \left(\mathbf{C}'(\mathbf{M}'\mathbf{M})^{-1}\mathbf{C} \right)^{-1} \mathbf{C}'\boldsymbol{\hat{\psi}}}{\mathsf{rank}\left(\mathbf{C}\right)} \right/ \dfrac{\boldsymbol{\hat{\epsilon}}'\boldsymbol{\hat{\epsilon}}}{N-\mathsf{rank}\left(\mathbf{M}\right)}$

When $\mathsf{rank}\left(\mathbf{C}\right) = 1$, $F = t^2$. For either of these statistics, we can assess their significance by repeating the same fit after permuting $\mathbf{Y}$ or $\mathbf{M}$ (i.e., a permutation test), or by referring to the formula for the distribution of the corresponding statistic, which is available in most statistical software packages (i.e., a parametric test).

Permutation tests don’t depend on the same assumptions on which parametric tests are based. As some of these assumptions can be quite stringent in practice, permutation methods arguably should be preferred as a general framework for the statistical analysis of imaging data. At each permutation, a new statistic is computed and used to build its empirical distribution from which the p-values are obtained. In practice it’s not necessary to build the full distribution, and it suffices to increment a counter at each permutation. At the end, the counter is divided by the number of permutations to produce a p-value.

An example of a permutation distribution

Using permutation tests, correction for multiple testing using the family-wise error rate (fwer) is trivial: rather than build the permutation distribution at each voxel, a single distribution of the global maximum of the statistic across the image is constructed. Each permutation yields one maximum, that is used to build the distribution. Any dependence between the tests is implicitly captured, with not need to model it explicitly, nor to introduce even more assumptions, a problem that hinders methods such as the random field theory.

## Exchangeability blocks

Permutation is allowed if it doesn’t affect the joint distribution of the error terms, i.e., if the errors are exchangeable. Some types of experiments may involve repeated measurements or other kinds of dependency, such that exchangeability cannot be guaranteed between all observations. However, various cases of structured dependency can still be accommodated if sets (blocks) of observations are shuffled as a whole, or if shuffling happens only within set (block). It is not necessary to know or to model the exact dependence structure between observations, which is captured implicitly as long as the blocks are defined correctly.

Permutation within block.

Permutation of blocks as a whole.

The two figures above are of designs constructed using the fsl software package. In fsl, within-block permutation is available in randomise with the option -e, used to supply a file with the definition of blocks. For whole-block permutation, in addition to the option -e, the option --permuteBlocks needs to be supplied.

## The G-statistic

The presence of exchangeability blocks solves a problem, but creates another. Having blocks implies that observations may not be pooled together to produce a non-linear parameter estimate such as the variance. In other words: the mere presence of exchangeability blocks, either for shuffling within or as a whole, implies that the variances may not be the same across all observations, and a single estimate of this variance is likely to be inaccurate whenever the variances truly differ, or if the groups don’t have the same size. This also means that the F or t statistics may not behave as expected.

The solution is to use the block definitions and the permutation strategy is to define groups of observations that are known or assumed to have identical variances, and pool only the observations within group for variance estimation, i.e., to define variance groups (vgs).

The F-statistic, however, doesn’t allow such multiple groups of variances, and we need to resort to another statistic. In Winkler et al. (2014) we propose:

$G = \dfrac{\boldsymbol{\hat{\psi}}'\mathbf{C} \left(\mathbf{C}'(\mathbf{M}'\mathbf{W}\mathbf{M})^{-1}\mathbf{C} \right)^{-1} \mathbf{C}'\boldsymbol{\hat{\psi}}}{\Lambda \cdot s}$

where $\mathbf{W}$ is a diagonal matrix that has elements:

$W_{nn}=\dfrac{\sum_{n' \in g_{n}}R_{n'n'}}{\boldsymbol{\hat{\epsilon}}_{g_{n}}'\boldsymbol{\hat{\epsilon}}_{g_{n}}}$

and where $R_{n'n'}$ are the $n'$ diagonal elements of the residual forming matrix, and $g_{n}$ is the variance group to which the $n$-th observation belongs. The remaining denominator term, $\Lambda$, is given by (Welch, 1951):

$\Lambda = 1+\frac{2(s-1)}{s(s+2)}\sum_{g} \frac{1}{\sum_{n \in g}R_{nn}} \left(1-\frac{\sum_{n \in g}W_{nn}}{\mathsf{trace}\left(\mathbf{W}\right)}\right)^2$

where $s=\mathsf{rank}\left(\mathbf{C}\right)$. The matrix $\mathbf{W}$ can be seen as a weighting matrix, the square root of which normalises the model such that the errors have then unit variance and can be ignored. It can also be seen as being itself a variance estimator. In fact, it is the very same variance estimator proposed by Horn et al (1975).

The W matrix used with the G statistic. It is constructed from the estimated variances of the error terms.

The matrix $\mathbf{W}$ has a crucial role in making the statistic pivotal in the presence of heteroscedasticity. Pivotality means that the statistic has a sampling distribution that is not dependent on any unknown parameter. For imaging experiments, it’s important that the statistic has this property, otherwise correction for multiple testing that controls fwer will be inaccurate, or possibly invalid altogether.

When $\mathsf{rank}\left(\mathbf{C}\right)=1$, the t-equivalent to the G-statistic is $v = \boldsymbol{\hat{\psi}}'\mathbf{C} \left(\mathbf{C}'(\mathbf{M}'\mathbf{W}\mathbf{M})^{-1}\mathbf{C} \right)^{-\frac{1}{2}}$, which is the well known Aspin-Welch $v$-statistic for the Behrens-Fisher problem. The relationship between $v$ and G is the same as between t and F, i.e., when the rank of the contrast equals to one, the latter is simply the square of the former. The G statistic is a generalization of all these, and more, as we show in the paper, and summarise in the table below:

$\mathsf{rank}\left(\mathbf{C}\right) = 1$ $\mathsf{rank}\left(\mathbf{C}\right) > 1$
Homoscedastic errors, unrestricted exchangeability Square of Student’s t F-ratio
Homoscedastic within vg, restricted exchangeability Square of Aspin-Welch $v$ Welch’s $v^2$

In the absence of variance groups (i.e., all observations belong to the same vg), G and $v$ are equivalent to F and t respectively.

Although not typically necessary if permutation methods are to be preferred, approximate parametric p-values for the G-statistic can be computed from an F-distribution with $\nu_1=s$ and $\nu_2=2(s-1)/3/(\Lambda-1)$.

While the error rates are controlled adequately (a feature of permutation tests in general), the G-statistic offers excellent power when compared to the F-statistic, even when the assumptions of the latter are perfectly met. Moreover, by preserving pivotality, it is an adequate statistic to control of the error rate in the in the presence of multiple tests.

In this post, the focus is in using G for imaging data, but of course, it can be used for any dataset in which a linear model where variances cannot be assumed to be the same is used, i.e., when heteroscedasticity is or could be present.

Note that the G-statistic has nothing to do with the G-test. It is named as this for being a generalisation over various tests, including the commonly used t and F tests, as shown above.

## Main reference

The core reference and results for the G-statistic have just been published in Neuroimage:

## Other references

The two other references cited, which are useful to understand the variance estimator and the parametric approximation are:

# Biased, as it should be

When we run a permutation test, what we do is to first compute the statistic for the unpermuted model, then shuffle the data randomly, compute a new statistic and, if it is larger than the statistic for the unpermuted model, increment a counter. We then repeat the random shuffling, and keep incrementing the counter until we think we are done with them. We then divide the counter by the number of permutations and the result is the p-value what we wanted to find.

This procedure can be stated as:

$\hat{p} = \frac{1}{N}\sum^{N}_{n=1}\mathcal{I}(T^{*}_{n} \geqslant T_0)$

where $\hat{p}$ is the estimated p-value, $N$ is the number of permutations, $T_0$ is the statistic for the unpermuted model, $T^{*}_n$ is the statistic for the $n$-th random shuffling, and $\mathcal{I}$ is an indicator function that evaluates as 1 if the condition between parenthesis is true, or 0 otherwise. This formulation produces unbiased results: since $E(\mathcal{I}(T^{*}_{n} \geqslant T_0)) = p$, the true p-value, then $E(\frac{1}{N}\sum^{N}_{n=1}\mathcal{I}(T^{*}_{n} \geqslant T_0)) = p$.

## The problem

This strategy, however, has a problem. If the true p-value, $p$, is small, or if the number of permutations is not sufficiently large, even after all the $N$ permutations are done, no $T^{*}_{n}$ may reach or surpass the value of $T_0$, resulting in a p-value equal to 0. While this is still a valid, unbiased result, p-values equal to zero are problematic, as they can be interpreted as indicating the rejection of the null-hypothesis with absolute confidence. And this even after very few permutations (perhaps just 1!) have been performed, which is indeed rather counter-intuitive. We cannot be completely sure of the rejection of the null simply for having done just a few permutations.

## Biasing to solve the problem

The solution is to make two modifications to the formulation above: start counting the $n$-th permutation at 0, for the unpermuted model, and divide the counter at the end not by $N$, but by $N+1$:

$\hat{p} = \frac{1}{N+1}\sum^{N}_{n=0}\mathcal{I}(T^{*}_{n} \geqslant T_0)$

This formulation, which pools the unpermuted model together with the permuted realisations, is biased: for $n=0$, $\mathcal{I}(T^{*}_{n} \geqslant T_0) = 1$,  and so, $E(\frac{1}{N+1}\sum^{N}_{n=0}\mathcal{I}(T^{*}_{n} \geqslant T_0)) = \frac{1+Np}{N+1} \neq p$.

## Quantifying the bias

The bias is smaller for a large number of permutations, and converges to zero as $N$ increases towards infinity. What is important, however, is that the bias punishes the results towards conservativeness whenever the true and unknown $p$ is smaller than $\frac{1}{N+1}$, in a way that no p-value smaller than $\frac{1}{N+1}$ can ever be attained. It forces the researcher to run more permutations to find such small p-values to reject the null hypothesis, which otherwise could be rejected with a p-value equal to zero that would appear easily even after just a handful of permutations.

The bias is larger for smaller number of permutations and for smaller true p-values.

## Controversy

The solution has been proposed by some authors, such as Edgington (1995), and strongly advocated by, e.g., Phipson and Smyth (2010). While the biased solution is definitely the most adequate for practical scientific problems, it can be considered as of little actual statistical meaning. Pesarin and Salmaso (2010) explain that the distribution of $T_0$ is not the same as the distribution of $T^{*}_{n}$, unless the null is true. Under the alternative hypothesis, therefore, $T_0$ should not be pooled into the empirical distribution from which the p-value is obtained. Moreover, under the null, all possible permutations have the same probability, with no intrinsic reason to treat some (as the unpermuted) differently than others. If one wants unbiased results, the alternative formulation should obviously not be used.

# Competition ranking and empirical distributions

Estimating the empirical cumulative distribution function (empirical cdf) from a set of observed values requires computing, for each value, how many of the other values are smaller or equal to it. In other words, let $X = \{x_1, x_2, \ldots , x_N\}$ be the observed values, then:

$F(x) = P(X \leqslant x) = \frac{1}{N} \sum_{n=1}^{N}I(x_n \leqslant x)$

where $F(x)$ is the empirical cdf, $P(X \leqslant x)$ is the probability of observing a value in $X$ that is smaller than or equal to $x$, $I(\cdot)$ is a function that evaluates as 1 if the condition in the parenthesis is satisfied or 0 otherwise, and $N$ is the number of observations. This definition is straightforward and can be found in many statistical textbooks (see, e.g., Papoulis, 1991).

For continuous distributions, a p-value for a given $x \in X$ can be computed as $1-F(x)$. For a discrete cdf, however, this trivial relationship does not hold. The reason is that, while for continuous distributions the probability $P(X = x) = 0$, for discrete distributions (such as empirical distributions), $P(X = x) > 0$ whenever $x \in X$. As we often want to assign a probability to one of the values that have been observed, the condition $x \in X$ always holds, and the simple relationship between the cdf and the p-value is lost. The solution, however, is still simple: a p-value can be obtained from the cdf as $1-F(x) + 1/N$.

In the absence of repeated values (ties), the cdf can be obtained computationally by sorting the observed data in ascending order, i.e., $X_{s} = \{x_{(1)}, x_{(2)}, \ldots , x_{(N)}\}$. Then $F(x)=(n_x)/N$, where $(n_x)$ represents the ascending rank of $x$. Likewise, the p-value can be obtaining by sorting the data in descending order, and using a similar formula, $P(X \geqslant x) = (\tilde{n}_x)/N$, where $(\tilde{n}_x)$ represents the descending rank of $x$.

Example 1: Consider the following sequence of 10 observed values, already sorted in ascending order:

$X=\{75, 76, 79, 80, 84, 85, 86, 88, 90, 94\}$

The corresponding ranks are simply 1, 2, 3, 4, 5, 6, 7, 8, 9, 10. Consequently, the cdf is:

$F(x|x\in X) = \{0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1\}$

The p-values are computed by using the ranks in descending order, i.e., 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, which yields:

$P(X \geqslant x | x \in X) = \{1, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1\}$

## The problem with ties

Simple sorting, as above, is effective when there are no repeated values in the data. However, in the presence of ties, simple sorting that produces ordinal rankings yield incorrect results.

Example 2: Consider the following sequence of 10 observed values, in which ties are present:

$X = \{81, 81, 82, 83, 83, 83, 84, 85, 85, 85\}$

If the same computational strategy discussed above were used, the we would conclude that the p-value for $x=85$ would be either 0.1, 0.2 or 0.3, depending on which instance of the “85” we were to choose. The correct value, however, is 0.3 (only), since 3 out of 10 variables are equal or above 85. The solution to this problem is to replace the simple, ordinal ranking, for a modified version of the so called competition ranking.

## The competition ranking

The competition ranking assigns the same rank to values that are identical. The standard competition ranking for the previous sequence is, in ascending order, 1, 1, 3, 4, 4, 4, 7, 8, 8, 8. In descending order, the ranks are 9, 9, 8, 5, 5, 5, 4, 1, 1, 1. Note that in this ranking, the ties receive the “best” possible rank. A slightly different approach consists in assigning the “worst” possible rank to the ties. This modified competition ranking applied the previous sequence gives, in ascending order, the ranks 2, 2, 3, 5, 5, 5, 7, 10, 10, 10, whereas in descending order, the ranks are 10, 10, 8, 7, 7, 7, 4, 3, 3, 3.

Competition ranking solves the problem with ties for both the empirical cdf and for the calculation of the p-values. In both cases, it is the modified version of the ranking that needs to be used, i.e., the ranking that assigns the worst rank to identical values. Still using the same sequence as example, the cdf is given by:

$F(x|x\in X) = \{0.2, 0.2, 0.3, 0.6, 0.6, 0.6, 0.7, 1, 1, 1\}$

and the p-values are given by:

$P(X \geqslant x | x \in X) = \{1, 1, 0.8, 0.7, 0.7, 0.7, 0.4, 0.3, 0.3, 0.3\}$

Note that, as in the case with no ties, the p-values cannot be computed simply as $1-F(x)$ from the cdf as it would for continuous distributions. The correct formula is $1-F(x)+(n_x)/N$, with $(n_x)$ here representing the frequency of $x$.

## Uses

The cdf of a set of observed data is useful in many types of non-parametric inference, particularly those that use bootstrap, jack-knife, Monte Carlo, or permutation tests. Ties can be quite common when working with discrete data and discrete explanatory variables using any of these methods.

## Competition ranking in MATLAB and Octave

A function that computes the competition ranks for Octave and/or MATLAB is available to download here: csort.m.

Once installed, type help csort to obtain usage information.

## References

Papoulis A. Probability, Random Variables and Stochastic Processes. McGraw-Hill, New York, 3rd ed, 1991.

# The logic of the Fisher method to combine P-values

Consider a set of $k$ independent tests, each of these to test a certain null hypothesis $\mathcal{H}_{0|i}$, $i=\{1, 2, \ldots, k\}$. For each test, a significance level $P_{i}$, i.e., a p-value, is obtained. All these p-values can be combined into a joint test whether there is a global effect, i.e., if a global null hypothesis $\mathcal{H}_0$ can be rejected.

There are a number of ways to combine these independent, partial tests. The Fisher method is one of these, and is perhaps the most famous and most widely used. The test was presented in Fisher’s now classical book, Statistical Methods for Research Workers, and was described rather succinctly:

When a number of quite independent tests of significance have been made, it sometimes happens that although few or none can be claimed individually as significant, yet the aggregate gives an impression that the probabilities are on the whole lower than would often have been obtained by chance. It is sometimes desired, taking account only of these probabilities, and not of the detailed composition of the data from which they are derived, which may be of very different kinds, to obtain a single test of the significance of the aggregate, based on the product of the probabilities individually observed.
The circumstance that the sum of a number of values of $\chi^{2}$ is itself distributed in the $\chi^{2}$ distribution with the appropriate number of degrees of freedom, may be made the basis of such a test. For in the particular case when $n=2$, the natural logarithm of the probability is equal to $\frac{1}{2}\chi^{2}$. If therefore we take the natural logarithm of a probability, change its sign and double it, we have the equivalent value of $\chi^{2}$ for 2 degrees of freedom. Any number of such values may be added together, to give a composite test, using the Table of $\chi^{2}$ to examine the significance of the result. — Fisher, 1932.

The test is based on the fact that the probability of rejecting the global null hypothesis is related to intersection of the probabilities of each individual test, $\prod_{i} P_{i}$. However, $\prod_{i} P_{i}$ is not uniformly distributed, even if the null is true for all partial tests, and cannot be used itself as the joint significance level for the global test. To remediate this fact, some interesting properties and relationships among distributions of random variables were exploited by Fisher and embodied in the succinct excerpt above. These properties are discussed below.

## The logarithm of uniform is exponential

The cumulative distribution function (cdf) of an exponential distribution is:

$F(x)=1- e^{-\lambda x}$

where $\lambda$ is the rate parameter, the only parameter of this distribution. The inverse cdf is, therefore, given by:

$x = -\dfrac{1}{\lambda}\ln(1-F(x))$

If $P$ is a random variable uniformly distributed in the interval $[0, 1]$, so is $1-P$, and it is immaterial to differ between them. As a consequence, the previous equation can be equivalently written as:

$x = -\dfrac{1}{\lambda}\ln(P)$

where $P \sim \mathcal{U}(0,1)$, which highlights the fact that the negative of the natural logarithm of a random variable distributed uniformly between 0 and 1 follows an exponential distribution with rate parameter $\lambda=1$.

## An exponential with rate 1/2 is chi-squared

The cdf of a chi-squared distribution with $\nu$ degrees of freedom, i.e. $\chi^{2}_{\nu}$, is given by:

$F(x; \nu) = \dfrac{\int_{0}^{x/2} t^{\frac{\nu}{2}-1}e^{-t}{\rm d}t}{\left(\frac{\nu}{2}-1\right)!}$

If $\nu=2$, and solving the integral we have:

$F(x; \nu=2) = \dfrac{\int_{0}^{x/2} t^{\frac{2}{2}-1}e^{-t}{\rm d}t}{\left(\frac{2}{2}-1\right)!} = \int_{0}^{x/2} e^{-t}{\rm d}t = 1-e^{-x/2}$

In other words, a $\chi^{2}$ distribution with $\nu=2$ is equivalent to an exponential distribution with rate parameter $\lambda=1/2$.

## The sum of chi-squared is also chi-squared

The moment-generating function (mgf) of a sum of independent variables is the product of the mgfs of the respective variables. The mgf of a $\chi^{2}_{\nu}$ is:

$M(t) = (1-2t)^{-\nu/2}$

The mgf of the sum of $k$ independent variables that follow each a $\chi^{2}_{2}$ distribution is then given by:

$M_{\text{sum}}(t) = \prod_{i=1}^{k} (1-2t)^{-2/2} = (1-2t)^{-k}$

which also defines a $\chi^{2}$ distribution, however with degrees of freedom $\nu=2k$.

## Assembling the pieces

With these facts in mind, how to transform the product $\prod_{i} P_{i}$ into a p-value that is uniformly distributed when the global null is true? The product can be converted into a sum by taking the logarithm. And as shown above, the logarithm of uniformly distributed variables follows an exponential distribution with rate parameter $\lambda=1$. Multiplication of each $\ln(P_{i})$ by 2 changes the rate parameter to $\lambda=1/2$ and makes this distribution equivalent to a $\chi^{2}$ distribution with degrees of freedom $\nu=2$. The sum of $k$ of these logarithms also follow a $\chi^2$ distribution, now with $\nu=2k$ degrees of freedom, i.e., $\chi^{2}_{2k}$.

The statistic for the Fisher method is, therefore, computed as:

$X = -2 \sum_{i=1}^{k} \ln(P_{i})$

with $X$ following a $\chi^{2}_{2k}$ distribution, from which a p-value for the global hypothesis can be easily obtained.

## Reference

The details above are not in the book, presumably omitted by Fisher as the knowledge of these derivation details would be of little practical use. Nonetheless, the reference for the book is:

The Fisher’s method to combine p-values is one of the most powerful combining functions that can be used for Non-Parametric Combination.

# Confidence intervals for Bernoulli trials

A Bernoulli trial is an experiment in which the outcome can be one of two mutually exclusive results, e.g. true/false, success/failure, heads/tails and so on. A number of methods are available to compute confidence intervals after many such trials have been performed. The most common have been discussed and reviewed by Brown et al. (2001), and are presented below. Consider $n$ trials, with $X$ successes and a significance level of $\alpha=0.05$ to obtain a 95% confidence interval. For each of the methods, the interval is shown graphically for $1 \leqslant n \leqslant 100$ and $1 \leqslant X \leqslant n$.

## Wald

This is the most common method, discussed in many textbooks, and probably the most problematic for small samples. It is based on a normal approximation to the binomial distribution, and it is often called “Wald interval” for it’s relationship with the Wald test. The interval is calculated as:

• Lower bound: $L=p-k\sqrt{pq/n}$
• Upper bound: $U=p+k\sqrt{pq/n}$

where $k = \Phi^{-1}\{1-\alpha/2\}$, $\Phi^{-1}$ is the probit function, $p=X/n$ and $q=1-p$.

Wald confidence interval.

## Wilson

This interval appeared in Wilson (1927) and is defined as:

• Lower bound: $L = \tilde{p} - (k/\tilde{n})\sqrt{npq+k^2/4}$
• Upper bound: $U = \tilde{p} + (k/\tilde{n})\sqrt{npq+k^2/4}$

where $\tilde{p} = \tilde{X}/\tilde{n}$, $\tilde{n}=n+k^2$, $\tilde{X} = X+ k^2/2$, and the remaining are as above. This is probably the most appropriate for the majority of situations.

Wilson confidence interval.

## Agresti-Coull

This interval appeared in Agresti and Coull (1998) and shares many features with the Wilson interval. It is defined as:

• Lower bound: $L = \tilde{p} - k\sqrt{\tilde{p}\tilde{q}/\tilde{n}}$
• Upper bound: $U = \tilde{p} + k\sqrt{\tilde{p}\tilde{q}/\tilde{n}}$

where $\tilde{q}=1-\tilde{p}$, and the remaining are as above.

Agresti-Coull confidence interval.

## Jeffreys

This interval has a Bayesian motivation and uses the Jeffreys prior (Jeffreys, 1946). It seems to have been introduced by Brown et al. (2001). It is defined as:

• Lower bound: $L = \mathcal{B}^{-1}\{\alpha/2,X+1/2,n-X+1/2\}$
• Upper bound: $U = \mathcal{B}^{-1}\{1-\alpha/2,X+1/2,n-X+1/2\}$

where $\mathcal{B}^{-1}\{x,s_1,s_2\}$ is the inverse cdf of the beta distribution (not to be confused with the beta function), at the quantile $x$, and with shape parameters $s_1$ and $s_2$.

Jeffreys confidence interval.

## Clopper-Pearson

This interval was proposed by Clopper and Pearson (1934) and is based on a binomial test, rather than on approximations, hence sometimes being called “exact”, although it is not “exact” in the common sense. It is considered overly conservative.

• Lower bound: $L = \mathcal{B}^{-1}\{\alpha/2,X,n-X+1\}$
• Upper bound: $U = \mathcal{B}^{-1}\{1-\alpha/2,X+1,n-X\}$

where $\mathcal{B}^{-1}\{x,s_1,s_2\}$ is the inverse cdf of the beta distribution as above.

Clopper-Pearson confidence interval.

## Arc-Sine

This interval is based on the arc-sine variance-stabilising transformation. The interval is defined as:

• Lower bound: $L = \sin\{\arcsin\{\sqrt{a}\} - k/(2\sqrt{n})\}^2$
• Upper bound: $U = \sin\{\arcsin\{\sqrt{a}\} + k/(2\sqrt{n})\}^2$

where $a=\frac{X+3/8}{n+3/4}$ replaces what otherwise would be $p$ (Anscombe, 1948).

Arc-sine confidence interval.

## Logit

This interval is based on the Wald interval for $\lambda = \ln\{\frac{X}{n-X}\}$. It is defined as:

• Lower bound: $L = e^{\lambda_L}/(1+e^{\lambda_L})$
• Upper bound: $U = e^{\lambda_U}/(1+e^{\lambda_U})$

where $\lambda_L = \lambda - k\sqrt{\hat{V}}$, $\lambda_U = \lambda + k\sqrt{\hat{V}}$, and $\hat{V} = \frac{n}{X(n-X)}$.

Logit confidence interval.

## Anscombe

This interval was proposed by Anscombe (1956) and is based on the logit interval:

• Lower bound: $L = e^{\lambda_L}/(1+e^{\lambda_L})$
• Upper bound: $U = e^{\lambda_U}/(1+e^{\lambda_U})$

The difference is that $\lambda=\ln\{\frac{X+1/2}{n-X+1/2}\}$ and $\hat{V}=\frac{(n+1)(n+2)}{n(X+1)(n-X+1)}$. The values for $\lambda_L$ and $\lambda_U$ are as above.

Anscombe’s confidence interval.

## Octave/MATLAB implementation

A function that computes these intervals is available here: confint.m.

# False Discovery Rate: Corrected & Adjusted P-values

Until mid-1990’s, control of the error rate under multiple testing was done, in general, using family-wise error rate (FWER). An example of this kind of correction is the Bonferroni correction. In an influential paper, Benjamini and Hochberg (1995) introduced the concept of false discovery rate (FDR) as a way to allow inference when many tests are being conducted. Differently than FWER, which controls the probability of committing a type I error for any of a family of tests, FDR allows the researcher to tolerate a certain number of tests to be incorrectly discovered. The word rate in the FDR is in fact a misnomer, as the FDR is the proportion of discoveries that are false among all discoveries, i.e., proportion of incorrect rejections among all rejections of the null hypothesis.

## Benjamini and Hochberg’s FDR-controlling procedure

Consider testing $N$ hypotheses, $H_{1}, H_{2}, \ldots , H_{N}$ based on their respective p-values, $p_{1}, p_{2}, \ldots , p_{N}$. Consider that a fraction $q$ of discoveries are allowed (tolerated) to be false. Sort the p-values in ascending order, $p_{(1)} \leq p_{(2)} \leq \ldots \leq p_{(N)}$ and denote $H_{(i)}$ the hypothesis corresponding to $p_{(i)}$. Let $k$ be the largest $i$ for which $p_{(i)} \leq \frac{i}{N}\frac{q}{c(N)}$. Then reject all $H_{(i)}$, $i=1, 2, \ldots , k$. The constant $c(N)$ is not in the original publication, and appeared in Benjamini and Yekutieli (2001) for cases in which independence cannot be ascertained. The possible choices for the constant are $c(N)=1$ or $c(N)=\sum_{i=1}^{N} 1/i$. The second is valid in any situation, whereas the first is valid for most situations, particularly where there are no negative correlations among tests. The B&H procedure has found many applications across different fields, including neuroimaging, as introduced by Genovese et al. (2002).

## FDR correction

The procedure described above effectively defines a single number, a threshold, that can be used to declare tests as significant or not at the level $q$. One might, instead, be interested to know, for each original p-value, the minimum $q$ level in which they would still be rejected. To find this out, define $q_{(i)}=\frac{p_{(i)}N}{i}c(N)$. This formulation, however, has problems, as discussed next.

The problem with the FDR-correction is that $q_{(i)}$ is not a monotonic function of $p_{(i)}$. This means that if someone uses any $q_{(i)}$ to threshold the set of FDR-corrected values, the result is not the same as would be obtained by applying sequentially the B&H procedure for all these corresponding $q$ levels.

To address this concern, Yekutieli and Benjamini (1999) introduced the FDR-adjustment, in which monotonicity is enforced, and which definition is compatible with the original FDR definition. Let $q*_{(i)}$ be the FDR-adjusted value for $p_{(i)}$. It’s value is the smallest $q_{(k)}, k \geq i$, where $q_{(k)}$ is the FDR-corrected as defined above. That’s just it!

## Example

Consider the image below, on the left, a square of 9×9 pixels containing each some t-statistic with some sparse, low magnitude signal added. On the right are the corresponding p-values, ranging approximately between 0-1:

Statistical maps

Using the B&H procedure to compute the threshold, with $q=0.20$, and applying this threshold to the image rejects the null hypothesis in 6 pixels. On the right panel, the red line corresponds to the threshold. All p-values (in blue) below this line are declared significant.

FDR-threshold

Computing the FDR-corrected values at each pixel, and thresholding at the same $q=0.20$, however, does not produce the same results, and just 1 pixel is declared significant. Although conservative, the “correction” is far from the nominal false discovery rate and is not appropriate. Note on the right panel the lack of monotonicity.

FDR-corrected

Computing instead the FDR-adjusted values, and thresholding again at $q=0.20$ produces the same results as with the simpler FDR-threshold. The adjustment is, thus, more “correct” than the “corrected”. Observe, on the right panel, the monotonicity enforced.

## Implementation

An implementation of the three styles of FDR for Octave/MATLAB is available here: fdr.m
SPM users will find a similar feature in the function spm_P_FDR.m.

# Log-normality and the Box-Cox transformation

The Box-Cox transformation (Box and Cox, 1964) is a way to transform data that ordinarily do not follow to a normal distribution so that it then conforms to it. The transformation is a piecewise function of the power parameter $\lambda$:

The function is, given the definition, continuous at the singular point $\lambda = 0$. Plotting the transformation as a function of the original data $y$ and the parameter $\lambda$ produces the following:

An interesting aspect of the Box-Cox transformation is that it can be used as a metric to quantify how normal or log-normal certain data are. This has clear applications in biology. Tissue growth that depends on cellular multiplication is exponential, with a constant factor that is often normally distributed, resulting in tissue size that is log-normally distributed (for discussion, see McAlister, 1879; Koch, 1966; Koch, 1969). Measurements of the final tissue size in different individuals, however, may not perfectly conform to the log-normal due to external influences that may hinder tissue growth. Moreover, it is not always possible to measure the final amount of tissue that is the product of a single lineage of self-multiplicative cells. The combination of different cell lineages, each with their own growth rate, as well as external influences, tend to produce a distribution that is more normally (Gaussianly) distributed. A distribution gradient, therefore, may exist between these two extremes of normality and log-normality. Estimating the parameter $\lambda$ allows one to estimate also how closer to normal or log-normal certain measurements are. If $\lambda \approx 0$, the original data tend to be log-normally distributed, whereas if $\lambda \approx +1$, the data can be considered approximately more normally distributed.