# Extreme value notes

Extreme values are useful to quantify the risk of catastrophic floods, and much more.

This is a brief set of notes with an introduction to extreme value theory. For reviews, see Leadbetter et al (1983) and David and Huser (2015) [references at the end]. Also of some (historical) interest might be the classical book by Gumbel (1958). Let $X_1, \dots, X_n$ be a sequence of independent and identically distributed variables with cumulative distribution function (cdf) $F(x)$ and let $M_n =\max(X_1,\dots,X_n)$ denote the maximum.

If $F(x)$ is known, the distribution of the maximum is:

$\begin{array}{lll} P(M_n \leqslant x) &=&P(X_1 \leqslant x, \dots, X_n \leqslant x) \\ &=& P(X_1 \leqslant x) \cdots P(X_n \leqslant x) = F^n(x). \end{array}$

The distribution function $F(x)$ might, however, not be known. If data are available, it can be estimated, although small errors on the estimation of $F(x)$ can lead to large errors concerning the extreme values. Instead, an asymptotic result is given by the extremal types theorem, also known as Fisher-Tippett-Gnedenko Theorem, First Theorem of Extreme Values, or extreme value trinity theorem (called under the last name by Picklands III, 1975).

But before that, let’s make a small variable change. Working with $M_n$ directly is problematic because as $n \rightarrow \infty$, $F^n(x) \rightarrow 0$. Redefining the problem as a function of $M_n^* = \frac{M_n-b_n}{a_n}$ renders treatment simpler. The theorem can be stated then as: If there exist sequences of constants $a_n \in \mathbb{R}_{+}$ and $b_n \in \mathbb{R}$ such that, as $n \rightarrow \infty$:

$P\left(M_{n}^{*} \leqslant x \right) \rightarrow G(x)$

then $G(x)$ belongs to one of three “domains of attraction”:

• Type I (Gumbel law): $\Lambda(x) = e^{-e^{-\frac{x-b}{a}}}$, for $x \in \mathbb{R}$ indicating that the distribution of $M_n$ has an exponential tail.
• Type II (Fréchet law): $\Phi(x) = \begin{cases} 0 & x\leqslant b \\ e^{-\left(\frac{x-b}{a}\right)^{-\alpha}} & x\; \textgreater\; b \end{cases}$ indicating that the distribution of $M_n$ has a heavy tail (including polynomial decay).
• Type III (Weibull law): $\Psi(x) = \begin{cases} e^{-\left( -\frac{x-b}{a}\right)^\alpha} & x\;\textless\; b \\ 1 & x\geqslant b \end{cases}$ indicating that the distribution of $M_n$ has a light tail with finite upper bound.

Note that in the above formulation, the Weibull is reversed so that the distribution has an upper bound, as opposed to a lower one as in the Weibull distribution. Also, the parameterisation is slightly different than the one usually adopted for the Weibull distribution.

These three families have parameters $a\; \textgreater\; 0$, $b$ and, for families II and III, $\alpha\; \textgreater\; 0$. To which of the three a particular $F(x)$ is attracted is determined by the behaviour of the tail of of the distribution for large $x$. Thus, we can infer about the asymptotic properties of the maximum while having only a limited knowledge of the properties of $F(x)$.

These three limiting cases are collectively termed extreme value distributions. Types II and III were identified by Fréchet (1927), whereas type I was found by Fisher and Tippett (1928). In his work, Fréchet used $M_n^* = \frac{M_n}{a_n}$, whereas Fisher and Tippett used $M_n^* = \frac{M_n-b_n}{a_n}$. Von Mises (1936) identified various sufficient conditions for convergence to each of these forms, and Gnedenko (1943) established a complete generalisation.

## Generalised extreme value distribution

As shown above, the rescaled maxima converge in distribution to one of three families. However, all are cases of a single family that can be represented as:

$G(x) = e^{-\left(1-\xi\left(\frac{x-\mu}{\sigma}\right)\right)^{\frac{1}{\xi}}}$

defined on the set $\left\{x:1-\xi\frac{x-\mu}{\sigma}\;\textgreater\;0\right\}$, with parameters $-\infty \;\textless \;\mu\;\textless\; \infty$ (location), $\sigma\;\textgreater\;0$ (scale), and $-\infty\;\textless\;\xi\;\textless\;\infty$ (shape). This is the generalised extreme value (GEV) family of distributions. If $\xi \rightarrow 0$, it converges to Gumbel (type I), whereas if $\xi < 0$ it corresponds to Fréchet (type II), and if $\xi\;\textgreater\;0$ it corresponds to Weibull (type III). Inference on $\xi$ allows choice of a particular family for a given problem.

## Generalised Pareto distribution

For $u\rightarrow\infty$, the limiting distribution of a random variable $Y=X-u$, conditional on $X \;\textgreater\; u$, is:

$H(y) = 1-\left(1-\frac{\xi y}{\tilde{\sigma}}\right)^{\frac{1}{\xi}}$

defined for $y \;\textgreater\; 0$ and $\left(1-\frac{\xi y}{\tilde{\sigma}}\right) \;\textgreater\; 0$. The two parameters are the $\xi$ (shape) and $\tilde{\sigma}$ (scale). The shape corresponds to the same parameter $\xi$ of the GEV, whereas the scale relates to the scale of the former as $\tilde{\sigma}=\sigma-\xi(u-\mu)$.

The above is sometimes called the Picklands-Baikema-de Haan theorem or the Second Theorem of Extreme Values, and it defines another family of distributions, known as generalised Pareto distribution (GPD). It generalises an exponential distribution with parameter $\frac{1}{\tilde{\sigma}}$ as $\xi \rightarrow 0$, an uniform distribution in the interval $\left[0, \tilde{\sigma}\right]$ when $\xi = 1$, and a Pareto distribution when $\xi \;\textgreater\; 0$.

## Parameter estimation

By restricting the attention to the most common case of $-\frac{1}{2}<\xi<\frac{1}{2}$, which represent distributions approximately exponential, parametters for the GPD can be estimated using at least three methods: maximum likelihood, moments, and probability-weighted moments. These are described in Hosking and Wallis (1987). For $\xi$ outside this interval, methods have been discussed elsewhere (Oliveira, 1984). The method of moments is probably the simplest, fastest and, according to Hosking and Wallis (1987) and Knijnenburg et al (2009), has good performance for the typical cases of $-\frac{1}{2}<\xi<\frac{1}{2}$.

For a set of extreme observations, let $\bar{x}$ and $s^2$ be respectively the sample mean and variance. The moment estimators of $\tilde{\sigma}$ and $\xi$ are $\hat{\tilde{\sigma}} = \frac{\bar{x}}{2}\left(\frac{\bar{x}^2}{s^2}+1\right)$ and $\hat{\xi}=\frac{1}{2}\left(\frac{\bar{x}^2}{s^2}-1\right)$.

The accuracy of these estimates can be tested with, e.g., the Anderson-Darling goodness-of-fit test (Anderson and Darling, 1952; Choulakian and Stephens, 2001), based on the fact that, if the modelling is accurate, the p-values for the distribution should be uniformly distributed.

## Availability

Statistics of extremes are used in PALM as a way to accelerate permutation tests. More details to follow soon.

## References

The figure at the top (flood) is in public domain.

# Non-Parametric Combination (NPC) for brain imaging

Have you ever had an analysis in which there was a large set of contrasts, all of interest, and you were worried about multiple testing? An eventual effect would be missed by a simple Bonferroni correction, but you did not know what else to do? Or did you have a set of different studies and you wished to obtain a style of meta-analytic result, indicating whether there would be evidence across all of them, without requiring the studies to be all consistently significant?

The Non-Parametric Combination (NPC) solves these issues. It is a way of performing joint inference on multiple data collected on the same experimental units (e.g., same subjects), all with minimal assumptions. The method was proposed originally by Pesarin (1990, 1992) [see references below], independently by Blair and Karninski (1993), and described extensively by Pesarin and Salmaso (2010). In this blog entry, the NPC is presented in brief, with emphasis on the modifications we introduce to render it feasible for brain imaging. The complete details are in our paper that has just been published in the journal Human Brain Mapping.

## NPC in a nutshell

The NPC consists of, in a first phase, testing each hypothesis separately using permutations that are performed synchronously across datasets; these tests are termed partial tests. The resulting statistics for each and every permutation are recorded, allowing an estimate of the complete empirical null distribution to be constructed for each one. In a second phase, the empirical p-values for each statistic are combined, for each permutation, into a joint statistic. As such a combined joint statistic is produced from the previous permutations, an estimate of its empirical distribution function is immediately known, and so is the p-value of the joint test. A flowchart of the original algorithm is shown below; click to see it side-by-side with the modified one (described below).

## A host of combining functions

The null hypothesis of the NPC is that null hypotheses for all partial tests are true, and the alternative hypothesis that any is false, which is the same null of a union-intersection test (UIT; Roy, 1953). The rejection region depends on how the combined statistic is produced. Various combining functions, which produce such combined statistics, can be considered, and some of the most well known are listed in the table below:

Method Statistic p-value
Tippett $\min \left(p_{k}\right)$ $1-\left(1-T\right)^{K}$
Fisher $-2 \sum_{k=1}^{K} \ln\left(p_{k}\right)$ $1-\chi^{2}\left(T;\;\nu=2K\right)$
Stouffer $\frac{1}{\sqrt{K}} \sum_{k=1}^{K} \Phi^{-1}\left(1-p_{k}\right)$ $1-\Phi\left(T;\;\mu=0,\;\sigma^2=1\right)$
Mudholkar–George $\frac{1}{\pi}\sqrt{\frac{3(5K+4)}{K(5K+2)}}\sum_{k=1}^{K} \ln\left(\frac{1-p_{k}}{p_{k}}\right)$ $1-t_{\text{cdf}}(T;\;\nu=5K+4)$

In the table, $K$ is the number of partial tests, and the remaining of the variables follow the usual notation (see the Table 1 in the paper for the complete description). Many of these combining functions were proposed over the years for applications such as meta-analyses, and many of them assume independence between the tests being combined, and will give incorrect p-values if such assumption is not met. In the NPC, lack of dependence is not a problem, even if these same functions are used: the synchronised permutations ensure that any dependence, if existing, is taken into account, and this is done so implicitly, with no need for explicit modelling.

The different combining functions lead to different rejection regions for the null hypothesis. For the four combining functions in the table above, the respective rejection regions are in the figure below.

The combining functions can be modified to allow combination of tests so as to favour hypotheses with concordant directions, or be modified for bi-directional tests. Click on the figure above for examples of these cases (again, see the paper for the complete details).

## Two problems, one solution

The multiple testing problem is well known in brain imaging: as an image comprises thousands of voxels/vertices/faces, correction is necessary. Bonferroni is in general too conservative, and various other approaches have been proposed, such as the random field theory. Permutation tests provide control over the familywise error rate (FWER) for the multiple tests across space, requiring only the assumption of exchangeability. This is all well known; see Nichols and Hayasaka (2003) and Winkler et al. (2014) for details.

However, another type of multiple testing is also common: analyses that test multiple hypotheses using the same model, multiple pairwise group comparisons, multiple and distinct models, studies using multiple modalities, that mix imaging and non-imaging data, that consider multiple processing pipelines, and even multiple multivariate analyses. All these common cases also need multiple testing correction. We call this multiple testing problem MTP-II, to discern it from the well known multiple testing problem across space, described above, which we term MTP-I.

One of the many combining functions possible with NPC, the one proposed by Tippett (1931), has a further property that makes it remarkably interesting. The Tippett function uses the smallest p-value across partial tests as its test statistic. Alternatively, if all statistics are comparable, it can be formulated in terms of the maximum statistic. It turns out that the distribution of the maximum statistic across a set of tests is also the distribution that can be used in a closed testing procedure (Marcus et al., 1976) to correct for the familywise error rate (FWER) using resampling methods, such as permutation. In the context of joint inference, FWER-correction can also be seen as an UIT. Thus, NPC offers a link between combination of multiple tests, and correction for multiple tests, in both cases regardless of any dependence between such tests.

This means that the MTP-II, for which correction in the parametric realm is either non-existing or fiendishly difficult, can be accommodated easily. It requires no explicit modelling of the dependence between the tests, and the resulting error rates are controlled exactly at the test level, adding rigour to what otherwise could lead to an excess of false positives without correction, or be overly conservative if a naïve correction such as Bonferroni were attempted.

## Modifying for imaging applications

As originally proposed, in practice NPC cannot be used in brain imaging. As the statistics for all partial tests for all permutations need to be recorded, an enormous amount of space for data storage is necessary. Even if storage space were not a problem, the discreteness of the p-values for the partial tests is problematic when correcting for multiple testing, because with thousands of tests in an image, ties are likely to occur, further causing ties among the combined statistics. If too many tests across an image share the same most extreme statistic, correction for the MTP-I, while still valid, becomes less powerful (Westfall and Young, 1993; Pantazis et al., 2005). The most obvious workaround — run an ever larger number of permutations to break the ties — may not be possible for small sample sizes, or when possible, requires correspondingly larger data storage.

The solution is loosely based on the direct combination of the test statistics, by converting the test statistics of the partial tests to values that behave as p-values, using the asymptotic distribution of the statistics for the partial tests. We call these as “u-values”, in order to emphasise that they are not meant to be read or interpreted as p-values, but rather as transitional values that allow combinations that otherwise would not be possible.

For spatial statistics, the asymptotic distribution of the combined statistic is used, this time to produce a z-score, which can be subjected to the computation of cluster extent, cluster mass, and/or threshold-free cluster enhancement (TFCE; Smith and Nichols, 2009). A flow chart of the modified algorithm is shown below. Click to see it side-by-side with the original.

## More power, fewer assumptions

One of the most remarkable features of NPC is that the synchronised permutations implicitly account for the dependence structure among the partial tests. This means that even combining methods originally derived under the assumption of independence can be used when such independence is untenable. As the p-values are assessed via permutations, distributional restrictions are likewise not necessary, liberating NPC from most assumptions that thwart parametric methods in general. This renders NPC a good alternative to classical multivariate tests, such as MANOVA, MANCOVA, and Hotelling’s T2 tests: each of the response variables can be seen as an univariate partial test in the context of the combination, but without the assumptions that are embodied in these old multivariate tests.

As if all the above were not already sufficient, NPC is also more powerful than such classical multivariate tests. This refers to its finite sample consistency property, that is, even with fixed sample size, as the number of modalities being combined increases, the power of the test also increases. The power of classical multivariate tests, however, increases up to a certain point, then begins to decrease, eventually reaching zero when the number of combining variables match the sample size.

The figure below summarises the analysis of a subset of the subjects of a published FMRI study (Brooks et al, 2005) in which painful stimulation was applied to the face, hand, and foot of 12 subjects. Using permutation tests separately, no results could be identified for any of the three types of stimulation. A simple multivariate test, the Hotelling’s T2 test, even assessed using permutations, did not reveal any effect of stimulation either. The NPC results, however, suggest involvement of large portions of the anterior insula and secondary somatosensory cortex. The Fisher, Stouffer and Mudholkar–George combining functions were particularly successful in recovering a small area of activity in the midbrain and periaqueductal gray area, which would be expected from previous studies on pain, but that could not be located from the original, non-combined data.

Detailed assessment of power, using variable number of modalities, and of modalities containing signal, is shown in the paper.

## Combinations or conjunctions?

Combination, as done via NPC, is different than conjunctions (Nichols et al., 2005) in the following: in the combination, one seeks for aggregate significance across partial tests, without the need that any individual study is necessarily significant. In the conjunction, it is necessary that all of them, with no exception, is significant. As indicated above, the NPC forms an union-intersection test (UIT; Roy, 1953), whereas the conjunctions form an intersection-union test (IUT; Berger, 1982). The former can be said to be significant if any (or an aggregate) of the partial tests is significant, whereas the latter is significant if all the partial tests are.

## Availability

The NPC, with the modifications for brain imaging, is available in the tool PALM — Permutation Analysis of Linear Models. It runs in either Matlab or Octave, and is free (GPL).

## References

Contributed to this post: Tom Nichols.

# FSL on the Raspberry Pi

How about processing brain imaging data on a Raspberry Pi? The different versions of this little device have performed exceptionally well for education, entertainment, and for a variety of do-it-yourself projects, with many examples listed in websites such as Instructables and Adafruit. Most of these applications are not computationally as intensive. Yet, the small size, low power consumption, improved hardware in recent models, and low price, may make this feasible.

## The Pi 2

Released earlier this year, the Raspberry Pi 2 (Model B) features a quad-core 900 MHz ARM processor, 1 GB of RAM, GPU, 4 USB ports, 10/100 Mbps Ethernet, HDMI and audio outputs, camera and display ports, as well as a low level general purpose interface (GPIO), all in a portable board of 85.6 mm × 56.5 mm (the same size as a credit card). It is powered by a 5 V, 800 mA DC (4 W) source, and sold for £30 or less.

Differently than earlier models, which had a CPU based on the ARMv6, the Pi 2 uses an ARM Cortex-A7 processor, which on its turn based on the ARMv7 architecture. Although there are Linux distributions that can run on the earlier models (such as ports of DebianopenSUSE, and Fedora), this change widens potential applications, not only because there are more ports available for ARMv7 (e.g., openSUSEDebianCentOS, among others), but also, the higher performance suggests that somewhat heavier data processing can be considered.

It is also possible to assemble multiple Pis in a cluster, using distributed computing engines such as SLURM, TORQUE or SGE. The Pi has the core requisites: it runs on Linux and comes with a decent 10/100 Mbps Ethernet port, such that creating a system is a matter of assembling the pieces and configuring.

## Neuroimaging with a small footprint

With this relatively high amount of computing power in such a physically small size and affordable price, the question is immediate: It is feasible to do neuroimaging on the Pi? The availability of Linux distributions for ARM platforms suggest that yes. However, the binaries for imaging software distributed for popular platforms as x86 (i386) and x86-64 (amd64) cannot work directly. Rather, the applications would need to be compiled from source.

For the FMRIB Software Library (FSL), the source code can be downloaded and the compilation proceed. Much simpler than that, however, is to take a different route: FSL has been included in NeuroDebian. This alone does not seem helpful, as the packages in the repository are only for 32-bit and 64-bit PCs (the i386 and amd64 ports), and SPARC. However, these packages have made into the upstream Debian, which means they are available for all platforms for which Debian itself has been ported. This includes the ARMv7 that powers the Pi, for which the port armhf (for chips that use a hardware floating point unit) can be used.

The steps to have a working installation of FSL on the Pi 2 are described below. Other interesting software, such as FreeSurfer, would need to be compiled from the source. For SPM, there is no Matlab port at the moment, but Octave runs without problems, such that most functionalities are expected to work. Applications based on Java, such as Mango, work without problems.

## Requirements

The photo at the top shows the hardware assembly used for this article. The following is required:

• A Raspberry Pi 2 (Model B).
• Power source (can be the USB port of another computer).
• Micro SD card with at least 8 GB (below it is assumed 32 GB).
• Ethernet cable and a network that provides internet access.
• Optional: HDMI display and cable, USB keyboard, and possibly a USB mouse if a graphical system will be installed. Alternatively, a headless system also works, with access via SSH. Below it is assumed a display is connected.

## The procedure

Step 1: Download the system image kindly prepared by Sjoerd Simons, and uncompress it:

wget https://images.collabora.co.uk/rpi2/jessie-rpi2-20150705.img.gz
gunzip jessie-rpi2-20150705.img.gz


This image contains only a minimal set of Debian Jessie packages. It uses the kernel 3.18.5, and received a few firmware and boot tweaks that are specific to the Pi 2.

Step 2: Use your favourite utility to transfer the image to the micro SD card. For example, using Linux, run the following, replacing /dev/sdX for the letter corresponding to your SD card (warning: this will erase all data stored in the card):

dd bs=1M if=jessie-rpi2-20150705.img of=/dev/sdX


In some systems, the card may be in /dev/mmcblk0 instead in /dev/sdX. If a Linux machine is not available, but instead a Mac or even Windows, the instructions to install Raspbian also apply.

Step 3: Insert the card in the Pi and boot the system. In this image, the default root password is debian (you can change it for something sensible).

Step 4: The main partition in this disk image (mounted as /) has only 2.6 GB, which is not enough. Also, often more than 1 GB of memory is needed, so swap space for virtual memory is necessary. Use fdisk (as root) to expand the main partition and to create a new partition for swap. Usually this is done interactively. If the card has exactly 32 GB, the line below can be used directly, bypassing the interactive mode. It will define a main partition with 24 GB, and the remaining, about 5 GB, will be left for swap. For cards of different sizes, run fdisk manually, or change the line below accordingly.

printf "d\n2\nn\np\n\n\n+24G\nn\np\n\n\n\nt\n3\n82\nw\n" | fdisk -uc /dev/mmcblk0


Note that, when seen from the Pi, the SD card is at /dev/mmcblk0, not /dev/sdX.

Reboot (shutdown -r now), then after logging in again, run:

resize2fs /dev/mmcblk0p2
mkswap /dev/mmcblk0p3
swapon /dev/mmcblk0p3


The swapon command enables the swap partition for immediate use. To make the change permanent for the next reboot, edit the file /etc/fstab adding:

/dev/mmcblk0p3 swap swap defaults 0 0


Step 5: Edit the /etc/apt/sources.list so as to include the official Debian packages (you can replace the server for your favourite/closer mirror):

deb http://ftp.uk.debian.org/debian/ jessie main contrib non-free


Step 6: Refresh the cached list of packages, then install FSL:

apt-get update
apt-get install fsl


Step 7: The installation is almost ready. The downloaded packages do not have the “data” directory of FSL, which contains the atlases and standard space images. To obtain these, do one of the following:

• From a separate FSL installation (e.g., from a different computer), copy the contents of the \${FSLDIR}/data to the /usr/share/fsl/data of the newly installed system on the Raspberry Pi. This can be done over the network, via ssh, or after plugging in and mounting (with the correct privileges) the card in a different Linux system.
• If another computer with FSL installed is not available, download FSL for CentOS or Mac (at the end of the downloads page, under “Advanced Users”), then uncompress the downloaded file, and copy the whole contents of the data directory to /usr/share/fsl/data of the Pi via ssh
• If another computer is not at all accessible for this step, these files can be obtained using the Pi itself, from the command line. Logged in as root in the newly installed system, run:
cd /usr/share

• A last option is to skip this step, go to Step 9 below, then download and copy using a graphical web-browser from an installed desktop environment.

Step 8: Add this line to the file ~/.profile:

. /etc/fsl/fsl.sh


That’s it. All that is needed to run FSL from the command line has been done.

Step 9 (optional): The installation up to this step does not include a graphical user interface. To have one, install X and a desktop environment. For lightweight options, LXDE or XFCE can be considered. A screenshot of LXDE with FSLview and two terminal windows showing some system information is below (usually one would not run as root, but create an user account).

## Using Raspbian

The official operating system for the Raspberry Pi, Raspbian, is a customised version of Debian, thus capable of running FSL directly. However, FSL is not in the official rpi repository. It can still be installed following similar steps as above, remembering to use sudo with commands require root privileges (the default account is rpi and the password is raspberry), and with care in the repartitioning, as the official disk image uses a different scheme. In Step 5, include the same Debian package source in the /etc/apt/sources.list file.

## Overclocking

The Pi can be overclocked. Conservative, stable settings, that do not void the warranty, consist of increasing the CPU frequency to 1000 MHz (from the default 900), the GPU and SDRAM frequencies to 500 MHz (the defaults are 250 and 450 respectively), and the CPU/GPU voltage by 2 steps, i.e., by 2 × 25 mV, from 1.20 to 1.25 V. The overclock settings are adjusted in the file /boot/firmware/config.txt if using Debian (following the steps above), or in /boot/config.txt if using Raspbian:

arm_freq=1000
gpu_freq=500
sdram_freq=500
over_voltage=2


These settings cause the bogomips to jump from 38.40 to 64.00. The temperature of the onboard chips can increase, however, and a suggestion is to use heatsinks or fans, which are inexpensive and can be purchased online (fans would be powered by GPIO pins).

## Performance

With the system up and running, it is time for some benchmarks. Although the assembly is exciting and in general the system speed respectable, unfortunately, processing using FEEDS suggests a poor performance. The table below compares the timings of the Pi 2 with default versus overclocked settings, relative to a minimal install of the Debian Jessie on a notebook with an Intel Core i5 processor and 8 GB of RAM.

Default settings Overclocked settings
PRELUDE & FUGUE  6.0  4.8
SUSAN  15.1  11.9
SIENAX  13.3  10.4
BET2  12.3  9.4
FEAT  12.1  9.6
MELODIC  15.9  12.2
FIRST  14.0  11.1
FDT  7.4  5.9
FNIRT  26.6  19.3
Total time  12.2  9.5

Running the whole FEEDS took 22 minutes in the Intel Core i5, whereas in the Pi 2 it took 4h29min with the default settings, and 3h30min after overclocking. It should be noted, however, that the 1 GB of RAM is not sufficient to run the test without using virtual memory (swapping). This needs to be taken into account when evaluating the table above. The SD card used for the tests is a Class 10, which is not as fast as actual RAM (faster cards would have their performance curtailed by hardware limits).

The performances of Debian and Raspbian on the Pi 2 are nearly identical. Running in the graphical mode (at least with LXDE) or in a console-only system do not seem to impact results, at least as far only one instance of FEEDS was running.

## Conclusion

It is possible to run FSL on the Raspberry Pi 2, and the procedure is not too different than doing the same in an ordinary computer. The performance, however, suggests that the current model, being about ten times slower, may not be a competitive choice for brain imaging.

# Permutation tests in the Human Connectome Project

Permutation tests are known to be superior to parametric tests: they are based on only few assumptions, essentially that the data are exchangeable, and allow the correction for the multiplicity of tests and the use of various non-standard statistics. The exchangeability assumption allows data to be permuted whenever their joint distribution remains unaltered. Usually this means that each observation needs to be independent from the others.

In many studies, however, there are repeated measurements on the same subjects, which violates exchangeability: clearly, the various measurements obtained from a given subject are not independent from each other. In the Human Connectome Project (HCP) (Van Essen et al, 2012; 2013; see references at the end), subjects are sampled along with their siblings (most of them are twins), such that independence cannot be guaranteed either.

In Winkler et al. (2014), certain structured types of non-independence in brain imaging were addressed through the definition of exchangeability blocks (EBs). Observations within EB can be shuffled freely or, alternatively, the EBs themselves can be shuffled as a whole. This allows various designs that otherwise could not be assessed through permutations.

The same idea can be generalised for blocks that are nested within other blocks, in a multi-level fashion. In the paper Multi-level Block Permutation (Winkler et al., 2015) we presented a method that allows blocks to be shuffled a whole, and inside them, sub-blocks are further allowed to be shuffled, in a recursive process. The method is flexible enough to accommodate permutations, sign-flippings (sometimes also called “wild bootstrap”), and permutations together with sign-flippings.

In particular, this permutation scheme allows the data of the HCP to be analysed via permutations: subjects are allowed to be shuffled with their siblings while keeping the joint distribution intra-sibship maintained. Then each sibship is allowed to be shuffled with others of the same type.

In the paper we show that the error type I is controlled at the nominal level, and the power is just marginally smaller than that would be obtained by permuting freely if free permutation were allowed. The more complex the block structure, the larger the reductions in power, although with large sample sizes, the difference is barely noticeable.

Importantly, simply ignoring family structure in designs as this causes the error rates not to be controlled, with excess false positives, and invalid results. We show in the paper examples of false positives that can arise, even after correction for multiple testing, when testing associations between cortical thickness, cortical area, and measures of body size as height, weight, and body-mass index, all of them highly heritable. Such false positives can be avoided with permutation tests that respect the family structure.

The figure at the top shows how the subjects of the HCP (terminal dots, shown in white colour) can be shuffled or not, while respecting the family structure. Blue dots indicate branches that can be permuted, whereas red dots indicate branches that cannot (see the main paper for details). This diagram includes 232 subjects of an early public release of HCP data. The tree on the left considers dizygotic twins as a category on their own, i.e., that cannot be shuffled with ordinary siblings, whereas the tree on the right considers dizygotic twins as ordinary siblings.

The first applied study using our strategy has just appeared. The method is implemented in the freely available package PALM — Permutation Analysis of Linear Models, and a set of practical steps to use it with actual HCP data is available here.

# Another look into Pillai’s trace

In a previous post, all commonly used univariate and multivariate test statistics used with the general linear model (GLM) were presented. Here an alternative formulation for one of these statistics, the Pillai’s trace (Pillai, 1955, references at the end), commonly used in MANOVA and MANCOVA tests, is presented.

We begin with a multivariate general linear model expressed as:

$\mathbf{Y} = \mathbf{M} \boldsymbol{\psi} + \boldsymbol{\epsilon}$

where $\mathbf{Y}$ is the $N \times K$ full rank matrix of observed data, with $N$ observations of $K$ distinct (possibly non-independent) variables, $\mathbf{M}$ is the full-rank $N \times R$ design matrix that includes explanatory variables (i.e., effects of interest and possibly nuisance effects), $\boldsymbol{\psi}$ is the $R \times K$ vector of $R$ regression coefficients, and $\boldsymbol{\epsilon}$ is the $N \times K$ vector of random errors. Estimates for the regression coefficients can be computed as $\boldsymbol{\hat{\psi}} = \mathbf{M}^{+}\mathbf{Y}$, where the superscript ($^{+}$) denotes a pseudo-inverse.

## The null hypothesis, and a simplification

One is generally interested in testing the null hypothesis that a contrast of regression coefficients is equal to zero, i.e., $\mathcal{H}_{0} : \mathbf{C}'\boldsymbol{\psi}\mathbf{D} = \boldsymbol{0}$, where $\mathbf{C}$ is a $R \times S$ full-rank matrix of $S$ contrasts of coefficients on the regressors encoded in $\mathbf{X}$, $1 \leqslant S \leqslant R$ and $\mathbf{D}$ is a $K \times Q$ full-rank matrix of $Q$ contrasts of coefficients on the dependent, response variables in $\mathbf{Y}$, $1 \leqslant Q \leqslant K$; if $K=1$ or $Q=1$, the model is univariate. Once the hypothesis has been established, $\mathbf{Y}$ can be equivalently redefined as $\mathbf{Y}\mathbf{D}$, such that the contrast $\mathbf{D}$ can be omitted for simplicity, and the null hypothesis stated as $\mathcal{H}_{0} : \mathbf{C}'\boldsymbol{\psi} = \boldsymbol{0}$.

## Model partitioning

It is useful to consider a transformation of the model into a partitioned one:

$\mathbf{Y} = \mathbf{X}\boldsymbol{\beta} + \mathbf{Z}\boldsymbol{\gamma} + \boldsymbol{\epsilon}$

where $\mathbf{X}$ is the matrix with regressors of interest, $\mathbf{Z}$ is the matrix with nuisance regressors, and $\boldsymbol{\beta}$ and $\boldsymbol{\gamma}$ are respectively the vectors of regression coefficients. From this model we can also define the projection (hat) matrices $\mathbf{H}_{\mathbf{X}}=\mathbf{X}\mathbf{X}^{+}$ and $\mathbf{H}_{\mathbf{Z}}=\mathbf{Z}\mathbf{Z}^{+}$ due to tue regressors of interest and nuisance, respectively, and the residual-forming matrices $\mathbf{R}_{\mathbf{X}}=\mathbf{I}-\mathbf{H}_{\mathbf{X}}$ and $\mathbf{R}_{\mathbf{Z}}=\mathbf{I}-\mathbf{H}_{\mathbf{Z}}$.

Such partitioning is not unique, and schemes can be as simple as separating apart the columns of $\mathbf{M}$ as $\left[ \mathbf{X} \; \mathbf{Z} \right]$, with $\boldsymbol{\psi} = \left[ \boldsymbol{\beta}' \; \boldsymbol{\gamma}' \right]'$. More involved strategies can, however, be devised to obtain some practical benefits. One such partitioning is to define $\mathbf{X} = \mathbf{M} \mathbf{Q} \mathbf{C} \left(\mathbf{C}'\mathbf{Q}\mathbf{C}\right)^{-1}$ and
$\mathbf{Z} = \mathbf{M} \mathbf{Q} \mathbf{C}_v \left(\mathbf{C}_v'\mathbf{Q}\mathbf{C}_v\right)^{-1}$, where $\mathbf{Q}=(\mathbf{M}'\mathbf{M})^{-1}$, $\mathbf{C}_v=\mathbf{C}_u-\mathbf{C}(\mathbf{C}'\mathbf{Q}\mathbf{C})^{-1}\mathbf{C}'\mathbf{Q}\mathbf{C}_u$, and $\mathbf{C}_u$ has $r-\mathsf{rank}\left(\mathbf{C}\right)$ columns that span the null space of $\mathbf{C}$, such that $[\mathbf{C} \; \mathbf{C}_u]$ is a $r \times r$ invertible, full-rank matrix (Smith et al, 2007). This partitioning has a number of features: $\boldsymbol{\hat{\beta}} = \mathbf{C}'\boldsymbol{\hat{\psi}}$, $\widehat{\mathsf{Cov}}(\boldsymbol{\hat{\beta}}) = \mathbf{C}'\widehat{\mathsf{Cov}}(\boldsymbol{\hat{\psi}})\mathbf{C}$, i.e., estimates and variances of $\boldsymbol{\beta}$ for inference on the partitioned model correspond exactly to the same inference on the original model, $\mathbf{X}$ is orthogonal to $\mathbf{Z}$, and $\mathsf{span}(\mathbf{X}) \cup \mathsf{span}(\mathbf{Z}) = \mathsf{span}(\mathbf{M})$, i.e., the partitioned model spans the same space as the original.

Another partitioning scheme, derived by Ridgway (2009), defines $\mathbf{X}=\mathbf{M}(\mathbf{C}^+)'$ and $\mathbf{Z}=\mathbf{M}-\mathbf{M}\mathbf{C}\mathbf{C}^{+}$. As with the previous strategy, the parameters of interest in the partitioned model are equal to the contrast of the original parameters. A full column rank nuisance partition can be obtained from the singular value decomposition (SVD) of $\mathbf{Z}$, which will also provide orthonormal columns for the nuisance partition. Orthogonality between regressors of interest and nuisance can be obtained by redefining the regressors of interest as $\mathbf{R}_{\mathbf{Z}}\mathbf{X}$.

## The usual multivariate statistics

For the multivariate statistics, define generically:

$\mathbf{H}=(\mathbf{C}'\boldsymbol{\hat{\psi}}\mathbf{D})' \left(\mathbf{C}'(\mathbf{M}'\mathbf{M})^{-1}\mathbf{C} \right)^{-1} (\mathbf{C}'\boldsymbol{\hat{\psi}}\mathbf{D})$

as the sums of products explained by the model (hypothesis) and:

$\mathbf{E} = (\boldsymbol{\hat{\epsilon}}\mathbf{D})'(\boldsymbol{\hat{\epsilon}}\mathbf{D})$

as the sums of the products of the residuals, i.e., that remain unexplained. With the simplification to the original model that redefined $\mathbf{Y}$ as $\mathbf{Y}\mathbf{D}$, the $\mathbf{D}$ can be dropped, so that we have $\mathbf{H}=(\mathbf{C}'\boldsymbol{\hat{\psi}})' \left(\mathbf{C}'(\mathbf{M}'\mathbf{M})^{-1}\mathbf{C} \right)^{-1} (\mathbf{C}'\boldsymbol{\hat{\psi}})$ and $\mathbf{E} = \boldsymbol{\hat{\epsilon}}'\boldsymbol{\hat{\epsilon}}$. The various well-known multivariate statistics (see this earlier blog entry) can be written as a function of $\mathbf{H}$ and $\mathbf{E}$. Pillai’s trace is:

$T=\mathsf{trace}\left(\mathbf{H}(\mathbf{H}+\mathbf{E})^{-1}\right)$

## More simplifications

With the partitioning, other simplifications are possible:

$\mathbf{H}=\boldsymbol{\hat{\beta}}' (\mathbf{X}'\mathbf{X})\boldsymbol{\hat{\beta}} = ( \mathbf{X}\boldsymbol{\hat{\beta}})'(\mathbf{X}\boldsymbol{\hat{\beta}})$

Recalling that $\mathbf{X}'\mathbf{Z}=\mathbf{0}$, and defining $\tilde{\mathbf{Y}}=\mathbf{R}_{\mathbf{Z}}\mathbf{Y}$, we have:

$\mathbf{H} = (\mathbf{H}_{\mathbf{X}}\tilde{\mathbf{Y}})'(\mathbf{H}_{\mathbf{X}}\tilde{\mathbf{Y}}) = \tilde{\mathbf{Y}}'\mathbf{H}_{\mathbf{X}}\tilde{\mathbf{Y}}$

The unexplained sums of products can be written in a similar manner:

$\mathbf{E} = (\mathbf{R}_{\mathbf{X}}\tilde{\mathbf{Y}})'(\mathbf{R}_{\mathbf{X}}\tilde{\mathbf{Y}}) = \tilde{\mathbf{Y}}'\mathbf{R}_{\mathbf{X}}\tilde{\mathbf{Y}}$

The term $\mathbf{H}+\mathbf{E}$ in the Pillai’s trace can therefore be rewritten as:

$\mathbf{H}+\mathbf{E}= \tilde{\mathbf{Y}}'(\mathbf{H}_{\mathbf{X}} + \mathbf{R}_{\mathbf{X}})\tilde{\mathbf{Y}} = \tilde{\mathbf{Y}}'\tilde{\mathbf{Y}}$

Using the property that the trace of a product is invariant to a circular permutation of the factors, Pillai’s statistic can then be written as:

$\begin{array}{ccl} T&=&\mathsf{trace}\left(\tilde{\mathbf{Y}}'\mathbf{H}_{\mathbf{X}}\tilde{\mathbf{Y}}\left(\tilde{\mathbf{Y}}'\tilde{\mathbf{Y}}\right)^{+}\right)\\ {}&=&\mathsf{trace}\left(\mathbf{H}_{\mathbf{X}}\tilde{\mathbf{Y}}\left(\tilde{\mathbf{Y}}'\tilde{\mathbf{Y}}\right)^{+}\tilde{\mathbf{Y}}'\right)\\ {}&=&\mathsf{trace}\left(\mathbf{H}_{\mathbf{X}}\tilde{\mathbf{Y}}\tilde{\mathbf{Y}}^{+}\right)\\ \end{array}$

## The final, alternative form

Using sigular value decomposition we have $\tilde{\mathbf{Y}} = \mathbf{U}\mathbf{S}\mathbf{V}'$ and $\tilde{\mathbf{Y}}^{+} = \mathbf{V}\mathbf{S}^{+}\mathbf{U}'$, where $\mathbf{U}$ contains only the columns that correspond to non-zero eigenvalues. Thus, the above can be rewritten as:

$\begin{array}{ccl} T&=&\mathsf{trace}\left(\mathbf{H}_{\mathbf{X}} \mathbf{U}\mathbf{S}\mathbf{V}' \mathbf{V}\mathbf{S}^{+}\mathbf{U}' \right)\\ {}&=&\mathsf{trace}\left(\mathbf{H}_{\mathbf{X}} \mathbf{U}\mathbf{U}' \right)\\ \end{array}$

The SVD transformation is useful for languages or libraries that offer a fast implementation. Otherwise, using a pseudoinverse yields the same result, perhaps only slightly slower. In this case, $T=\mathsf{trace}\left(\mathbf{H}_{\mathbf{X}}\tilde{\mathbf{Y}}\tilde{\mathbf{Y}}^{+}\right)$.

## Importance

If we define $\mathbf{A}\equiv\mathbf{H}_{\mathbf{X}}$ and $\mathbf{W}\equiv\mathbf{U}\mathbf{U}'$ (or $\mathbf{W}\equiv\tilde{\mathbf{Y}}\tilde{\mathbf{Y}}^{+}$), then $T=\mathsf{trace}\left(\mathbf{A}\mathbf{W}\right)$. The first three moments of the permutation distribution of statistics that can be written in this form can be computed analytically once $\mathbf{A}$ and $\mathbf{W}$ are known. With the first three moments, a gamma distribution (Pearson type III) can be fit, thus allowing p-values to be computed without resorting to the usual beta approximation to Pillai’s trace, nor using permutations, yet with results that are not based on the assumption of normality (Mardia, 1971; Kazi-Aoual, 1995; Minas and Montana, 2014).

## Availability

This simplification is available in PALM, for use with imaging and non-imaging data, using Pillai’s trace itself, or a modification that allows inference on univariate statistics. As of today, this option is not yet documented, but should become openly available soon.

## References

Update: 20.Jan.2016: A slight simplification was applied to the formulas above so as to make them more elegant and remove some redundancy. The result is the same.

# The lady tasting tea experiment

Can you tell?

The now famous story is that in an otherwise unremarkable summer afternoon in Cambridge in the 1920’s, a group of friends eventually discussed about the claims made by one of the presents about her abilities on discriminating whether milk was poured first or last when preparing a cup of tea with milk. One of the presents was Ronald Fisher, and the story, along with a detailed description of how to conduct a simple experiment to test the claimed ability, and how to obtain an exact solution, was presented at length in the Chapter 2 of his book The Design of Experiments, a few lines of which are quoted below:

A lady declares that by tasting a cup of tea made with milk she can discriminate whether the milk or the tea infusion was first added to the cup. We will consider the problem of designing an experiment by means of which this assertion can be tested. […] [It] consists in mixing eight cups of tea, four in one way and four in the other, and presenting them to the subject for judgment in a random order. The subject has been told in advance of that the test will consist, namely, that she will be asked to taste eight cups, that these shall be four of each kind […]. — Fisher, 1935.

There are $\frac{8!}{4!4!}=70$ distinct possible orderings of these cups, and by telling the subject in advance that there are four cups of each type, this guarantees that the answer will include four of each.

The lady in question eventually answered correctly six out of the eight trials. The results can be assembled in a 2 by 2 contingency table:

True order: Total (margin)
Tea first Milk first
Lady’s Guesses: Tea first $a=3$ $b=1$ $a+b=4$
Milk first $c=1$ $d=3$ $c+d=4$
Total (margin) $a+c=4$ $b+d=4$ $n=8$

With these results, what should be concluded about the ability of the lady in discriminating whether milk or tea was poured first? It is not possible to prove that she would never be wrong, because if a sufficiently large number of cups of tea were offered, a single failure would disprove such hypothesis. However, a test that she is never right can be disproven, with a certain margin of uncertainty, given the number of cups offered.

## Solution using Fisher’s exact method

Fisher presented an exact solution for this experiment. It is exact in the sense that it allows an exact probability to be assigned to each of the possible outcomes. The probability can be calculated as:

$P_{\text{Fisher}} = \dfrac{(a+b)!(c+d)!(a+c)!(b+d)!}{n!a!b!c!d!}$

For the particular configuration of the contingency table above, the probability is $\frac{16}{70} = 0.22857$. This is not the final result, though: what matters to disprove the hypothesis that she is not able to discriminate is how likely it would be for her to find a result at least as extreme as the one observed. In this case, there is one case that is more extreme, which would be the one in which she would have made correct guesses for all the 8 cups, in which case the values in the contingency table above would have been $a = 4$, $b = 0$, $c = 0$, and $d = 4$, with a probability computed with the same formula as $\frac{1}{70} = 0.01429$. Adding these two probabilities together yield $\frac{16+1}{70}=0.24286$.

Thus, if the lady is not able to discriminate whether tea or milk was poured first, the chance of observing a result at least as favourable towards her claim would be 0.24286, i.e., about 24%.

If from the outset we were willing to consider a significance level 0.05 (5%) as an informal rule to disprove the null hypothesis, we would have considered the p-value = 0.24286 as non-significant. This p-value is exact, a point that will become more clear below, in the section about permutation tests.

## Using the hypergeometric distribution directly

The above process can become laborious for experiments with larger number of trials. An alternative, but equivalent solution, is to appeal directly to the hypergeometric distribution. The probability mass function (pmf) of this distribution can be written as a function the parameters of the contingency table as:

$P(X=a) = \dfrac{\binom{a+b}{a}\binom{c+d}{c}}{\binom{n}{a+c}}$

The pmf is equivalent to Fisher’s exact formula to compute the probability of a particular configuration. The cumulative density function, which is conditional on the margins being fixed, is:

$P(X \geqslant a) = \sum_{j=a}^{J}\dfrac{\binom{a+b}{j}\binom{c+d}{a+c-j}}{\binom{n}{a+c}} = \sum_{j=a}^{J}\dfrac{(a+b)!(c+d)!(a+c)!(b+d)!}{n!j!(a+b-j)!(a+c-j)!(d-a+j)!}$

where $J=\min(a+c,a+b)$. Computing from the above (details omitted), yield the same value as using Fisher’s presentation, that is, the p-value is (exactly) 0.24286.

## Solution using Pearson’s $\chi^2$ method

Much earlier than the tea situation described above, Karl Pearson had already considered the problem of inference in contingency tables, having proposed a test based on a $\chi^2$ statistic.

$\chi^2 = \sum_{i=1}^{R}\sum_{j=1}^{C}\dfrac{(O_{ij}-E_{ij})^2}{E_{ij}}$

where $O_{ij}$ is the observed value for the element in the position $(i,j)$ in the table, $R$ and $C$ are respectively the number of rows and columns, and $E_{ij}$ is the expected value for these elements if the null hypothesis is true. The values $E_{ij}$ can be computed as the product of the marginals for row $i$ and column $j$, divided by the overall number of observations $n$. A simpler, equivalent formula is given by:

$\chi^2 = \dfrac{n(ad - bc)^2}{(a+b)(c+d)(a+c)(b+d)}$

A p-value can be computed from the $\chi^2$ distribution with degrees of freedom $\nu=(R-1)(C-1)$.

Under the null, we can expect a value equal to 2 in each of the 2 cells, that is, the lady would for each cup have a 50:50 chance of answering correctly. For the original tea tasting experiment, Pearson’s method give quite inaccurate results: $\chi^2=2$, which corresponds to a p-value of 0.07865. However, it is well known that this method is inaccurate if cells in the table have too small quantities, usually below 5 or 6.

## Improvement using Yates’ continuity correction

To solve the above well-known issue with small quantities, Yates (1934) proposed a correction, such that the test statistic becomes:

$\chi^2 = \sum_{i,j}\dfrac{\left(|O_{ij}-E_{ij}|-\frac{1}{2}\right)^2}{E_{ij}}$

Applying this correction to the original tea experiment gives $\chi^2=0.5$, and a p-value of 0.23975, which is very similar to the one given by the Fisher method. Note again that this approach, like the $\chi^2$ test, predates Fisher’s exact test.

## Equivalence of Fisher’s exact test and permutation tests

The method proposed by Fisher corresponds to a permutation test. Let $\mathbf{x}$ be a column vector containing binary indicators for whether milk was truly poured first. Let $\mathbf{y}$ be a column vector containing binary indicators for whether the lady answered that milk was poured first. The general linear model (GLM) can be used as usual, such that $\mathbf{y}=\mathbf{x}\beta + \boldsymbol{\epsilon}$, where $\beta$ is a regression coefficient, and $\boldsymbol{\epsilon}$ are the residuals.

Under the null hypothesis that the lady cannot discriminate, the binary values in $\mathbf{y}$ can be permuted randomly. There are $\binom{8}{4}=70$ possible unique rearrangements. Out of these, in 17, there are 6 or more (out of 8) correct answers matching the values in $\mathbf{x}$, which gives a p-value 17/70 = 0.24286.

Note that the strategy using the GLM can be used even if both variables $\mathbf{x}$ and $\mathbf{y}$ are binary, as in the example of the tea tasting, even if the residuals are not normally distributed (permutation tests do not depend on distributional assumptions), and even considering that values in $\beta$ can lead to non-sensical predictions in $\mathbf{y}$, as prediction is not the objective of this analysis.

## Why not a binomial test?

The binomial test could be considered if the lady did not know in advance that there were 4 cups of each mixture order. Since she knew (so the two margins of the table are fixed), each cup was not independent from each other, and her possible answers had to be constrained by answers previously given. The binomial test assumes independence, thus, is not an option for this analysis.

## Relevance

Using this simple experiment, Fisher established most of the fundamental principles for hypothesis testing, which contributed to major advances across biological and physical sciences. A careful read of the original text shows a precise use of terms, in a concise and unambiguous presentation, in contrast with many later and more verbose textbooks that eventually hid from readers most of the fundamental principles.

## References

The photograph at the top (tea with milk) is in public domain.

# Variance components in genetic analyses

Pedigree-based analyses allow investigation of genetic and environmental influences on anatomy, physiology, and behaviour.

Methods based on components of variance have been used extensively to assess genetic influences and identify loci associated with various traits quantifying aspects of anatomy, physiology, and behaviour, in both normal and pathological conditions. In an earlier post, indices of genetic resemblance between relatives were presented, and in the last post, the kinship matrix was defined. In this post, these topics are used to present a basic model that allows partitioning of the phenotypic variance into sources of variation that can be ascribed to genetic, environmental, and other factors.

## A simple model

Consider the model:

$\mathbf{Y} = \mathbf{X}\mathbf{B} + \boldsymbol{\Upsilon}$

where, for $S$ subjects, $T$ traits, $P$ covariates and $K$ variance components, $\mathbf{Y}_{S \times T}$ are the observed trait values for each subject, $\mathbf{X}_{S \times P}$ is a matrix of covariates, $\mathbf{B}_{P \times T}$ is a matrix of unknown covariates’ weights, and $\boldsymbol{\Upsilon}_{S \times T}$ are the residuals after the covariates have been taken into account.

The elements of each column $t$ of $\boldsymbol{\Upsilon}$ are assumed to follow a multivariate normal distribution $\mathcal{N}\left(0;\mathbf{S}\right)$, where $\mathbf{S}$ is the between-subject covariance matrix. The elements of each row $s$ of $\boldsymbol{\Upsilon}$ are assumed to follow a normal distribution $\mathcal{N}\left(0;\mathbf{R}\right)$, where $\mathbf{R}$ is the between-trait covariance matrix. Both $\mathbf{R}$ and $\mathbf{S}$ are seen as the sum of $K$ variance components, i.e. $\mathbf{R} = \sum_{k} \mathbf{R}_{k}$ and $\mathbf{S} = \sum_{k} \mathbf{S}_{k}$. For a discussion on these equalities, see Eisenhart (1947) [see references at the end].

## An equivalent model

The same model can be written in an alternative way. Let $\mathbf{y}_{S \cdot T \times 1}$ be the stacked vector of traits, $\mathbf{\tilde{X}}_{S \cdot T \times P \cdot T} = \mathbf{X} \otimes \mathbf{I}_{T \times T}$ is the matrix of covariates, $\boldsymbol{\beta}_{P \cdot T \times 1}$ the vector with the covariates’ weights, $\boldsymbol{\upsilon}_{S \cdot T \times 1}$ the residuals after the covariates have been taken into account, and $\otimes$ represent the Kronecker product. The model can then be written as:

$\mathbf{y} = \mathbf{\tilde{X}}\boldsymbol{\beta} + \boldsymbol{\upsilon}$

The stacked residuals $\boldsymbol{\upsilon}$ is assumed to follow a multivariate normal distribution $\mathcal{N}\left(0;\boldsymbol{\Omega}\right)$, where $\boldsymbol{\Omega}$ can be seen as the sum of $K$ variance components:

$\boldsymbol{\Omega} = \sum_{k} \mathbf{R}_k \otimes \mathbf{S}_k$

The $\boldsymbol{\Omega}$ here is the same as in Almasy and Blangero (1998). $\mathbf{S}_k$ can be modelled as correlation matrices. The associated scalars are absorbed into the (to be estimated) $\mathbf{R}_k$. $\mathbf{R}$ is the phenotypic covariance matrix between the residuals of the traits:

$\mathbf{R} = \left[ \begin{array}{ccc} \mathsf{Var}(\upsilon_1) & \cdots & \mathsf{Cov}(\upsilon_1,\upsilon_T) \\ \vdots & \ddots & \vdots \\ \mathsf{Cov}(\upsilon_T,\upsilon_1) & \cdots & \mathsf{Var}(\upsilon_T) \end{array}\right]$

whereas each $\mathbf{R}_k$ are the share of these covariances attributable to the $k$-th component:

$\mathbf{R}_k = \left[ \begin{array}{ccccc} \mathsf{Var}_k(\upsilon_1) & \cdots & \mathsf{Cov}_k(\upsilon_1,\upsilon_T) \\ \vdots & \ddots & \vdots \\ \mathsf{Cov}_k(\upsilon_T,\upsilon_1) & \cdots & \mathsf{Var}_k(\upsilon_T) \end{array}\right]$

$\mathbf{R}$ can be converted to a between-trait phenotypic correlation matrix $\mathbf{\mathring{R}}$ as:

$\mathbf{\mathring{R}} = \left[ \begin{array}{ccc} \frac{\mathsf{Var}(\upsilon_1)}{\mathsf{Var}(\upsilon_1)} & \cdots & \frac{\mathsf{Cov}(\upsilon_1,\upsilon_T)}{\left(\mathsf{Var}(\upsilon_1)\mathsf{Var}(\upsilon_T)\right)^{1/2}} \\ \vdots & \ddots & \vdots \\ \frac{\mathsf{Cov}(\upsilon_1,\upsilon_T)}{\left(\mathsf{Var}(\upsilon_1)\mathsf{Var}(\upsilon_T)\right)^{1/2}} & \cdots & \frac{\mathsf{Var}(\upsilon_T)}{\mathsf{Var}(\upsilon_T)} \end{array}\right]$

The above phenotypic correlation matrix has unit diagonal and can still be fractioned into their $K$ components:

$\mathbf{\mathring{R}}_k = \left[ \begin{array}{ccc} \frac{\mathsf{Var}_k(\upsilon_1)}{\mathsf{Var}(\upsilon_1)} & \cdots & \frac{\mathsf{Cov}_k(\upsilon_1,\upsilon_T)}{\left(\mathsf{Var}(\upsilon_1)\mathsf{Var}(\upsilon_T)\right)^{1/2}} \\ \vdots & \ddots & \vdots \\ \frac{\mathsf{Cov}_k(\upsilon_T,\upsilon_1)}{\left(\mathsf{Var}(\upsilon_T)\mathsf{Var}(\upsilon_1)\right)^{1/2}} & \cdots & \frac{\mathsf{Var}_k(\upsilon_T)}{\mathsf{Var}(\upsilon_T)} \end{array}\right]$

The relationship $\mathbf{\mathring{R}} = \sum_k \mathbf{\mathring{R}}_k$ holds. The diagonal elements of $\mathbf{\mathring{R}}_k$ may receive particular names, e.g., heritability, environmentability, dominance effects, shared enviromental effects, etc., depending on what is modelled in the corresponding $\mathbf{S}_k$. However, the off-diagonal elements of $\mathbf{\mathring{R}}_k$ are not the $\rho_k$ that correspond, e.g. to the genetic or environmental correlation. These off-diagonal elements are instead the signed $\text{\textsc{erv}}$ when $\mathbf{S}_k=2\cdot\boldsymbol{\Phi}$, or their $\text{\textsc{erv}}_k$-equivalent for other variance components (see below). In this particular case, they can also be called “bivariate heritabilities” (Falconer and MacKay, 1996). A matrix $\mathbf{\breve{R}}_k$ that contains these correlations $\rho_k$, which are the fraction of the variance attributable to the $k$-th component that is shared between pairs of traits is given by:

$\mathbf{\breve{R}}_k = \left[ \begin{array}{ccc} \frac{\mathsf{Var}_k(\upsilon_1)}{\mathsf{Var}_k(\upsilon_1)} & \cdots & \frac{\mathsf{Cov}_k(\upsilon_1,\upsilon_T)}{\left(\mathsf{Var}_k(\upsilon_1)\mathsf{Var}_k(\upsilon_T)\right)^{1/2}} \\ \vdots & \ddots & \vdots \\ \frac{\mathsf{Cov}_k(\upsilon_T,\upsilon_1)}{\left(\mathsf{Var}_k(\upsilon_T)\mathsf{Var}_k(\upsilon_1)\right)^{1/2}} & \cdots & \frac{\mathsf{Var}_k(\upsilon_T)}{\mathsf{Var}_k(\upsilon_T)} \end{array}\right]$

As for the phenotypic correlation matrix, each $\mathbf{\breve{R}}_k$ has unit diagonal.

## The most common case

A particular case is when $\mathbf{S}_1 = 2\cdot\boldsymbol{\Phi}$, the coefficient of familial relationship between subjects, and $\mathbf{S}_2 = \mathbf{I}_{S \times S}$. In this case, the $T$ diagonal elements of $\mathbf{\mathring{R}}_1$ represent the heritability ($h_t^2$) for each trait $t$. The diagonal of $\mathbf{\mathring{R}}_2$ contains $1-h_t^2$, the environmentability. The off-diagonal elements of $\mathbf{\mathring{R}}_1$ contain the signed $\text{\textsc{erv}}$ (see below). The genetic correlations, $\rho_g$ are the off-diagonal elements of $\mathbf{\breve{R}}_1$, whereas the off-diagonal elements of $\mathbf{\breve{R}}_2$ are $\rho_e$, the environmental correlations between traits. In this particular case, the components of $\mathbf{R}$, i.e., $\mathbf{R}_k$ are equivalent to $\mathbf{G}$ and $\mathbf{E}$ covariance matrices as in Almasy et al (1997).

## Relationship with the ERV

To see how the off-diagonal elements of $\mathbf{\mathring{R}}_k$ are the signed Endophenotypic Ranking Values for each of the $k$-th variance component, $\text{\textsc{erv}}_k$ (Glahn et al, 2011), note that for a pair of traits $i$ and $j$:

$\mathring{R}_{kij} = \frac{\mathsf{Cov}_k(\upsilon_i,\upsilon_j)}{\left(\mathsf{Var}(\upsilon_i)\mathsf{Var}(\upsilon_j)\right)^{1/2}}$

Multiplying both numerator and denominator by $\left(\mathsf{Var}_k(\upsilon_i)\mathsf{Var}_k(\upsilon_j)\right)^{1/2}$ and rearranging the terms gives:

$\mathring{R}_{kij} = \frac{\mathsf{Cov}_k(\upsilon_i,\upsilon_j)}{\left(\mathsf{Var}_k(\upsilon_i)\mathsf{Var}_k(\upsilon_j)\right)^{1/2}} \left(\frac{\mathsf{Var}_k(\upsilon_i)}{\mathsf{Var}(\upsilon_i)}\frac{\mathsf{Var}_k(\upsilon_j)}{\mathsf{Var}(\upsilon_j)}\right)^{1/2}$

When $\mathbf{S}_k = 2\cdot\boldsymbol{\Phi}$, the above reduces to $\mathring{R}_{kij} = \rho_k \sqrt{h^2_i h^2_j}$, which is the signed version of $\text{\textsc{erv}}=\left|\rho_g\sqrt{h_i^2h_j^2}\right|$ when $k$ is the genetic component.

## Positive-definiteness

$\mathbf{R}$ and $\mathbf{R}_k$ are covariance matrices and so, are positive-definite, whereas the correlation matrices $\mathbf{\mathring{R}}$, $\mathbf{\mathring{R}}_k$ and $\mathbf{\breve{R}}_k$ are positive-semidefinite. A hybrid matrix that does not have to be positive-definite or semidefinite is:

$\mathbf{\check{R}}_k = \mathbf{I} \odot \mathbf{\mathring{R}}_k + \left(\mathbf{J}-\mathbf{I}\right) \odot \mathbf{\breve{R}}_k$

where $\mathbf{J}$ is a matrix of ones, $\mathbf{I}$ is the identity, both of size $T \times T$, and $\odot$ is the Hadamard product. An example of such matrix of practical use is to show concisely the heritabilities for each trait in the diagonal and the genetic correlations in the off-diagonal.

## Cauchy-Schwarz

Algorithmic advantages can be obtained from the positive-definiteness of $\mathbf{\mathring{R}}_k$. The Cauchy–Schwarz theorem (Cauchy, 1821; Schwarz, 1888) states that:

$\mathring{R}_{kij} \leqslant \sqrt{\mathring{R}_{kii}\mathring{R}_{kjj}}$

Hence, the bounds for the off-diagonal elements can be known from the diagonal elements, which, by their turn, are estimated in a simpler, univariate model.

The Cauchy-Schwarz inequality imposes limits on the off-diagonal values of the matrix that contains the genetic covariances (or bivariate heritabilities).

## Parameter estimation

Under the multivariate normal assumption, the parameters can be estimated maximising the following loglikelihood function:

$\mathcal{L}\left(\mathbf{R}_k,\boldsymbol{\beta}\Big|\mathbf{y},\mathbf{\tilde{X}}\right) = -\frac{1}{2} \left(N \ln 2\pi + \ln \left|\boldsymbol{\Omega}\right| + \left(\mathbf{y}-\mathbf{\tilde{X}}\boldsymbol{\beta}\right)'\boldsymbol{\Omega}\left(\mathbf{y}-\mathbf{\tilde{X}}\boldsymbol{\beta}\right)\right)$

where $N=S \cdot T$ is the number of observations on the stacked vector $\mathbf{y}$. Unbiased estimates for $\boldsymbol{\beta}$, although inefficient and inappropriate for hypothesis testing, can be obtained with ordinary least squares (OLS).

## Parametric inference

Hypothesis testing can be performed with the likelihood ratio test (LRT), i.e., the test statistic is produced by subtracting from the loglikelihood of the model in which all the parameters are free to vary ($\mathcal{L}_1$), the loglikelihood of a model in which the parameters being tested are constrained to zero, the null model ($\mathcal{L}_0$). The statistic is given by $\lambda = 2\left(\mathcal{L}_1-\mathcal{L}_0\right)$ (Wilks, 1938), which here is asymptotically distributed as a 50:50 mixture of a $\chi^2_0$ and $\chi^2_{\text{df}}$ distributions, where df is the number of parameters being tested and free to vary in the unconstrained model (Self and Liang, 1987). From this distribution the p-values can be obtained.

## References

The photograph at the top (elephants) is by Anja Osenberg and was generously released into public domain.

# Understanding the kinship matrix

Coefficients to assess the genetic resemblance between individuals were presented in the last post. Among these, the coefficient of kinship, $\phi$, is probably the most interesting. It gives a probabilistic estimate that a random gene from a given subject $i$ is identical by descent (ibd) to a gene in the same locus from a subject $j$. For $N$ subjects, these probabilities can be assembled in a $N \times N$ matrix termed kinship matrix, usually represented as $\mathbf{\Phi}$, that has elements $\phi_{ij}$, and that can be used to model the covariance between individuals in quantitative genetics.

Consider the pedigree in the figure below, consisted of 14 subjects:

The corresponding kinship matrix, already multiplied by two to indicate expected covariances between subjects (i.e., $2\cdot\mathbf{\Phi}$), is:

Note that the diagonal elements can have values above unity, given the consanguineous mating in the family (between s09 and s12, indicated by the double line in the pedigree diagram).

In the next post, details on how the kinship matrix can be used investigate heritabilities, genetic correlations, and to perform association studies will be presented.

# Genetic resemblance between relatives

How similar?

The degree of relationship between two related individuals can be estimated by the probability that a gene in one subject is identical by descent to the corresponding gene (i.e., in the same locus) in the other. Two genes are said to be identical by descent (ibd) if both are copies of the same ancestral gene. Genes that are not ibd may still be identical through separate mutations, and be therefore identical by state (ibs), though these will not be considered in what follows.

The coefficients below were introduced by Jacquard in 1970, in a book originally published in French, and translated to English in 1974. A similar content appeared in an article by the same author in the journal Biometrics in 1972 (see the references at the end).

## Coefficients of identity

Consider a particular autosomal gene $G$. Each individual has two copies, one from paternal, another from maternal origin; these can be indicated as $G_i^P$ and $G_i^M$ for individual $i$. There are 15 exactly distinct ways (states) in which the $G$ can be identical or not identical between two individuals, as shown in the figure below.

To each of these states $S_{1, \ldots , 15}$, a respective probability $\delta_{1, \ldots , 15}$ can be assigned; these are called coefficients of identity by descent. These probabilities can be calculated at every generation following very elementary rules. For most problems, however, the distinction between paternal and maternal origin of a gene is irrelevant, and some of the above states are equivalent to others. If these are condensed, we can retain 9 distinct ways, shown in the figure below:

As before, to each of these states $\Sigma_{1, \ldots , 9}$, a respective probability $\Delta_{1, \ldots , 9}$ can be assigned; these are called condensed coefficients of identity by descent, and relate to the former as:

• $\Delta_1 = \delta_1$
• $\Delta_2 = \delta_6$
• $\Delta_3 = \delta_2 + \delta_3$
• $\Delta_4 = \delta_7$
• $\Delta_5 = \delta_4 + \delta_5$
• $\Delta_6 = \delta_8$
• $\Delta_7 = \delta_9 + \delta_{12}$
• $\Delta_8 = \delta_{10} + \delta_{11} + \delta_{13} + \delta_{14}$
• $\Delta_9 = \delta_{15}$

A similar method was proposed by Cotterman (1940), in his highly influential but only much later published doctoral thesis. The $\Delta_9$, $\Delta_8$ and $\Delta_7$ correspond to his coefficients $k_0$, $k_1$ and $k_2$.

## Coefficient of kinship

The above refer to probabilities of finding particular genes as identical among subjects. However, a different coefficient can be defined for random genes: the probability that a random gene from subject $i$ is identical with a gene at the same locus from subject $j$ is the coefficient of kinship, and can be represented as $\phi_{ij}$:

• $\phi_{ij} = \Delta_1 + \frac{1}{2}(\Delta_3 + \Delta_5 + \Delta_7) + \frac{1}{4}\Delta_8$

If $i$ and $j$ are in fact the same individual, then $\phi_{ii}$ is the kinship of a subject with himself. Two genes taken from the same individual can either be the same gene (probability $\frac{1}{2}$ of being the same) or be the genes inherited from father and mother, in which case the probability is given by the coefficient of kinship between the parents. In other words, $\phi_{ii} = \frac{1}{2} + \frac{1}{2}\phi_{\text{FM}}$. If both parents are unrelated, $\phi_{\text{FM}}=0$, such that the kinship of a subject with himself is $\phi_{ii} = \frac{1}{2}$.

The value of $\phi_{ij}$ can be determined from the number of generations up to a common ancestor $k$. A random gene from individual $i$ can be identical to a random gene from individual $j$ in the same locus if both comes from the common ancestor $k$, an event that can happen if either (1) both are copies of the gene in $k$, or (2) if they are copies of different genes in $k$, but $k$ is inbred; this has probability $\frac{1}{2}f_k$ (see below about the coefficient of inbreeding, $f$). Thus, if there are $m$ generations between $i$ and $k$, and $n$ generations between $j$ and $k$, the coefficient of kinship can be computed as $\phi_{ij} = \left(\frac{1}{2}\right)^{m+n+1}(1+f_k)$. If $i$ and $j$ can have more than one common ancestor, then there are more than one line of descent possible, and the kinship is determined by integrating over all such possible $K$ common ancestors:

• $\phi_{ij} = \sum_{k=1}^K \left(\frac{1}{2}\right)^{m_k+n_k+1}(1+f_k)$

For a set of subjects, the pairwise coefficients of kinship $\phi_{ij}$ can be arranged in a square matrix $\boldsymbol{\Phi}$, and used to model the covariance between subjects as $2\cdot\boldsymbol{\Phi}$ (see here).

## Coefficient of inbreeding

The coefficient of inbreeding $f$ of a given subject $i$ is the coefficient of kinship between their parents. While the above coefficients provide information about pairs of individuals, the coefficient of inbreeding gives information about a particular subject. Yet, $f_i$ can be computed from the coefficients of identity:

• $f_{i} = \Delta_1 + \Delta_2 + \Delta_3 + \Delta_4$
• $f_{j} = \Delta_1 + \Delta_2 + \Delta_5 + \Delta_6$

Note that all these coefficients are based on probabilities, but it is now possible to identify the actual presence of a particular gene using marker data. Also note that while the illustrations above suggest application to livestock, the same applies to studies of human populations.

## Some particular cases

The computation of the above coefficients can be done using algorithms, and are done automatically by software that allow analyses of pedigree data, such as solar. Some common particular cases are shown below:

Relationship $\Delta_1$ $\Delta_2$ $\Delta_3$ $\Delta_4$ $\Delta_5$ $\Delta_6$ $\Delta_7$ $\Delta_8$ $\Delta_9$ $\phi_{ij}$
Self $0$ $0$ $0$ $0$ $0$ $0$ $1$ $0$ $0$ $\frac{1}{2}$
Parent-offspring $0$ $0$ $0$ $0$ $0$ $0$ $0$ $1$ $0$ $\frac{1}{4}$
Half sibs $0$ $0$ $0$ $0$ $0$ $0$ $0$ $\frac{1}{2}$ $\frac{1}{2}$ $\frac{1}{8}$
Full sibs/dizygotic twins $0$ $0$ $0$ $0$ $0$ $0$ $\frac{1}{4}$ $\frac{1}{2}$ $\frac{1}{4}$ $\frac{1}{4}$
Monozygotic twins $0$ $0$ $0$ $0$ $0$ $0$ $1$ $0$ $0$ $\frac{1}{2}$
First cousins $0$ $0$ $0$ $0$ $0$ $0$ $0$ $\frac{1}{4}$ $\frac{3}{4}$ $\frac{1}{16}$
Double first cousins $0$ $0$ $0$ $0$ $0$ $0$ $\frac{1}{16}$ $\frac{6}{16}$ $\frac{9}{16}$ $\frac{1}{8}$
Second cousins $0$ $0$ $0$ $0$ $0$ $0$ $0$ $\frac{1}{16}$ $\frac{15}{16}$ $\frac{1}{64}$
Uncle-nephew $0$ $0$ $0$ $0$ $0$ $0$ $0$ $\frac{1}{2}$ $\frac{1}{2}$ $\frac{1}{8}$
Offspring of sib-matings $\frac{1}{16}$ $\frac{1}{32}$ $\frac{1}{8}$ $\frac{1}{32}$ $\frac{1}{8}$ $\frac{1}{32}$ $\frac{7}{32}$ $\frac{5}{16}$ $\frac{1}{16}$ $\frac{3}{8}$

## References

• Cotterman C. A calculus for statistico-genetics. 1940. PhD Thesis. Ohio State University.
• Jacquard, A. Structures génétiques des populations. Masson, Paris, France, 1970, later translated and republished as Jacquard, A. The genetic structure of populations. Springer, Heidelberg, 1974.
• Jacquard A. Genetic information given by a relative. Biometrics. 1972;28(4):1101-1114.

The photograph at the top (sheep) is in public domain.

# The NIFTI-2 file format

In a previous post, the nifti-1 file format was presented. An update of this format has recently been produced by the Data Format Working Group (dfwg). The updated version retains generally the same amount of information as the previous, with the crucial difference that it allows far more datapoints in each dimension, thus permitting that the same overall file structure is used to store, for instance, surface-based scalar data, or large connectivity matrices. Neither of these had originally been intended at the time the analyze or the nifti-1 formats were developed. While packages as FreeSurfer developed their own formats for surface-based scalar data, a more general solution was still pending.

## Compatible, but not as before

Users who participated of the transition from analyze to nifti-1 may remember that the same libraries used to read analyze would read nifti, perhaps with a few minor difficulties, but the bulk of the actual data would be read by most analyze-compliant applications. This was possible because a large part of the relevant information in the header was kept exactly in the same position counted from the beginning of the file. An application could read information at a given byte position and locate it without error, or without finding something else.

This time things are different. While a large degree of compatibility exists, this compatibility helps more the developer than the end user. If before, an application made to read only analyze could read nifti-1, this time an application made to read nifti-1 will not read nifti-2 without a few, even if minor, changes to the application source code. To put in other words, the new version of the format is not bitwise compatible with the previous one. The reasons for this “almost compatibility” will become clear below.

## Changing types

The limitation that became evident with the new uses found for the nifti format refer particularly to the maximum number of points (e.g., voxels) in each dimension. This limitation stems from the field short dim[8], which allows only 2 bytes (16 bits) for each dimension; since only positive values are accepted (short is signed), this imposes a cap: no more than 215-1 = 32767 voxels per dimension. In the nifti-2 format, this was replaced by int64_t dim[8], which guarantees 8 bytes (64 bits) per dimension, and so, a much larger number of points per dimension, that is, 263-1 = 9,223,372,036,854,775,807.

This change alone already renders the nifti-2 not bitwise compatible with the nifti-1. Yet, other changes were made, some as a consequence of the modifications to dim[8], such as slice_start and slice_end, both too promoted from short to int64_t. Other changes were made so as to improve the general ability to store data with higher precision. A complete table listing the modifications of the field types is below:

NIFTI-1 NIFTI-2
short dim[8] int64_t dim[8]
float intent_p1 double intent_p1
float intent_p2 double intent_p2
float intent_p3 double intent_p3
float pixdim[8] double pixdim[8]
float vox_offset int64_t vox_offset
float scl_slope double scl_slope
float scl_inter double scl_inter
float cal_max double cal_max
float cal_min double cal_min
float slice_duration double slice_duration
float toffset double toffset
short slice_start int64_t slice_start
short slice_end int64_t slice_end
char slice_code int32_t slice_code
char xyzt_units int32_t xyzt_units
short intent_code int32_t intent_code
short qform_code int32_t qform_code
short sform_code int32_t sform_code
float quatern_b double quatern_b
float quatern_c double quatern_c
float quatern_d double quatern_d
float srow_x double srow_x
float srow_y double srow_y
float srow_z double srow_z
char magic[4] char magic[8]

## Fields removed, fields reordered, fields added

Seven fields that only existed in the nifti-1 for compatibility with the old analyze format were removed entirely. These are:

• char data_type[10]
• char db_name[18]
• int extents
• short session_error
• char regular
• int glmax
• int glmax

Another change is that the fields were reordered, which is an improvement over the nifti-1: the magic string, for instance, is now at the beginning of the file, which helps testing what kind of file it is. All constraints that were imposed on the nifti-1 to allow compatibility with the analyze were finally dropped. At the far end of the header, a field with 15 bytes was included for padding the header to a total size of 540, and to ensure 16-byte alignment after the 4 final bytes that indicate extra information are included.

## Overview of the new header structure

With the modifications above, the overall structure of the he nifti-2 became:

Type Name Offset Size Description
int sizeof_hdr 0B 4B Size of the header. Must be 540 (bytes).
char magic[8] 4B 8B Magic string, defining a valid signature.
int16_t data_type 12B 2B Data type.
int16_t bitpix 14B 2B Number of bits per voxel.
int64_t dim[8] 16B 64B Data array dimensions.
double intent_p1 80B 8B 1st intent parameter.
double intent_p2 88B 8B 2nd intent parameter.
double intent_p3 96B 8B 3rd intent parameter.
double pixdim[8] 104B 64B Grid spacings (unit per dimension).
int64_t vox_offset 168B 8B Offset into a .nii file.
double scl_slope 176B 8B Data scaling, slope.
double scl_inter 184B 8B Data scaling, offset.
double cal_max 192B 8B Maximum display intensity.
double cal_min 200B 8B Minimum display intensity.
double slice_duration 208B 8B Time for one slice.
double toffset 216B 8B Time axis shift.
int64_t slice_start 224B 8B First slice index.
int64_t slice_end 232B 8B Last slice index.
char descrip[80] 240B 80B Any text.
char aux_file[24] 320B 24B Auxiliary filename.
int qform_code 344B 4B Use the quaternion fields.
int sform_code 348B 4B Use of the affine fields.
double quatern_b 352B 8B Quaternion b parameter.
double quatern_c 360B 8B Quaternion c parameter.
double quatern_d 368B 8B Quaternion d parameter.
double qoffset_x 376B 8B Quaternion x shift.
double qoffset_y 384B 8B Quaternion y shift.
double qoffset_z 392B 8B Quaternion z shift.
double srow_x[4] 400B 32B 1st row affine transform.
double srow_y[4] 432B 32B 2nd row affine transform.
double srow_z[4] 464B 32B 3rd row affine transform.
int slice_code 496B 4B Slice timing order.
int xyzt_units 500B 4B Units of pixdim[1..4].
int intent_code 504B 4B nifti intent.
char intent_name[16] 508B 16B Name or meaning of the data.
char dim_info 524B 1B Encoding directions.
char unused_str[15] 525B 15B Unused, to be padded with with zeroes.
Total size 540B

## NIFTI-1 or NIFTI-2?

For the developer writing input/output functions to handle nifti files, a simple check can be used to test the version and the endianness of the file: the first four bytes are read (int sizeof_hdr): if equal to 348, then it is a nifti-1 file; if equal to 540, then it is a nifti-2 file. If equal to neither, then swap the bytes, as if reading in the non-native endianness, and repeat the test; if this time the size of the header is found as 348 or 540, the version is determined, and this also implies that all bytes in the file need to be swapped to match the endianness of the current architecture. If, however, the first four bytes do not contain 348 or 540 in either endianness, then it is not a valid nifti file.

Once the version and the endianness have been determined, if it is a nifti-1 file, jump to byte 344 and check if the content is 'ni1' (or '6E 69 31 00' in hexadecimal), indicating a pair .hdr/.img, or if it is 'n+1' ('6E 2B 31 00'), indicating a single .nii file. If, however, it is a nifti-2 file, just read the next 8 bytes and check if the content is 'n+2' ('6E 2B 32 00') followed by '0D 0A 1A 0A' (hex).

## Storing extra information

Just like the nifti-1, the four bytes after the end of the nifti-2 header are used to indicate extensions and more information. Thus, the actual data begins after the byte 544. See the post on the nifti-1 for details. The cifti-2 file format (used extensively by the Human Connectome Project) is built on top of the nifti-2 format, and uses this extra information.

The official definition of the nifti-2 format is available as a c header file (nifti2.h) here and mirrored here.