Very often we find ourselves handling large matrices and, for various reasons, may want or need to reduce the dimensionality of the data by retaining the most relevant principal components. A question that then arises is: how many principal components should be retained? That is, which principal components carry information that can be distinguished from random noise?

One simple heuristic method is the scree plot (Cattell, 1966): one computes the singular values of the matrix, plots them in descending order, and visually looks for some kind of "elbow" (or the starting point of the "scree") to be used as a threshold. Those singular vectors (principal components) whose corresponding singular values are larger than that threshold are retained; the others are discarded. Unfortunately, the scree test is too subjective, and worse, many realistic problems can produce matrices that have multiple such "elbows".

Many other methods exist to help select the number of principal components to retain. Some, for example, are based on information theory, on probabilistic formulations, or on hard thresholds related to the size of the matrix or to the (explained) variance of the data. In this article, the little-known **Wachter method** (Wachter, 1976) is presented.

Let $\mathbf{X}$ be an $n \times p$ matrix of random values with zero mean and variance $\sigma^2$. The density of the bulk spectrum of the eigenvalues $\lambda_j$, $j = 1, \ldots, p$, of $\frac{1}{n}\mathbf{X}'\mathbf{X}$ (i.e., the squared singular values of $\mathbf{X}/\sqrt{n}$) is:

$$f(\lambda) = \frac{\sqrt{(b-\lambda)(\lambda-a)}}{2\pi\sigma^2 y\lambda}, \quad a \leqslant \lambda \leqslant b,$$

and a point mass $1 - 1/y$ at the origin if $y > 1$, where $y = p/n$, $a = \sigma^2(1-\sqrt{y})^2$, and $b = \sigma^2(1+\sqrt{y})^2$ (Bai, 1999). This is the **Marčenko and Pastur (1967)** distribution. The cumulative distribution function (cdf) can be obtained by integration, i.e., $F(\lambda) = \int_a^{\lambda} f(u)\,\mathrm{d}u$. MATLAB/Octave functions that compute the cdf and its inverse are here: **mpcdf.m** and **mpinv.m**.
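For illustration, here is a minimal Python sketch of the same distribution. The helper names `mp_pdf` and `mp_cdf` are mine, not the linked **mpcdf.m**/**mpinv.m**; the sketch assumes the eigenvalue form of the distribution, unit variance by default, and obtains the cdf by simple numerical integration:

```python
import numpy as np

def mp_pdf(lam, y, sigma2=1.0):
    """Marchenko-Pastur density for the eigenvalues of (1/n) X'X, where X is
    n-by-p with i.i.d. zero-mean entries of variance sigma2 and y = p/n <= 1.
    Hypothetical helper, not the mpcdf.m linked in the text."""
    a = sigma2 * (1 - np.sqrt(y)) ** 2
    b = sigma2 * (1 + np.sqrt(y)) ** 2
    lam = np.atleast_1d(np.asarray(lam, dtype=float))
    dens = np.zeros_like(lam)
    inside = (lam > a) & (lam < b)
    dens[inside] = (np.sqrt((b - lam[inside]) * (lam[inside] - a))
                    / (2 * np.pi * sigma2 * y * lam[inside]))
    return dens

def mp_cdf(lam, y, sigma2=1.0, npts=20001):
    """CDF obtained by numerical integration of the density over [a, b]."""
    a = sigma2 * (1 - np.sqrt(y)) ** 2
    b = sigma2 * (1 + np.sqrt(y)) ** 2
    grid = np.linspace(a, b, npts)
    cdf = np.cumsum(mp_pdf(grid, y, sigma2)) * (grid[1] - grid[0])
    return np.interp(lam, grid, cdf, left=0.0, right=1.0)
```

A quick sanity check is that for $y \leqslant 1$ the density integrates to 1 over $[a, b]$, so `mp_cdf(b, y)` should be close to 1.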

If we know the probability of finding a singular value as large or larger in a random matrix, we have the means to judge whether an observed singular value is sufficiently interesting for its corresponding singular vector (principal component) to be retained. Simply computing the p-value, however, is not enough, because the distribution makes no reference to the position of the sorted singular values. A different procedure needs to be considered.

The core idea of the Wachter method is to use a QQ plot of the observed singular values versus the quantiles obtained from the inverse cdf of the Marčenko–Pastur distribution, and use any deviations from the identity line to help find the threshold that separates the "good" from the "unnecessary" principal components. That is, plot the sorted eigenvalues $\lambda_j$ versus the quantiles $F^{-1}\!\left(\frac{j - 1/2}{p}\right)$. Deviations from the identity line are evidence for excess variance not expected from random variables alone.

The function **wachter.m** computes the singular values of a given observed matrix, the singular values that would be expected were the matrix random, as well as the p-values for each singular value in either case. The observed and expected singular values can be used to build a QQ plot. The p-values can be used for a comparison, e.g., via the logarithm of their ratio. For example:

```
% For reproducibility, reset the random number generator.
% Use "rand" instead of "rng" to ensure compatibility with old versions.
rand('seed',0);
% Simulate data. See Gavish and Donoho (2014) for this example.
X = diag([1.7 2.5 zeros(1,98)]) + randn(100)*sqrt(.01);
% Compute the expected and observed singular values,
% as well as the respective cumulative probabilities (p-values).
% See the help of wachter.m for syntax.
[Exp,Obs,Pexp,Pobs] = wachter(X,[],false,true);
% Log of the ratio between observed and expected p-values.
% Large values are evidence for "good" components.
P_ratio = -log10(Pobs./Pexp);
% Plot the spectrum.
subplot(1,2,1);
plot(Obs,'x:');
title('Singular values');
xlabel('index (k)');
% Construct the QQ-plot.
subplot(1,2,2);
qqplot(Exp,Obs);
title('QQ plot');
```

The result is a figure with the singular value spectrum and the QQ plot:

In this example, two of the singular values can be considered “good”, and should be retained. The others can, according to this criterion, be dropped.

The function can take into account nuisance variables (provided in the 2nd argument), including an intercept (for mean-centering); it also allows normalisation of the columns to unit variance, and can operate on p-values (upper tail of the density function) or on the cumulative probabilities. See the help text inside the function for usage information.

- Bai ZD. **Methodologies in spectral analysis of large dimensional random matrices, a review.** *Statistica Sinica*. 1999;9(3):611–77.
- Cattell RB. **The scree test for the number of factors.** *Multivariate Behavioral Research*. 1966;1(2):245–276.
- Gavish M, Donoho DL. **The optimal hard threshold for singular values is 4/sqrt(3).** *arXiv:1305.5870*. 2014.
- Johnstone IM. **On the distribution of the largest eigenvalue in principal components analysis.** *The Annals of Statistics*. 2001;29(2):295–327.
- Marčenko VA, Pastur LA. **Distribution of eigenvalues for some sets of random matrices.** *Math USSR Sb*. 1967;1(4):457–83.
- Wachter KW. **Probability plotting points for principal components.** In: *Proceedings of the Ninth Interface Symposium on Computer Science and Statistics.* Harvard University and Massachusetts Institute of Technology: Prindle, Weber & Schmidt; 1976. p. 299–308.

*The image at the top is of the Drei Zinnen, in the Italian Alps during the Summer, in which a steep slope scattered with small stones (scree) is visible. Photo by Heinz Melion from Pixabay.*

Consider a general linear model (GLM) expressed as:

$$\mathbf{Y} = \mathbf{X}\boldsymbol{\beta} + \mathbf{Z}\boldsymbol{\gamma} + \boldsymbol{\epsilon}$$

where $\mathbf{Y}$ is a matrix of observed variables, $\mathbf{X}$ is a matrix of predictors of interest, $\mathbf{Z}$ is a matrix of covariates (of no interest), and $\boldsymbol{\epsilon}$ is a matrix of the same size as $\mathbf{Y}$ with the residuals.

Because the interest is in testing the relationship between $\mathbf{Y}$ and $\mathbf{X}$, in principle it would be these that would need to be permuted, but doing so also breaks the relationship with $\mathbf{Z}$, which would be undesirable. Over the years, many methods have been proposed. A review can be found in Winkler et al. (2014); other previous work includes the papers by Anderson and Legendre (1999) and Anderson and Robinson (2001).

One of these various methods is the one published in Freedman and Lane (1983), which consists of permuting data that have been residualised with respect to the covariates, adding the estimated covariate effects back, then fitting the full model again. The procedure can be performed through the following steps:

- Regress $\mathbf{Y}$ against the full model that contains both the effects of interest and the nuisance variables, i.e., $\mathbf{Y} = \mathbf{X}\boldsymbol{\beta} + \mathbf{Z}\boldsymbol{\gamma} + \boldsymbol{\epsilon}$. Use the estimated parameters $\hat{\boldsymbol{\beta}}$ to compute the statistic of interest, and call this statistic $T_0$.
- Regress $\mathbf{Y}$ against a reduced model that contains only the covariates, i.e., $\mathbf{Y} = \mathbf{Z}\boldsymbol{\gamma} + \boldsymbol{\epsilon}_{\mathbf{Z}}$, obtaining estimated parameters $\hat{\boldsymbol{\gamma}}$ and estimated residuals $\hat{\boldsymbol{\epsilon}}_{\mathbf{Z}}$.
- Compute a set of permuted data $\mathbf{Y}_j^*$. This is done by pre-multiplying the residuals from the reduced model produced in the previous step, $\hat{\boldsymbol{\epsilon}}_{\mathbf{Z}}$, by a permutation matrix, $\mathbf{P}_j$, then adding back the estimated nuisance effects, i.e., $\mathbf{Y}_j^* = \mathbf{P}_j\hat{\boldsymbol{\epsilon}}_{\mathbf{Z}} + \mathbf{Z}\hat{\boldsymbol{\gamma}}$.
- Regress the permuted data $\mathbf{Y}_j^*$ against the full model, i.e., $\mathbf{Y}_j^* = \mathbf{X}\boldsymbol{\beta} + \mathbf{Z}\boldsymbol{\gamma} + \boldsymbol{\epsilon}$.
- Use the estimated $\hat{\boldsymbol{\beta}}_j^*$ to compute the statistic of interest. Call this statistic $T_j^*$.
- Repeat Steps 3–5 many times to build the reference distribution of $T^*$ under the null hypothesis of no association between $\mathbf{Y}$ and $\mathbf{X}$.
- Count how many times $T_j^*$ was found to be equal to or larger than $T_0$, and divide the count by the number of permutations; the result is the p-value.
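The steps above can be sketched in a few lines of code. The following is a rough illustration with hypothetical simulated data (not from any of the cited papers), assuming a single regressor of interest, a univariate outcome, and a t statistic as the statistic of interest:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: Y is the outcome, X a single regressor of interest,
# Z an intercept plus one nuisance covariate.
n = 100
Z = np.column_stack([np.ones(n), rng.standard_normal(n)])
X = rng.standard_normal((n, 1))
Y = 0.5 * X[:, 0] + Z @ np.array([1.0, 0.3]) + rng.standard_normal(n)

def tstat(y, X, Z):
    """t statistic for the regressor of interest in the full model [X Z]."""
    M = np.column_stack([X, Z])
    beta = np.linalg.lstsq(M, y, rcond=None)[0]
    res = y - M @ beta
    sigma2 = res @ res / (len(y) - M.shape[1])
    covb = sigma2 * np.linalg.inv(M.T @ M)
    return beta[0] / np.sqrt(covb[0, 0])

# Step 1: statistic for the unpermuted data.
T0 = tstat(Y, X, Z)

# Step 2: fit the reduced model that contains only the covariates.
Hz = Z @ np.linalg.pinv(Z)      # hat matrix of the covariates
fitted_z = Hz @ Y               # estimated nuisance effects
resid_z = Y - fitted_z          # residuals of the reduced model

# Steps 3-5: permute the residuals, add the nuisance effects back,
# refit the full model, and compute the statistic.
nperm = 999
Tperm = np.empty(nperm)
for j in range(nperm):
    Yperm = fitted_z + rng.permutation(resid_z)
    Tperm[j] = tstat(Yperm, X, Z)

# Final step: p-value, counting the observed statistic as one permutation.
pval = (1 + np.sum(Tperm >= T0)) / (nperm + 1)
```

With the simulated effect present, the observed statistic should fall far in the tail of the permutation distribution, yielding a small p-value.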

Steps 1–4 can be written concisely as:

$$(\mathbf{P}_j\mathbf{R}_{\mathbf{Z}} + \mathbf{H}_{\mathbf{Z}})\mathbf{Y} = \mathbf{X}\boldsymbol{\beta} + \mathbf{Z}\boldsymbol{\gamma} + \boldsymbol{\epsilon}$$

where $\mathbf{P}_j$ is a permutation matrix (for the $j$-th permutation), $\mathbf{H}_{\mathbf{Z}} = \mathbf{Z}\mathbf{Z}^{+}$ is the hat matrix due to the covariates, and $\mathbf{R}_{\mathbf{Z}} = \mathbf{I} - \mathbf{H}_{\mathbf{Z}}$ is the residual-forming matrix; the superscript symbol $^{+}$ represents a matrix pseudo-inverse.

On page 385 of Winkler et al. (2014), my colleagues and I state that:

[…] adding the nuisance variables back in Step 3 is not strictly necessary, and the model can be expressed simply as $\mathbf{P}_j\mathbf{R}_{\mathbf{Z}}\mathbf{Y} = \mathbf{X}\boldsymbol{\beta} + \mathbf{Z}\boldsymbol{\gamma} + \boldsymbol{\epsilon}$, implying that the permutations can actually be performed just by permuting the rows of the residual-forming matrix $\mathbf{R}_{\mathbf{Z}}$.

However, in the paper we do not offer any proof of this important result, which allows algorithmic acceleration. Here we remedy that. Let's start with two brief lemmata:

**Lemma 1:** The product of a hat matrix and its corresponding residual-forming matrix is zero, that is, $\mathbf{H}_{\mathbf{Z}}\mathbf{R}_{\mathbf{Z}} = \mathbf{R}_{\mathbf{Z}}\mathbf{H}_{\mathbf{Z}} = \mathbf{0}$.

This is because $\mathbf{H}_{\mathbf{Z}}\mathbf{R}_{\mathbf{Z}} = \mathbf{H}_{\mathbf{Z}}(\mathbf{I} - \mathbf{H}_{\mathbf{Z}}) = \mathbf{H}_{\mathbf{Z}} - \mathbf{H}_{\mathbf{Z}}\mathbf{H}_{\mathbf{Z}} = \mathbf{H}_{\mathbf{Z}} - \mathbf{H}_{\mathbf{Z}} = \mathbf{0}$, since $\mathbf{H}_{\mathbf{Z}}$ is idempotent; the same holds for $\mathbf{R}_{\mathbf{Z}}\mathbf{H}_{\mathbf{Z}}$.

**Lemma 2 (Frisch–Waugh–Lovell theorem):** Given a GLM expressed as $\mathbf{Y} = \mathbf{X}\boldsymbol{\beta} + \mathbf{Z}\boldsymbol{\gamma} + \boldsymbol{\epsilon}$, we can estimate $\boldsymbol{\beta}$ from the equivalent GLM written as $\mathbf{R}_{\mathbf{Z}}\mathbf{Y} = \mathbf{R}_{\mathbf{Z}}\mathbf{X}\boldsymbol{\beta} + \boldsymbol{\epsilon}$.

To see why, remember that multiplying both sides of an equation by the same factor does not change it (least-squares estimates of other terms may in general change; the transformations used here do not affect the estimation of $\boldsymbol{\beta}$). Let's start by pre-multiplying the model by $\mathbf{R}_{\mathbf{Z}}$:

$$\mathbf{R}_{\mathbf{Z}}\mathbf{Y} = \mathbf{R}_{\mathbf{Z}}(\mathbf{X}\boldsymbol{\beta} + \mathbf{Z}\boldsymbol{\gamma} + \boldsymbol{\epsilon})$$

Then remove the parentheses:

$$\mathbf{R}_{\mathbf{Z}}\mathbf{Y} = \mathbf{R}_{\mathbf{Z}}\mathbf{X}\boldsymbol{\beta} + \mathbf{R}_{\mathbf{Z}}\mathbf{Z}\boldsymbol{\gamma} + \mathbf{R}_{\mathbf{Z}}\boldsymbol{\epsilon}$$

Since $\mathbf{R}_{\mathbf{Z}}\mathbf{Z} = (\mathbf{I} - \mathbf{Z}\mathbf{Z}^{+})\mathbf{Z} = \mathbf{0}$:

$$\mathbf{R}_{\mathbf{Z}}\mathbf{Y} = \mathbf{R}_{\mathbf{Z}}\mathbf{X}\boldsymbol{\beta} + \mathbf{R}_{\mathbf{Z}}\boldsymbol{\epsilon}$$

and, since the residuals are already orthogonal to $\mathbf{Z}$, so that $\mathbf{R}_{\mathbf{Z}}\boldsymbol{\epsilon} = \boldsymbol{\epsilon}$:

$$\mathbf{R}_{\mathbf{Z}}\mathbf{Y} = \mathbf{R}_{\mathbf{Z}}\mathbf{X}\boldsymbol{\beta} + \boldsymbol{\epsilon}$$

which is the equivalent model

$$\mathbf{R}_{\mathbf{Z}}\mathbf{Y} = \tilde{\mathbf{X}}\boldsymbol{\beta} + \boldsymbol{\epsilon}$$

where $\tilde{\mathbf{X}} = \mathbf{R}_{\mathbf{Z}}\mathbf{X}$.

**Main result**

Now we are ready for the main result. The Freedman–Lane model is:

$$(\mathbf{P}_j\mathbf{R}_{\mathbf{Z}} + \mathbf{H}_{\mathbf{Z}})\mathbf{Y} = \mathbf{X}\boldsymbol{\beta} + \mathbf{Z}\boldsymbol{\gamma} + \boldsymbol{\epsilon}$$

Per Lemma 2, it can be rewritten as:

$$\mathbf{R}_{\mathbf{Z}}(\mathbf{P}_j\mathbf{R}_{\mathbf{Z}} + \mathbf{H}_{\mathbf{Z}})\mathbf{Y} = \mathbf{R}_{\mathbf{Z}}\mathbf{X}\boldsymbol{\beta} + \boldsymbol{\epsilon}$$

Dropping the parentheses:

$$\mathbf{R}_{\mathbf{Z}}\mathbf{P}_j\mathbf{R}_{\mathbf{Z}}\mathbf{Y} + \mathbf{R}_{\mathbf{Z}}\mathbf{H}_{\mathbf{Z}}\mathbf{Y} = \mathbf{R}_{\mathbf{Z}}\mathbf{X}\boldsymbol{\beta} + \boldsymbol{\epsilon}$$

Per Lemma 1, $\mathbf{R}_{\mathbf{Z}}\mathbf{H}_{\mathbf{Z}} = \mathbf{0}$, hence:

$$\mathbf{R}_{\mathbf{Z}}\mathbf{P}_j\mathbf{R}_{\mathbf{Z}}\mathbf{Y} = \mathbf{R}_{\mathbf{Z}}\mathbf{X}\boldsymbol{\beta} + \boldsymbol{\epsilon}$$

What is left has the same form as the result of Lemma 2. Thus, reversing it, we obtain the final result:

$$\mathbf{P}_j\mathbf{R}_{\mathbf{Z}}\mathbf{Y} = \mathbf{X}\boldsymbol{\beta} + \mathbf{Z}\boldsymbol{\gamma} + \boldsymbol{\epsilon}$$

Hence, the hat matrix $\mathbf{H}_{\mathbf{Z}}$ cancels out: adding the estimated nuisance effects back is not necessary, and the results are the same either way.
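The equivalence can also be checked numerically. Below is a small sketch with random data (assuming the full model has full column rank), using a t statistic as the statistic of interest:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 50
Z = np.column_stack([np.ones(n), rng.standard_normal((n, 2))])
X = rng.standard_normal((n, 1))
Y = rng.standard_normal(n)

Hz = Z @ np.linalg.pinv(Z)              # hat matrix of the covariates
Rz = np.eye(n) - Hz                     # residual-forming matrix
P = np.eye(n)[rng.permutation(n)]       # a random permutation matrix

def tstat(y):
    """t statistic for the effect of X in the full model [X Z]."""
    M = np.column_stack([X, Z])
    beta = np.linalg.lstsq(M, y, rcond=None)[0]
    res = y - M @ beta
    covb = (res @ res / (n - M.shape[1])) * np.linalg.inv(M.T @ M)
    return beta[0] / np.sqrt(covb[0, 0])

t_with = tstat((P @ Rz + Hz) @ Y)   # nuisance effects added back
t_without = tstat(P @ Rz @ Y)       # nuisance effects dropped
```

The two statistics agree to numerical precision, because the difference between the two permuted datasets lies entirely in the column space of the full model.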

- Anderson MJ, Legendre P. **An empirical comparison of permutation methods for tests of partial regression coefficients in a linear model.** *Journal of Statistical Computation and Simulation*. 1999;62(3):271–303.
- Anderson MJ, Robinson J. **Permutation tests for linear models.** *Australian & New Zealand Journal of Statistics*. 2001;43(1):75–88.
- Freedman D, Lane D. **A nonstochastic interpretation of reported significance levels.** *Journal of Business & Economic Statistics*. 1983;1(4):292.
- Winkler AM, Ridgway GR, Webster MA, Smith SM, Nichols TE. **Permutation inference for the general linear model.** *NeuroImage*. 2014;92:381–97.

**Updates**

**07.Jun.2020:** A graphical representation of Lemma 1 can be found in p. 40 of Jaromil Frossard's PhD thesis (Université de Genève), available **here**.

**07.Jun.2020:** Lemma 2 corresponds to the Frisch–Waugh–Lovell theorem. Thanks to Samuel Davenport (University of Oxford) for pointing this out.

We often think of statistics as a way to summarize large amounts of data. For example, we can collect data from thousands of subjects and extract a single number that tells something about these subjects. The well known German tank problem shows that, in a certain way, statistics can also be used for the opposite: using incomplete data and a few reasonable assumptions (or real knowledge), statistics provides a way to estimate information that offers a panoramic view of *all* the data. Historical problems are interesting in their own right. Yet, it is not often that we see such clearly consequential historical events at the time they happen — like now.

In the Second World War, as in any other war, information could be more valuable than anything else. Intelligence reports (such as from spies) would feed the Allies with information about the industrial capacity of Nazi Germany, including details about things such as the number of tanks produced. This kind of information can have far-reaching impact and not only determine the outcome of a battle, but also whether a battle would even happen or with what preparations, as the prospect of facing a militarily superior opponent is often a great deterrent.

Sometimes, German tanks, such as the well known Panzer, could be captured and carefully inspected. Among the details noted were the serial numbers printed on various parts, such as chassis and gearboxes, and the serial numbers of the moulds used to produce the wheels. With the serial number of even a single chassis, for example, one can estimate the total number of tanks produced; knowing the serial number of a single wheel mould allows the estimation of the total number of moulds, and thus how many wheels can be produced in a certain amount of time. But how?

If serial numbers are indeed serial, e.g., $1, 2, 3, \ldots, N$, growing uniformly and without gaps, and we see a tank that has serial number $m$, then clearly at least $m$ tanks must have been produced. But could we have a better guess?

Let's start by reversing the problem: suppose we knew $N$. In that case, what would be the average value of the serial numbers of all tanks? The average for uniformly distributed data like this would be $\bar{x} = \frac{1+N}{2}$, that is, the average of the first and last serial numbers.

Now, say we have only one sighting of a tank, and that it has serial number $m$. Then our best guess for the average serial number is $m$ itself, as we have no additional information. Thus, with $\bar{x} = m$, our guess would be $\hat{N} = 2m - 1$ (that is, reorganizing the terms of the previous equation and solving for $N$). Note that, for one sighting, this formula guarantees that $\hat{N}$ is larger than or equal to $m$, which makes sense: we cannot have an estimate for $N$ that is smaller than the serial number itself.

What if we had not just one, but multiple sightings? Call the number of sightings $k$. The mean is now $\bar{x} = \frac{1}{k}\sum_{i=1}^{k} m_i$, for ordered serial numbers $m_1 \leqslant m_2 \leqslant \cdots \leqslant m_k$. Clearly, we can't use the same formula, because if $\bar{x}$ is much smaller than $m_k$ (say, because we have seen many small serial numbers, but just a handful of larger ones), $N$ could incorrectly be estimated as less than $m_k$, which makes no sense. At least $m_k$ tanks must exist.

While incorrect for multiple sightings, the above formula gives invaluable insight: it shows that for such uniformly distributed data, approximately half of the tanks have serial numbers above $\bar{x}$, the other half below it. Extending the idea, and still under the assumption that the serial numbers are uniform, we can conclude that the number of tanks below the lowest sighted serial number $m_1$ (which is $m_1 - 1$) must be approximately the same as the (unknown) number of tanks above the highest sighted serial number $m_k$. So, a next better estimate could be to use $\hat{N} = m_k + m_1 - 1$.

We can still do better, though. Since we have $k$ sightings, we can estimate the average interval between sighted serial numbers, i.e., $\frac{m_k - k}{k}$. As it is based on all sightings, this gives a better estimate of the spacing between serial numbers than the single sighting $m_1$. The result can be added to $m_k$. The final estimate then becomes $\hat{N} = m_k + \frac{m_k - k}{k} = m_k\left(1 + \frac{1}{k}\right) - 1$.

To make this concrete, say we saw $k$ tanks, the highest serial number among them being $m_k$. Then our best guess would be $\hat{N} = m_k\left(1 + \frac{1}{k}\right) - 1$.
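As a worked illustration (the specific serial numbers of the original example are not preserved in this copy of the text, so the ones below are made up):

```python
# Hypothetical sightings of tank serial numbers, for illustration only.
sightings = [19, 40, 68, 73, 110]

k = len(sightings)            # number of sightings
m_k = max(sightings)          # highest serial number seen

# Estimate: highest serial number plus the average gap between sightings,
# i.e., N_hat = m_k + (m_k - k)/k, equivalently m_k * (1 + 1/k) - 1.
n_hat = m_k + (m_k - k) / k
print(n_hat)                  # prints 131.0
```

Here the highest sighted serial number is 110, the average gap is $(110-5)/5 = 21$, so the estimate is 131 tanks.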

At the end of the war, estimates obtained using the above method proved remarkably accurate, much more so than information provided by spies and other intelligence reports.

Let's now see a similar example that is contemporary to us. Take the current pandemic caused by a novel coronavirus. The World Health Organization stated officially, on 14 January 2020, when there were 41 cases officially reported in China, that there was no evidence for human-to-human transmission. Yet, when the first 3 cases outside China were confirmed on 16 January 2020, epidemiologists at Imperial College London were quick to find out that the WHO statement could not have been true. Rather, the real number of cases was likely well above 1700.

How did they make that estimate? The key insight was the realisation that only a small number of people in any major city travel internationally, particularly within a time span as short as that given by the time until the onset of symptoms for this kind of respiratory disease. If one can estimate the prevalence among those who travelled, that would be a good approximation to the prevalence among those who live in the city, assuming that those who travel are an unbiased sample of the population.

Following this idea, we have:

$$\frac{x}{T} \approx \frac{c}{P}$$

that is, the number of cases among those who travelled ($x$) divided by the total number of people who travelled ($T$) is expected to be approximately the same as the number of cases among those who stayed ($c$) divided by the total number of people who stayed (live) in the city ($P$).

The number of people served by the international airport of Wuhan is about 19 million (the size of the Wuhan metropolitan area), and the average daily number of outbound international passengers in previous years was 3301 for that time of the year (a figure publicly known, from IATA). Unfortunately, little was known outside China about the time taken between exposure to the virus and the onset of symptoms. The researchers then resorted to a proxy: the time known for the related severe respiratory disorder known as MERS, also caused by a coronavirus, which is about 10 days. Thus, we can estimate $T = 3301 \times 10 = 33010$ people travelling out, with $P = 19000000$ staying in the city. The number of known international cases at the time was $x = 3$. Hence:

$$c \approx \frac{x}{T} \times P = \frac{3}{33010} \times 19000000 \approx 1727 \text{ cases}$$
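The arithmetic can be reproduced in a couple of lines (the symbol names are mine):

```python
# Figures quoted in the text:
x = 3                 # confirmed cases among international travellers
T = 3301 * 10         # travellers over the assumed 10-day window
P = 19_000_000        # population served by the airport

# From x/T ~ c/P, the expected number of cases among those who stayed:
c = x * P / T
print(round(c))       # prints 1727
```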

So, using remarkably simple maths, simpler even than in our WWII German tank example, the scientists estimated that the number of actual cases in the city of Wuhan was likely far above the official figure of 41. The researchers were careful to indicate that, should the probability of travelling be higher among those exposed, the number of actual cases could be smaller. The converse is also true: should travellers be wealthier (and thus less likely to be exposed to a possible zoonosis, as initially reported), the number of actual cases could be even higher.

Importantly, it is not at all likely that 1700 people would have contracted such a zoonosis from wild animals in a dense urban area like Wuhan, hence human-to-human transmission *must* have been occurring. Eventually the WHO confirmed human-to-human transmission on 19 January 2020. Two days later, Chinese authorities began locking down and sealing off Wuhan, thus putting into place a plan to curb the transmission.

To find out more about the original problem of the number of tanks, and also for other methods of estimation for the same problem, a good start is this article. Also invaluable, for various estimation problems related to the fast dissemination of the novel coronavirus, are all the reports by the epidemiology team at the Imperial College London, which can be found here.

Motivated by this perceived difficulty in the interpretation of results, **Stewart and Love (1968)** proposed the computation of what has been termed a **redundancy index**. It works as follows.

Let $\mathbf{Y}$ and $\mathbf{X}$ be two sets of variables over which CCA is computed. We find *canonical coefficients* $\mathbf{A}$ and $\mathbf{B}$, such that the *canonical variables* $\mathbf{U} = \mathbf{Y}\mathbf{A}$ and $\mathbf{V} = \mathbf{X}\mathbf{B}$ have a maximal, diagonal correlation structure; this diagonal contains the ordered *canonical correlations* $r_1 \geqslant r_2 \geqslant \cdots \geqslant r_k$.

Now that CCA has been computed, we can find the correlations between the original variables and the canonical variables. Let $\tilde{\mathbf{A}}$ and $\tilde{\mathbf{B}}$ be such correlations, which are termed *canonical loadings* or *structure coefficients*. Now compute the mean square of each of the columns of $\tilde{\mathbf{A}}$ and $\tilde{\mathbf{B}}$. These represent the variance extracted by the corresponding canonical variable. That is:

- Variance extracted by canonical variable $\mathbf{u}_j$: $V(\mathbf{u}_j) = \frac{1}{p}\sum_{i=1}^{p}\tilde{a}_{ij}^2$, where $p$ is the number of variables in $\mathbf{Y}$.
- Variance extracted by canonical variable $\mathbf{v}_j$: $V(\mathbf{v}_j) = \frac{1}{q}\sum_{i=1}^{q}\tilde{b}_{ij}^2$, where $q$ is the number of variables in $\mathbf{X}$.

These quantities represent the mean variance extracted from the original variables by each of the canonical variables (in each side).

Compute now the proportion of variance of one canonical variable (say, $\mathbf{u}_j$) explained by the corresponding canonical variable in the other side (say, $\mathbf{v}_j$). This is given simply by $r_j^2$, the usual coefficient of determination.

The redundancy index for each canonical variable is then the product $V(\mathbf{u}_j)\,r_j^2$ for the left side of CCA, and $V(\mathbf{v}_j)\,r_j^2$ for the right side. That is, the index is not symmetric. It measures the proportion of variance in one of the two sets of variables explained by the correlation between the $j$-th pair of canonical variables.

The sum of the redundancies for all canonical variables on one side or the other forms a global redundancy metric, which indicates the proportion of the variance in a given side explained by the variance in the other.
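The computation above can be sketched in a few lines. The following is a rough Python illustration with hypothetical data (not the **redundancy.m** script mentioned below), computing CCA via QR and SVD and then the redundancy indices for the left set:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical data: two sets of variables sharing some structure.
n = 200
Y = rng.standard_normal((n, 4))
X = Y[:, :3] @ rng.standard_normal((3, 5)) + rng.standard_normal((n, 5))

# Centre and standardise the columns.
Yc = (Y - Y.mean(0)) / Y.std(0, ddof=1)
Xc = (X - X.mean(0)) / X.std(0, ddof=1)

def cca(A, B):
    """CCA via QR + SVD; returns canonical variables and correlations."""
    Qa, Ra = np.linalg.qr(A)
    Qb, Rb = np.linalg.qr(B)
    Usv, s, Vt = np.linalg.svd(Qa.T @ Qb)
    U = A @ np.linalg.solve(Ra, Usv)     # left canonical variables
    V = B @ np.linalg.solve(Rb, Vt.T)    # right canonical variables
    return U, V, s

U, V, r = cca(Yc, Xc)
k = len(r)

# Canonical loadings: correlations between the original (standardised)
# variables and the canonical variables on the same side.
loadY = np.array([[np.corrcoef(Yc[:, i], U[:, j])[0, 1] for j in range(k)]
                  for i in range(Yc.shape[1])])

# Variance extracted by each left canonical variable, then the redundancy.
var_extracted = (loadY ** 2).mean(axis=0)
redundancy = var_extracted * r ** 2

# Global redundancy and proportion of total redundancy.
total_redundancy = redundancy.sum()
proportion = redundancy / total_redundancy
```

The same computation with `loadX` and `V` gives the redundancies for the right side, which in general differ, reflecting the asymmetry of the index.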

This global redundancy can be scaled to unity, such that the redundancies for each of the canonical variables in a given side can be interpreted as the proportion of total redundancy.

If you follow the original paper by Stewart and Love (1968), the variances extracted by the canonical variables are in column III of Table 2, the redundancy of each canonical variable for each side corresponds to column IV, and the proportion of total redundancy is in column V.

Another reference on the same topic that is worth looking at is Muller (1981). In it, the author argues that redundancy analysis sits somewhere between CCA itself (fully symmetric) and multiple regression (fully asymmetric).

- Hotelling H. **Relations between two sets of variates.** *Biometrika*. 1936;28(3/4):321–77.
- Muller KE. **Relationships between redundancy analysis, canonical correlation, and multivariate regression.** *Psychometrika*. 1981;46(2):139–42.
- Stewart D, Love W. **A general canonical correlation index.** *Psychological Bulletin*. 1968;70(3, Pt.1):160–3. (unfortunately, the paper is paywalled; write to APA to complain).

**Update**

**28.Jun.2020:** A script that computes the redundancy indices is available here: **redundancy.m** (work with Thomas Wassenaar, University of Oxford).

**PALM** — Permutation Analysis of Linear Models — uses either MATLAB or Octave behind the scenes. It can be executed from within either of these environments, or from the shell, in which case either of these is invoked, depending on how PALM was configured.

For users who do not want to or cannot spend thousands of dollars on MATLAB licenses, Octave comes for free, and offers much the same benefits. However, some functionalities in PALM require Octave version 4.4.1 or higher, whereas stable Linux distributions such as Red Hat Enterprise Linux and related ones (such as CentOS and Scientific Linux) still offer only 3.8.2 in the official repositories at the time of this writing, leaving the user with the task of finding an unofficial package or compiling from source. The latter can be daunting: Octave is notoriously difficult to compile, with a myriad of dependencies.

A much simpler approach is to use Flatpak or Snappy. These are systems for distribution of Linux applications. Snappy is sponsored by Canonical (that maintains Ubuntu), whereas Flatpak appears to be the preferred tool for Fedora and openSUSE. Using either system is quite simple, and here the focus is on **Flatpak**.

To have a working installation of Octave, all that needs be done is:

**1) Make sure Flatpak is installed:**

On a RHEL/CentOS system, use (as root):

yum install flatpak

For openSUSE, use (as root):

zypper install flatpak

For Ubuntu and other Debian-based systems:

sudo apt install flatpak

**2) Add the Flathub repository:**

flatpak remote-add --if-not-exists flathub https://flathub.org/repo/flathub.flatpakrepo

**3) Install Octave:**

flatpak install flathub org.octave.Octave

**4) Run it!**

flatpak run org.octave.Octave

Only the installation of Flatpak needs to be done as root. Once it has been installed, repositories and applications (such as Octave, among many others) can be installed at the user level. These can also be installed and made available system-wide (if run as root).

From version alpha117 onwards, the executable file ‘palm’ (not to be confused with ‘palm.m’) will include a variable named “OCTAVEBIN”, which specifies how Octave should be called. Change it from the default:

OCTAVEBIN=/usr/bin/octave

so that it invokes the version installed with Flatpak:

OCTAVEBIN="/usr/bin/flatpak run org.octave.Octave"

After making the above edits, it should be possible to run PALM directly from the system shell using the version installed via Flatpak. Alternatively, it should also be possible to invoke Octave (as in step 4 above), then use the command “addpath” to specify the location of palm.m, and then call PALM from the Octave prompt.

Handling of packages when Octave is installed via Flatpak is the same as usual, that is, via the command ‘pkg’ run from within Octave. More details here.

**NiDB** is a light, powerful, and simple to use neuroimaging database. One of its main strengths is that it was developed using a stack of stable and proven technologies: Linux, Apache, MySQL/MariaDB, PHP, and Perl. None of these technologies are new, and the fact that they have been around for so many years means that there is a lot of documentation and literature available, as well as a myriad of libraries (for PHP and Perl) that can do virtually anything. Although both PHP and Perl have received some degree of criticism (not unreasonably), and in some cases are being replaced by tools such as Node.js and Python, the volume of information about them means it is easy to find solutions when problems appear.

This article covers installation steps for either CentOS or RHEL 7, but similar steps should work with other distributions since the overall strategy is the same. By separating out each of the steps, as opposed to doing all the configuration and installation as a single script, it becomes easier to adapt to different systems, and to identify and correct problems that may arise due to local particularities. The steps below are derived from the scripts `setup-centos7.sh` and `setup-ubuntu16.sh` that are available in the NiDB repository, although the scripts themselves will not be used here. Note that the instructions below are not "official"; for the latter, consult the NiDB documentation. The intent of this article is to facilitate the process and mitigate some frustration you may feel if trying to do it all by yourself. Also, by looking at the installation steps, you should be able to have a broad overview of the pieces that constitute the database.

If installing CentOS from the minimal DVD, choose a "Minimal Install" and leave the addition of the desktop for the next step.

This is a good time to install the most recent updates and patches, and reboot if the updates include a new kernel:

yum update
/sbin/reboot

While not strictly necessary, having a graphical interface for a web-based application will be handy. Install your favourite desktop, and a VNC server if you intend to manage the system remotely. For a lightweight desktop, consider MATE:

First add the EPEL repository. Depending on what you already have configured, use either:

yum install epel-release

or:

yum install https://dl.fedoraproject.org/pub/epel/epel-release-latest-7.noarch.rpm

Then:

yum groupinstall "MATE Desktop"
systemctl set-default graphical.target
systemctl isolate graphical.target
systemctl enable lightdm
systemctl start lightdm

For VNC, there are various options available. Consider, for example, TurboVNC.

These will help when entering the commands later.

# Directory where NiDB will be installed:
NIDBROOT=/nidb
# Directory of the webpages and PHP files:
WWWROOT=/var/www/html
# Linux username under which NiDB will run:
NIDBUSER=nidb
# MySQL/MariaDB root password:
MYSQLROOTPASS=[YOUR_PASSWORD_HERE]
# MySQL/MariaDB username that will have access to the database, and associated password:
MYSQLUSER=nidb
MYSQLPASS=[YOUR_PASSWORD_HERE]

These variables are only used during the installation, and all the steps here are done as root. Consider clearing your shell history at the end, so as not to have your passwords stored there.

This is the user that will run the processes related to the database. It is not necessary that this user has administrative privileges on the system, and from a security perspective, it is better if not.

useradd -m ${NIDBUSER}
passwd ${NIDBUSER} # choose a sensible password

Add the repository for a more recent version, then install:

yum install httpd

Configure it to run as the `${NIDBUSER}` user:

sed -i "s/User apache/User ${NIDBUSER}/" /etc/httpd/conf/httpd.conf
sed -i "s/Group apache/Group ${NIDBUSER}/" /etc/httpd/conf/httpd.conf

Enable it at boot, and also start it now:

systemctl enable httpd.service
systemctl start httpd.service

Open the relevant ports in the firewall, then reload the rules:

firewall-cmd --permanent --add-port=80/tcp
firewall-cmd --permanent --add-port=443/tcp
firewall-cmd --reload

For MariaDB 10.2, the repository can be added to `/etc/yum.repos.d/` as:

echo "[mariadb]
name = MariaDB
baseurl = http://yum.mariadb.org/10.2/centos7-amd64
gpgkey = https://yum.mariadb.org/RPM-GPG-KEY-MariaDB
gpgcheck = 1" >> /etc/yum.repos.d/MariaDB.repo

For other versions or distributions, visit this address. Then do the actual installation:

yum install MariaDB-server MariaDB-client

Enable it at boot and start now too:

systemctl enable mariadb.service
systemctl start mariadb.service

Secure the MySQL/MariaDB installation:

mysql_secure_installation

Pay attention to the questions on the root password and set it here to what was chosen in the `${MYSQLROOTPASS}` variable. Make sure your database is secure.

First add the repositories for PHP 7.2:

yum install http://rpms.remirepo.net/enterprise/remi-release-7.rpm
yum install yum-utils
yum-config-manager --enable remi-php72
yum install php php-mysql php-gd php-process php-pear php-mcrypt php-mbstring

Install some additional PHP packages:

pear install Mail
pear install Mail_Mime
pear install Net_SMTP

Edit the PHP configuration:

sed -i 's/^short_open_tag = .*/short_open_tag = On/g' /etc/php.ini
sed -i 's/^session.gc_maxlifetime = .*/session.gc_maxlifetime = 28800/g' /etc/php.ini
sed -i 's/^memory_limit = .*/memory_limit = 5000M/g' /etc/php.ini
sed -i "s|^upload_tmp_dir = .*|upload_tmp_dir = ${NIDBROOT}/uploadtmp|g" /etc/php.ini
sed -i 's/^upload_max_filesize = .*/upload_max_filesize = 5000M/g' /etc/php.ini
sed -i 's/^max_file_uploads = .*/max_file_uploads = 1000/g' /etc/php.ini
sed -i 's/^max_input_time = .*/max_input_time = 600/g' /etc/php.ini
sed -i 's/^max_execution_time = .*/max_execution_time = 600/g' /etc/php.ini
sed -i 's/^post_max_size = .*/post_max_size = 5000M/g' /etc/php.ini
sed -i 's/^display_errors = .*/display_errors = On/g' /etc/php.ini
sed -i 's/^error_reporting = .*/error_reporting = E_ALL \& \~E_DEPRECATED \& \~E_STRICT \& \~E_NOTICE/' /etc/php.ini

Also, edit `/etc/php.ini` to make sure your timezone is correct, for example:

date.timezone = America/New_York

For a list of time zones, **see here**. Finally:

chown -R ${NIDBUSER}:${NIDBUSER} /var/lib/php/session

These are all in the main repositories already added so you should be able to simply run:

yum install perl* cpan git gcc gcc-c++ java ImageMagick vim libpng12 libmng wget iptraf* pv

Install also various Perl packages from CPAN. The first time you run `cpan`, various configuration questions will be asked; it is safe to accept the default answers for all:

cpan File::Path
cpan Net::SMTP::TLS
cpan List::Util
cpan Date::Parse
cpan Image::ExifTool
cpan String::CRC32
cpan Date::Manip
cpan Sort::Naturally
cpan Digest::MD5
cpan Digest::MD5::File
cpan Statistics::Basic
cpan Email::Send::SMTP::Gmail
cpan Math::Derivative

Then put these into a place where NiDB can find them:

mkdir /usr/local/lib64/perl5
cp -rv /root/perl5/lib/perl5/* /usr/local/lib64/perl5/

Disabling SELinux is not strictly necessary provided that you ensure that all processes related to NiDB (webserver, database server), and all its files, belong to the same user, `nidb`, and that file access policies are set correctly. In any case, you may find it useful to disable it so as to stop receiving too many irrelevant warnings during the installation. You can enable it again later.

sed -i 's/^SELINUX=.*/SELINUX=disabled/g' /etc/selinux/config
setenforce 0

Note that enabling or disabling SELinux requires a reboot to take effect (it is not enough to restart a daemon; in fact, there is none to restart).

FSL functions are used by various internal scripts. After the installation, make sure the environment variable `FSLDIR` exists and points to the correct location (typically `/usr/local/fsl`, but it can be different if you installed it elsewhere). This variable is used below when defining the `crontab` jobs.

FSLDIR=/usr/local/fsl

The official GitHub repository is https://github.com/gbook/nidb. However, I have made a fork with a couple of changes that better adapt to the system I am working with. You can probably go either way.

mkdir -p ${NIDBROOT}
cd ${NIDBROOT}
mkdir -p archive backup dicomincoming deleted download ftp incoming problem programs/lock programs/logs uploadtmp uploaded
git clone https://github.com/andersonwinkler/nidb install
cd install
cp -Rv setup/Mysql* /usr/local/lib64/perl5/
cp -Rv programs/* ${NIDBROOT}/programs/
cp -Rv web/* ${WWWROOT}/
chown -R ${NIDBUSER}:${NIDBUSER} ${NIDBROOT}
chown -R ${NIDBUSER}:${NIDBUSER} ${WWWROOT}

Edit the file `${WWWROOT}/functions.php` and complete two pieces of configuration. Locate these two lines:

```
$cfg = LoadConfig();
date_default_timezone_set();
```

In the first parentheses, `()`, put what you get when you run:

```
echo "${NIDBROOT}/programs/nidb.cfg"
```

whereas in the second `()`, put what you get when you run:

```
timedatectl | grep "Time zone:" | awk '{print $3}'
```

For example, depending on your variables and time zone, the edited lines could look like this:

```
$cfg = LoadConfig("/nidb/programs/nidb.cfg");
date_default_timezone_set("America/New_York");
```

First, create the `nidb` user in MySQL/MariaDB. This is the only user (other than root) that will be able to do anything in the database:

```
mysql -uroot -p${MYSQLROOTPASS} -e "CREATE USER '${MYSQLUSER}'@'%' IDENTIFIED BY '${MYSQLPASS}'; GRANT ALL PRIVILEGES ON *.* TO '${MYSQLUSER}'@'%';"
```

Now create the NiDB database proper:

```
cd ${NIDBROOT}/install/setup
mysql -uroot -p${MYSQLROOTPASS} -e "CREATE DATABASE IF NOT EXISTS nidb; GRANT ALL ON *.* TO 'root'@'localhost' IDENTIFIED BY '${MYSQLROOTPASS}'; FLUSH PRIVILEGES;"
mysql -uroot -p${MYSQLROOTPASS} nidb < nidb.sql
mysql -uroot -p${MYSQLROOTPASS} nidb < nidb-data.sql
```

These jobs will take care of various automated input/output tasks.

```
cat <<EOC > ~/tempcron.txt
* * * * * cd ${NIDBROOT}/programs; perl parsedicom.pl > /dev/null 2>&1
* * * * * cd ${NIDBROOT}/programs; perl modulemanager.pl > /dev/null 2>&1
* * * * * cd ${NIDBROOT}/programs; perl pipeline.pl > /dev/null 2>&1
* * * * * cd ${NIDBROOT}/programs; perl datarequests.pl > /dev/null 2>&1
* * * * * cd ${NIDBROOT}/programs; perl fileio.pl > /dev/null 2>&1
* * * * * cd ${NIDBROOT}/programs; perl importuploaded.pl > /dev/null 2>&1
* * * * * cd ${NIDBROOT}/programs; perl qc.pl > /dev/null 2>&1
* * * * * FSLDIR=${FSLDIR}; PATH=${FSLDIR}/bin:${PATH}; . ${FSLDIR}/etc/fslconf/fsl.sh; export FSLDIR PATH; cd ${NIDBROOT}/programs; perl mriqa.pl > /dev/null 2>&1
@hourly find ${NIDBROOT}/programs/logs/*.log -mtime +4 -exec rm {} \;
@daily /usr/bin/mysqldump nidb -u root -p${MYSQLROOTPASS} | gzip > ${NIDBROOT}/backup/db-\$(date +%Y-%m-%d).sql.gz
@hourly /bin/find /tmp/* -mmin +120 -exec rm -rf {} \;
@daily find ${NIDBROOT}/ftp/* -mtime +7 -exec rm -rf {} \;
@daily find ${NIDBROOT}/tmp/* -mtime +7 -exec rm -rf {} \;
EOC
crontab -u ${NIDBUSER} ~/tempcron.txt && rm ~/tempcron.txt
```

The main configuration file, `${NIDBROOT}/programs/nidb.cfg`, should be edited to reflect your paths, usernames, and passwords. It is this file that contains the admin password for accessing NiDB. Use `${NIDBROOT}/programs/nidb.cfg.sample` as an example.

Once you have logged in as admin, you can also edit these settings from the web interface, in the menu Admin -> NiDB Settings.

It will likely increase your productivity when doing maintenance to have a friendly frontend for MySQL/MariaDB. Two popular choices are phpMyAdmin (web-based) and Oracle MySQL Workbench.

For phpMyAdmin:

```
wget https://www.phpmyadmin.net/downloads/phpMyAdmin-latest-english.zip
unzip phpMyAdmin-latest-english.zip
mv phpMyAdmin-*-english ${WWWROOT}/phpMyAdmin
chown -R ${NIDBUSER}:${NIDBUSER} ${WWWROOT}
chmod 755 ${WWWROOT}
cp ${WWWROOT}/phpMyAdmin/config.sample.inc.php ${WWWROOT}/phpMyAdmin/config.inc.php
```

For MySQL Workbench, the repositories are listed at this link:

```
wget http://dev.mysql.com/get/mysql57-community-release-el7-11.noarch.rpm
rpm -Uvh mysql57-community-release-el7-11.noarch.rpm
yum install mysql-workbench
```

However, at the time of this writing, the current version (6.3.10) crashes upon start. The solution is to downgrade:

```
yum install yum-plugin-versionlock
yum versionlock mysql-workbench-community-6.3.8-1.el7.*
yum install mysql-workbench-community
```

You should by now have a working installation of NiDB, accessible from your web browser at `http://localhost`. There are additional pieces you may consider configuring, such as a listener on one of the server ports to automatically receive DICOM files from the scanner as the images are collected, as well as other changes to the database schema and web interface. Now you have a starting point.

For more information on NiDB, see these two papers:

- Book GA, Anderson BM, Stevens MC, Glahn DC, Assaf M, Pearlson GD. Neuroinformatics database (NiDB) – a modular, portable database for the storage, analysis, and sharing of neuroimaging data. *Neuroinformatics*. 2013;11(4):495-505.
- Book GA, Stevens MC, Assaf M, Glahn DC, Pearlson GD. Neuroimaging data sharing on the neuroinformatics database platform. *Neuroimage*. 2016 Jan;124:1089-92.


Here we focus on the surface-based representation as that offers advantages over volume-based representations (Van Essen et al., 1998). Software such as **FreeSurfer** uses magnetic resonance images to initially construct the white surface. Once that surface has been produced, a copy of it can be offset outwards until tissue contrast in the magnetic resonance image is maximal, which indicates the location of the pial surface. This procedure ensures that both white and pial surfaces have the same topology, with each face and each vertex of the white surface having their matching pair in the pial. This convenience facilitates the computations indicated below.

For a triangular face of the surface representation, with vertex coordinates $\mathbf{a}$, $\mathbf{b}$, and $\mathbf{c}$, the area is $\frac{1}{2}\|\mathbf{u} \times \mathbf{v}\|$, where $\mathbf{u} = \mathbf{b} - \mathbf{a}$, $\mathbf{v} = \mathbf{c} - \mathbf{a}$, the symbol $\times$ represents the cross product, and the bars represent the vector norm. Even though such area per face (i.e., facewise) can be used in subsequent steps, most software packages can only deal with values assigned to each vertex (i.e., vertexwise). Conversion from facewise to vertexwise is achieved by assigning to each vertex one-third of the sum of the areas of all faces that meet at that vertex (Winkler et al., 2012).
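These two steps (facewise areas from the cross product, then the one-third rule) can be sketched as follows, using a toy closed tetrahedral mesh with hypothetical coordinates:

```python
import numpy as np

# Toy mesh: 4 vertices, 4 triangular faces (a closed tetrahedral surface).
# vtx: V x 3 vertex coordinates; fac: F x 3 vertex indices per face.
vtx = np.array([[0., 0., 0.],
                [1., 0., 0.],
                [0., 1., 0.],
                [0., 0., 1.]])
fac = np.array([[0, 1, 2],
                [0, 1, 3],
                [0, 2, 3],
                [1, 2, 3]])

# Facewise area: half the norm of the cross product of two edge vectors.
a, b, c = vtx[fac[:, 0]], vtx[fac[:, 1]], vtx[fac[:, 2]]
face_area = 0.5 * np.linalg.norm(np.cross(b - a, c - a), axis=1)

# Vertexwise area: each vertex receives one-third of the area of every
# face that meets at it (Winkler et al., 2012).
vtx_area = np.zeros(len(vtx))
for f, area in zip(fac, face_area):
    vtx_area[f] += area / 3.0

# Total area is preserved by the conversion.
print(np.isclose(vtx_area.sum(), face_area.sum()))  # True
```

Note that the conversion redistributes, but does not change, the total surface area.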

The thickness at each vertex is computed as the average of two distances (Fischl and Dale, 2000; Greve and Fischl, 2018): the first is the distance from each white surface vertex to the corresponding closest point on the pial surface (not necessarily at a pial vertex); the second is the distance from the corresponding pial vertex to the closest point on the white surface (again, not necessarily at a vertex). Other methods are possible, however; see the table below (adapted from Lerch and Evans, 2005):

Method | Reference
---|---
Distance solved using Laplace's equation. | Jones et al. (2000)
Distance between corresponding vertices. | MacDonald et al. (2000)
Distance to the nearest point in the other surface. | MacDonald et al. (2000)
Distance to the nearest point in the other surface, computed for both surfaces, then averaged. | Fischl and Dale (2000)
Distance along the normal. | MacDonald et al. (2000)
Distance along the iteratively computed normal. | Lerch and Evans (2005)
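As a rough sketch of the averaged-distance method of Fischl and Dale, the following simplifies the search to nearest *vertices* rather than the nearest point on the opposing surface (which need not be a vertex); the toy surfaces are illustrative:

```python
import numpy as np

def thickness(white_vtx, pial_vtx):
    """Average of white->pial and pial->white closest (vertex) distances,
    a simplified stand-in for the Fischl & Dale (2000) measure."""
    # Pairwise distances between the two vertex sets (N x N).
    d = np.linalg.norm(white_vtx[:, None, :] - pial_vtx[None, :, :], axis=2)
    w2p = d.min(axis=1)   # white vertex -> closest pial vertex
    p2w = d.min(axis=0)   # pial vertex  -> closest white vertex
    return (w2p + p2w) / 2.0

# Two flat, parallel toy "surfaces" 3 mm apart give thickness 3 everywhere.
white = np.array([[x, y, 0.] for x in range(3) for y in range(3)])
pial = white + [0., 0., 3.]
print(thickness(white, pial))  # 3.0 at every vertex
```

In real meshes the nearest point usually lies inside a face, so actual implementations project onto the triangles rather than snapping to vertices.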

If the area of either of these surfaces, or of a mid-surface (i.e., the surface running at the half-distance between pial and white surfaces) is known, an estimate of the volume can be obtained by multiplying, at each vertex, area by thickness. This procedure is still problematic in that it underestimates the volume of tissue that is external to the convexity of the surface, and overestimates the volume that is internal to it; both cases are undesirable, and cannot be solved by simply using an intermediate surface as the mid-surface.

In **Winkler et al. (2018)** we propose a different approach to measure volume. Instead of computing the product of thickness and area, we note that any pair of matching faces can be used to define an irregular polyhedron, of which the coordinates of all six vertices are known from the surface geometry. This polyhedron is an oblique truncated triangular pyramid, which can be perfectly divided into three irregular tetrahedra, which do not overlap, nor leave gaps.

From the coordinates of the vertices of these tetrahedra, their volumes can be computed analytically, then added together, viz.:

- For a given face in the white surface, and its corresponding face in the pial surface, define an oblique truncated triangular pyramid.
- Split this truncated pyramid into three non-overlapping tetrahedra.

- For each such tetrahedron, let $\mathbf{a}$, $\mathbf{b}$, $\mathbf{c}$, and $\mathbf{d}$ represent its four vertices in terms of coordinates $(x, y, z)$. Compute the volume as $\frac{1}{6}\left|\mathbf{u} \cdot (\mathbf{v} \times \mathbf{w})\right|$, where $\mathbf{u} = \mathbf{a} - \mathbf{d}$, $\mathbf{v} = \mathbf{b} - \mathbf{d}$, $\mathbf{w} = \mathbf{c} - \mathbf{d}$, the symbol $\times$ represents the cross product, $\cdot$ represents the dot product, and the bars represent the absolute value.
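The steps above can be sketched as follows. The particular split of the truncated pyramid into tetrahedra shown here is one valid choice; the labeling used in the paper may differ:

```python
import numpy as np

def tet_volume(a, b, c, d):
    """Volume of a tetrahedron from its four vertex coordinates."""
    return abs(np.dot(a - d, np.cross(b - d, c - d))) / 6.0

def face_volume(w1, w2, w3, p1, p2, p3):
    """Volume of the oblique truncated triangular pyramid defined by a
    white-surface face (w1, w2, w3) and its matching pial face (p1, p2, p3),
    as the sum of three non-overlapping tetrahedra (one valid split)."""
    return (tet_volume(w1, w2, w3, p1) +
            tet_volume(w2, w3, p1, p2) +
            tet_volume(w3, p1, p2, p3))

# Sanity check: a right prism with base area 0.5 and height 1 has volume 0.5.
w1, w2, w3 = np.array([0., 0, 0]), np.array([1., 0, 0]), np.array([0., 1, 0])
up = np.array([0., 0, 1])
print(face_volume(w1, w2, w3, w1 + up, w2 + up, w3 + up))  # approximately 0.5
```

Summing `face_volume` over all matched face pairs gives the total volume enclosed between the two surfaces.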

No error other than what is intrinsic to the placement of these surfaces is introduced. The resulting volume can be assigned to each vertex in a similar way as conversion from facewise area to vertexwise area. The above method is the default in **FreeSurfer 6.0.0**.

Given that the volume of the cortex is, ultimately, determined by area and thickness, and these are known to be influenced in general by different factors (Panizzon et al, 2009; Winkler et al, 2010), why would anyone bother measuring volume at all? The answer is that not all factors that can affect the cortex will affect exclusively thickness or area. For example, an infectious process, or the development of a tumor, has the potential to affect both. Volume is a way to assess the effects of such non-specific factors on the cortex. However, even in that case there are better alternatives available, namely, the **non-parametric combination (NPC)** of thickness and area. This use of NPC will be discussed in a future post here in the blog.

- Fischl B, Dale AM. Measuring the thickness of the human cerebral cortex from magnetic resonance images. *Proc Natl Acad Sci U S A*. 2000 Sep 26;97(20):11050-5.
- Greve DN, Fischl B. False positive rates in surface-based anatomical analysis. *Neuroimage*. 2018;171:6-14.
- Hutton C, De Vita E, Ashburner J, Deichmann R, Turner R. Voxel-based cortical thickness measurements in MRI. *Neuroimage*. 2008 May 1;40(4):1701-10.
- Jones SE, Buchbinder BR, Aharon I. Three-dimensional mapping of cortical thickness using Laplace's equation. *Hum Brain Mapp*. 2000 Sep;11(1):12-32.
- Lerch JP, Evans AC. Cortical thickness analysis examined through power analysis and a population simulation. *Neuroimage*. 2005;24(1):163-73.
- MacDonald D, Kabani NJ, Avis D, Evans AC. Automated 3-D extraction of inner and outer surfaces of cerebral cortex from MRI. *Neuroimage*. 2000 Sep;12(3):340-56.
- Panizzon MS, Fennema-Notestine C, Eyler LT, Jernigan TL, Prom-Wormley E, Neale M, et al. Distinct genetic influences on cortical surface area and cortical thickness. *Cereb Cortex*. 2009 Nov;19(11):2728-35.
- Van Essen DC, Drury HA, Joshi S, Miller MI. Functional and structural mapping of human cerebral cortex: solutions are in the surfaces. *Proc Natl Acad Sci U S A*. 1998 Feb 3;95(3):788-95.
- Winkler AM, Kochunov P, Blangero J, Almasy L, Zilles K, Fox PT, et al. Cortical thickness or grey matter volume? The importance of selecting the phenotype for imaging genetics studies. *Neuroimage*. 2010 Nov 15;53(3):1135-46.
- Winkler AM, Sabuncu MR, Yeo BTT, Fischl B, Greve DN, Kochunov P, et al. Measuring and comparing brain cortical surface area and other areal quantities. *Neuroimage*. 2012 Jul 15;61(4):1428-43.
- Winkler AM, Greve DN, Bjuland KJ, Nichols TE, Sabuncu MR, Håberg AK, et al. Joint analysis of cortical area and thickness as a replacement for the analysis of the volume of the cerebral cortex. *Cereb Cortex*. 2018 Feb 1;28(2):738-49.

In FSL, when we create a design using the graphical interface in FEAT, or with the command `Glm`, we are given the opportunity to define, at the higher level, the "Group" to which each observation belongs. When the design is saved, the information from this setting is stored in a text file named something like "design.grp". This file, and thus the group setting, takes different roles depending on whether the analysis is done in FEAT itself, in PALM, or in randomise.

What can be confusing sometimes is that, in all three cases, the "Group" indicator does not refer to an experimental or observational group of any sort. Instead, it refers to *variance groups* (VG) in FEAT, to *exchangeability blocks* (EB) in randomise, and to either VG or EB in PALM, depending on whether the file is supplied with the option `-vg` or `-eb`.

In FEAT, unless there is reason to suspect (or assume) that the variances for different observations are not equal, all subjects should belong to group “1”. If variance groups are defined, then these are taken into account when the variances are estimated. This is only possible if the design matrix is “separable”, that is, it must be such that, if the observations are sorted by group, the design can be constructed by direct sum (i.e., block-diagonal concatenation) of the design matrices for each group separately. A design is not separable if any explanatory variable (EV) present in the model crosses the group borders (see figure below). Contrasts, however, can encompass variables that are defined across multiple VGs.

The variance groups need not match the experimental or observational groups that may exist in the design (for example, in a comparison of patients and controls, the variance groups may be formed based on the sex of the subjects, or on another discrete variable, as opposed to the diagnostic category). Moreover, variance groups can be defined even if all variables in the model are continuous.

In randomise, the same "Group" setting can be supplied with the option `-e design.grp`, thus defining exchangeability blocks. Observations within a block can only be permuted with other observations within that same block. If the option `--permuteBlocks` is also supplied, then the EBs must all be of the same size, and the blocks as a whole are permuted instead. Randomise does not use the concept of variance group: all observations are always members of the same single VG.

In PALM, using `-eb design.grp` has the same effect that `-e design.grp` has in randomise. Further using the option `-whole` is equivalent to using `--permuteBlocks` in randomise. It is also possible to use `-whole` and `-within` together, meaning that the blocks as a whole are shuffled, and further, the observations within each block are also shuffled. In PALM the file supplied with the option `-eb` can have multiple columns, indicating multi-level exchangeability blocks, which are useful in designs with more complex dependence between observations. Using `-vg design.grp` causes PALM to use the *v*- or *G*-statistic, which are replacements for the *t*- and *F*-statistics respectively for the cases of heterogeneous variances. Although VG and EB are not the same thing, and may not always match each other, the VGs can be defined from the EBs, as exchangeability implies that some observations must have the same variance, otherwise permutations are not possible. The option `-vg auto` defines the variance groups from the EBs, even for quite complicated cases.

In both FEAT and PALM, defining VGs will only make a difference if the variance groups are not balanced, i.e., if they do not have the same number of observations, since heteroscedasticity (different variances) only matters in those cases. If the groups have the same size, all subjects can be allocated to a single VG (e.g., all "1").
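The within-block restriction imposed by exchangeability blocks can be sketched as follows; the block assignment below is hypothetical, standing in for the "Group" column of a design.grp file:

```python
import numpy as np

# Shuffle observations only within their exchangeability blocks, as
# randomise does with "-e design.grp" (without --permuteBlocks).
rng = np.random.default_rng(42)
eb = np.array([1, 1, 1, 2, 2, 2, 3, 3, 3])   # hypothetical block labels
idx = np.arange(len(eb))
perm = idx.copy()
for b in np.unique(eb):
    members = idx[eb == b]
    perm[members] = rng.permutation(members)

# Each observation remains within its original block:
print(all(eb == eb[perm]))  # True
```

Permuting the blocks as a whole (`--permuteBlocks` / `-whole`) would instead shuffle the blocks' positions while keeping their internal order, which is why the blocks must then all have the same size.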

To take the multiple testing problem into account, either the test level ($\alpha$), or the p-values can be adjusted, such that instead of controlling the error rate at each individual test, the error rate is controlled for the whole set (family) of tests. Controlling such *family-wise error rate* (FWER) ensures that the chance of finding a significant result *anywhere* in the image is expected to be within a certain predefined level. For example, if there are 1000 voxels, and the FWER-adjusted test level is 0.05, we expect that, if the experiment is repeated for all the voxels 20 times, then on average in one of these repetitions there will be an error somewhere in the image. The adjustment of the p-values or of the test level is done using the distribution of the maximum statistic, something that most readers of this blog are certainly well aware of, as that permeates most of the imaging literature since the early 1990s.

*Have you ever wondered why?* What is so special about the distribution of the maximum that makes it useful to correct the error rate when there are multiple tests?

Say we have a set of $V$ voxels in an image. For a given voxel $v$, $v \in \{1, \ldots, V\}$, with test statistic $T_v$, the probability that $T_v$ is larger than some cutoff $t$ is denoted by:

$$P(T_v > t) = 1 - F_v(t)$$

where $F_v(t)$ is the cumulative distribution function (cdf) of the test statistic. If the cutoff $t$ is used to accept or reject a hypothesis, then we say that we have a *false positive* if an observed $T_v$ is larger than $t$ when there is no actual true effect. A false positive is also known as a type I error (in this post, the only type of error discussed is of the type I).

For an image (or any other set of tests), if there is an error anywhere, we say that a *family-wise error* has occurred. We can therefore define a "family-wise null hypothesis" that there is no signal anywhere; to reject this hypothesis, it suffices to have a single, lonely voxel in which $T_v > t$. With many voxels, the chances of this happening increase, even if no effect is present. We can, however, adjust our cutoff $t$ so that the probability of rejecting such family-wise null hypothesis remains within a certain level, say $\alpha_{\text{FWE}}$.

The "family-wise null hypothesis" is effectively a joint null hypothesis that there is no effect anywhere. That is, it is a union-intersection test (UIT; Roy, 1953). This joint hypothesis is retained if all tests have statistics that are below the significance cutoff. What is the probability of this happening? From the above we know that $P(T_v \leqslant t) = F_v(t)$. The probability of the same happening for all voxels simultaneously is, therefore, simply the product of such probabilities, assuming of course that the voxels are all independent:

$$P\left(\bigcap_{v} \{T_v \leqslant t\}\right) = \prod_{v=1}^{V} F_v(t)$$

Thus, the probability that any voxel has a significant result, which would lead to the occurrence of a family-wise error, is $1 - \prod_{v=1}^{V} F_v(t)$. If all voxels have an identical distribution under the null, the same can be stated as $1 - \left(F(t)\right)^V$.

Consider the maximum of the set of voxels, that is, $M = \max_v T_v$. The random variable $M$ is only smaller than or equal to some cutoff $t$ if all values $T_v$ are smaller than or equal to $t$. If the voxels are independent, this enables us to derive the cdf of $M$:

$$F_M(t) = P(M \leqslant t) = P\left(\bigcap_{v} \{T_v \leqslant t\}\right) = \prod_{v=1}^{V} F_v(t)$$

Thus, the probability that $M$ is larger than some threshold $t$ is $1 - \prod_{v=1}^{V} F_v(t)$. If all voxels have an identical distribution under the null, the same can be stated as $1 - \left(F(t)\right)^V$.

These results, lo and behold, are the same as those used for the UIT above, hence how the distribution of the maximum can be used to control the family-wise error rate (if the distribution of the maximum is computed via permutations, independence is not required).
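For independent, identically distributed voxels, the relation between the distribution of the maximum and the cdf of a single test is easy to verify numerically; a minimal sketch, assuming a standard normal null:

```python
import numpy as np
from scipy.stats import norm

# Numerical check that, for V i.i.d. statistics, P(max > t) = 1 - F(t)^V.
rng = np.random.default_rng(1)
V, nsim, t = 100, 50000, 3.0

# Empirical distribution of the maximum over V null statistics.
M = rng.standard_normal((nsim, V)).max(axis=1)
empirical = (M > t).mean()

# Closed-form probability from the cdf of a single statistic.
theoretical = 1.0 - norm.cdf(t) ** V

print(empirical, theoretical)  # both close to 0.126
```

Note how, even at a cutoff as high as $t = 3$, the chance of at least one exceedance among 100 independent tests is already above 12%, which is why uncorrected thresholds inflate the FWER.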

The above is not the only way in which we can see why the distribution of the maximum allows the control of the family-wise error rate. The work by Marcus, Peritz and Gabriel (1976) showed that, in the context of multiple testing, the null hypothesis for a particular test $v$ can be rejected provided that all possible joint (multivariate) tests done within the set and including $v$ are also significant, and doing so controls the family-wise error rate. For example, if there are four tests, $v \in \{1, 2, 3, 4\}$, the test for $v = 1$ is considered significant if the joint tests using (1,2,3,4), (1,2,3), (1,2,4), (1,3,4), (1,2), (1,3), (1,4) and (1) are all significant (that is, all those that include $v = 1$). Such a joint test can be essentially any valid test, including Hotelling's $T^2$, MANOVA/MANCOVA, or NPC (non-parametric combination), all of which are based on recomputing the test statistic from the original data, or others, based on the test statistics or p-values of each of the elementary tests, as in a meta-analysis.

Such a closed testing procedure (CTP) incurs an additional problem, though: the number of joint tests that needs to be done is $2^V - 1$, which in imaging applications renders it unfeasible. However, there is one particular joint test that provides a direct algorithmic shortcut: using the maximum statistic, $\max_v T_v$, as the statistic for the joint test. The maximum across all tests is also the maximum for any subset of tests that includes it, such that these subsets can be skipped altogether. This gives a vastly efficient algorithmic shortcut to a CTP, as shown by Westfall and Young (1993).
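A sketch of this single-step "maxT" procedure on toy two-sample data, using a simple mean difference as the statistic for brevity (all names and values below are illustrative, not from any of the cited papers):

```python
import numpy as np

# FWER-adjusted p-values from the permutation distribution of the maximum
# statistic (Westfall & Young, 1993, single-step variant).
rng = np.random.default_rng(0)
n, V, nperm = 20, 50, 2000
group = np.repeat([0, 1], n // 2)
data = rng.standard_normal((n, V))
data[group == 1, :2] += 2.0              # true effect in "voxels" 0 and 1

def stat(d, g):
    # Simple mean-difference statistic, one value per voxel.
    return d[g == 1].mean(axis=0) - d[g == 0].mean(axis=0)

t_obs = stat(data, group)
max_dist = np.array([stat(data, rng.permutation(group)).max()
                     for _ in range(nperm)])

# Adjusted p-value per voxel: fraction of permutations in which the
# maximum across all voxels reaches that voxel's observed statistic.
p_fwer = (max_dist[:, None] >= t_obs[None, :]).mean(axis=0)
print(p_fwer[:2], p_fwer[2:].min())      # signal voxels small, nulls larger
```

Because the reference distribution is that of the maximum, no joint test over any subset needs to be computed explicitly, which is exactly the shortcut described above.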

One does not need to chase the original papers cited above (although doing so cannot hurt). Broadly, the same can be concluded based solely on intuition: if the distribution of some test statistic that is not the distribution of the maximum within the image were used as the reference to compute the (FWER-adjusted) p-values at a given voxel $v$, then the probability of finding a voxel with a test statistic larger than $T_v$ anywhere could not be determined: there could always be some other voxel $v'$ with an even larger statistic (i.e., $T_{v'} > T_v$), but the probability of that happening would not be captured by the distribution of a non-maximum. Hence the chance of finding a significant voxel *anywhere* in the image under the null hypothesis (the very definition of FWER) would not be controlled. Using the absolute maximum eliminates this logical leakage.

- Marcus R, Peritz E, Gabriel KR. On closed testing procedures with special reference to ordered analysis of variance. *Biometrika*. 1976 Dec;63(3):655-60.
- Nichols T, Hayasaka S. Controlling the familywise error rate in functional neuroimaging: a comparative review. *Stat Methods Med Res*. 2003 Oct;12(5):419-46.
- Roy SN. On a heuristic method of test construction and its use in multivariate analysis. *Ann Math Stat*. 1953 Jun;24(2):220-38.
- Westfall PH, Young SS. *Resampling-based multiple testing: examples and methods for p-value adjustment*. New York, Wiley, 1993.

Permutation tests are more robust and help to make scientific results more reproducible by depending on fewer assumptions. However, they are computationally intensive, as recomputing a model thousands of times can be slow. The purpose of this post is to briefly list some options available for speeding up permutation tests.

Firstly, no speed-ups may be needed: for small sample sizes, or low resolutions, or small regions of interest, a permutation test can run in a matter of minutes. For larger data, however, accelerations may be of use. One option is acceleration through parallel processing or GPUs (for example applications of the latter, see Eklund et al., 2012, Eklund et al., 2013 and Hernández et al., 2013; references below), though this does require specialised implementation. Another option is to reduce the computational burden by exploiting the properties of the statistics and their distributions. A menu of options includes:

- Do few permutations (shorthand name: *fewperms*). The results remain valid on average, although the p-values will have higher variability.
- Keep permuting until a fixed number of permutations with statistic larger than the unpermuted is found (a.k.a. negative binomial; shorthand name: *negbin*).
- Do a few permutations, then approximate the tail of the permutation distribution by fitting a generalised Pareto distribution to its tail (shorthand name: *tail*).
- Approximate the permutation distribution with a gamma distribution, using simple properties of the test statistic itself, amazingly not requiring any permutations at all (shorthand name: *noperm*).
- Do a few permutations, then approximate the full permutation distribution by fitting a gamma distribution (shorthand name: *gamma*).
- Run permutations on only a few voxels, then fill the missing ones using low-rank matrix completion theory (shorthand name: *lowrank*).
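As an illustration, the negative binomial scheme (*negbin*) can be sketched as follows, with a standard normal stand-in for the permutation distribution; the simple ratio estimator shown here is for illustration only (refined estimators exist):

```python
import numpy as np

# Negative binomial acceleration: keep drawing permutations until a fixed
# number of exceedances of the observed statistic is reached, then estimate
# the p-value from the number of draws that were needed.
rng = np.random.default_rng(7)

def negbin_pvalue(stat_obs, draw_stat, n_exceed=20, max_perm=100000):
    exceed, j = 0, 0
    while exceed < n_exceed and j < max_perm:
        j += 1
        if draw_stat() >= stat_obs:
            exceed += 1
    return exceed / j     # simple ratio estimate of the p-value

# Toy null: standard normal "permutation" draws, observed statistic 1.5.
p = negbin_pvalue(1.5, lambda: rng.standard_normal())
print(p)  # close to P(Z >= 1.5), i.e., about 0.067
```

The appeal is that small p-values, which need many exceedances under naive permutation, stop early here, while clearly null tests terminate after only a few hundred draws.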

These strategies allow accelerations >100x, yielding nearly identical results as in the non-accelerated case. Some, such as the tail approximation, are generic enough to be used in nearly all of the most common scenarios, including univariate and multivariate tests, spatial statistics, and correction for multiple testing.

In addition to accelerating permutation tests, some of these strategies, such as *tail* and *noperm*, allow continuous p-values to be found, and refine the p-values far into the tail of the distribution, thus avoiding the usual discreteness of p-values, which can be a problem in some applications if too few permutations are done.

These methods are available in the tool **PALM — Permutation Analysis of Linear Models** — and the complete description, evaluation, and application to the re-analysis of a voxel-based morphometry study (Douaud et al., 2007) have just been published in **Winkler et al., 2016** (for the Supplementary Material, click **here**). The paper includes a flow chart prescribing these various approaches for each case, reproduced below.

The hope is that these accelerations will facilitate the use of permutation tests and, if used in combination with hardware and/or software improvements, can further expedite computation, leaving little reason not to use these tests.

- Douaud G, Smith S, Jenkinson M, Behrens T, Johansen-Berg H, Vickers J, James S, Voets N, Watkins K, Matthews PM, James A. Anatomically related grey and white matter abnormalities in adolescent-onset schizophrenia. *Brain*. 2007 Sep;130(Pt 9):2375-86.
- Eklund A, Andersson M, Knutsson H. fMRI analysis on the GPU – possibilities and challenges. *Comput Methods Programs Biomed*. 2012 Feb;105(2):145-61.
- Eklund A, Dufort P, Forsberg D, LaConte SM. Medical image processing on the GPU – past, present and future. *Med Image Anal*. 2013;17(8):1073-94.
- Hernández M, Guerrero GD, Cecilia JM, García JM, Inuggi A, Jbabdi S, et al. Accelerating fibre orientation estimation from diffusion weighted magnetic resonance imaging using GPUs. *PLoS One*. 2013;8(4):e61892.
- Winkler AM, Ridgway GR, Webster MA, Smith SM, Nichols TE. Permutation inference for the general linear model. *Neuroimage*. 2014 May 15;92:381-97.
- Winkler AM, Ridgway GR, Douaud G, Nichols TE, Smith SM. Faster permutation inference in brain imaging. *Neuroimage*. 2016;141:502-16.

**Contributed to this post:** Tom Nichols, Ged Ridgway.