Simplifying Freedman-Lane

Doing a permutation test with the general linear model (GLM) in the presence of nuisance variables can be challenging. Let the model be:

\mathbf{Y} = \mathbf{X}\boldsymbol{\beta} + \mathbf{Z}\boldsymbol{\gamma} + \boldsymbol{\epsilon}

where \mathbf{Y} is a matrix of observed variables, \mathbf{X} is a matrix of predictors of interest, \mathbf{Z} is a matrix of covariates (of no interest), and \boldsymbol{\epsilon} is a matrix of the same size as \mathbf{Y} with the residuals.

Because the interest is in testing the relationship between \mathbf{Y} and \mathbf{X}, in principle it would be these that would need be permuted, but doing so also breaks the relationship with \mathbf{Z}, which would be undesirable. Over the years, many methods have been proposed. A review can be found in Winkler et al. (2014); other previous work include the papers by Anderson and Legendre (1999) and Anderson and Robinson (2001).

One of these various methods is the one published in Freedman and Lane (1983), which consists of permuting data that has been residualised with respect to the covariates, then estimated covariate effects added back, then the full model fitted again. The procedure can be performed through the following steps:

  1. Regress \mathbf{Y} against the full model that contains both the effects of interest and the nuisance variables, i.e., \mathbf{Y} = \mathbf{X}\boldsymbol{\beta} + \mathbf{Z}\boldsymbol{\gamma} + \boldsymbol{\epsilon}. Use the estimated parameters \boldsymbol{\hat{\beta}} to compute the statistic of interest, and call this statistic T_{0}.
  2. Regress \mathbf{Y} against a reduced model that contains only the covariates, i.e. \mathbf{Y} = \mathbf{Z}\boldsymbol{\gamma} + \boldsymbol{\epsilon}_{\mathbf{Z}}, obtaining estimated parameters \boldsymbol{\hat{\gamma}} and estimated residuals \boldsymbol{\hat{\epsilon}}_{\mathbf{Z}}.
  3. Compute a set of permuted data \mathbf{Y}^{*}_{j}. This is done by pre-multiplying the residuals from the reduced model produced in the previous step, \boldsymbol{\hat{\epsilon}}_{\mathbf{Z}}, by a permutation matrix, \mathbf{P}{j}, then adding back the estimated nuisance effects, i.e. \mathbf{Y}^{*}_{j} = \mathbf{P}_{j}\boldsymbol{\hat{\epsilon}}_{\mathbf{Z}} + \mathbf{Z}\boldsymbol{\hat{\gamma}}.
  4. Regress the permuted data \mathbf{Y}^{*}_{j} against the full model, i.e. \mathbf{Y}^{*}_{j} = \mathbf{X}\boldsymbol{\beta} + \mathbf{Z}\boldsymbol{\gamma} + \boldsymbol{\epsilon}
  5. Use the estimated \boldsymbol{\hat{\beta}}^{*}_{j} to compute the statistic of interest. Call this statistic T^{*}_{j}.
  6. Repeat the Steps 2-4 many times to build the reference distribution of T^{*} under the null hypothesis of no association between \mathbf{Y} and \mathbf{X}.
  7. Count how many times T^{*}_{j} was found to be equal to or larger than T_{0}, and divide the count by the number of permutations; the result is the p-value.

Steps 1-4 can be written concisely as:

\left(\mathbf{P}_{j}\mathbf{R}_{\mathbf{Z}}+\mathbf{H}_{\mathbf{Z}}\right) \mathbf{Y} = \mathbf{X}\boldsymbol{\beta} + \mathbf{Z}\boldsymbol{\gamma}+\boldsymbol{\epsilon}

where \mathbf{P}_{j} is a permutation matrix (for the j-th permutation, \mathbf{H}_{\mathbf{Z}}=\mathbf{Z}\mathbf{Z}^{+} is the hat matrix due to the covariates, and \mathbf{R}_{\mathbf{Z}} = \mathbf{I} - \mathbf{H}_{\mathbf{Z}} is the residual forming matrix; the superscript symbol ^{+} represents a matrix pseudo-inverse.

In page 385 of Winkler et al. (2014), my colleagues and I state that:

[…] add the nuisance variables back in Step 3 is not strictly necessary, and the model can be expressed simply as \mathbf{P}_{j}\mathbf{R}_{\mathbf{Z}}\mathbf{Y}=\mathbf{X}\boldsymbol{\beta}+\mathbf{Z}\boldsymbol{\gamma}+\boldsymbol{\epsilon}, implying that the permutations can actually be performed just by permuting the rows of the residual-forming matrix \mathbf{R}_{\mathbf{Z}}.

However, in the paper we do not offer any proof of this important result, that allows algorithmic acceleration. Here we remedy that. Let’s start with two brief lemmata:

Lemma 1: The product of a hat matrix and its corresponding residual-forming matrix is zero, that is, \mathbf{R}_{\mathbf{Z}}\mathbf{H}_{\mathbf{Z}} = \mathbf{H}_{\mathbf{Z}}\mathbf{R}_{\mathbf{Z}} = \mathbf{0}.

This is because \mathbf{R}_{\mathbf{Z}} = \mathbf{I} - \mathbf{H}_{\mathbf{Z}}, hence \mathbf{R}_{\mathbf{Z}}\mathbf{H}_{\mathbf{Z}} = \mathbf{R}_{\mathbf{Z}}(\mathbf{I} - \mathbf{R}_{\mathbf{Z}}) = \mathbf{R}_{\mathbf{Z}} - \mathbf{R}_{\mathbf{Z}}\mathbf{R}_{\mathbf{Z}} = \mathbf{R}_{\mathbf{Z}} - \mathbf{R}_{\mathbf{Z}} = \mathbf{0} since \mathbf{R}_{\mathbf{Z}} is idempotent.

Lemma 2: Given a GLM expressed as \mathbf{Y} = \mathbf{X}\boldsymbol{\beta} + \mathbf{Z}\boldsymbol{\gamma} + \boldsymbol{\epsilon}, we can estimate \boldsymbol{\beta} from an equivalent GLM written as \mathbf{R}_{\mathbf{Z}}\mathbf{Y} = \mathbf{R}_{\mathbf{Z}}\mathbf{X}\boldsymbol{\beta} + \mathbf{R}_{\mathbf{Z}}\boldsymbol{\epsilon}.

To see why, note that multiplying both sides of an equation by the same factor does not change it. Let’s start from:

\mathbf{R}_{\mathbf{Z}}\mathbf{Y} = \mathbf{R}_{\mathbf{Z}}(\mathbf{X}\boldsymbol{\beta} + \mathbf{Z}\boldsymbol{\gamma} + \boldsymbol{\epsilon})

Then remove the parentheses:

\mathbf{R}_{\mathbf{Z}}\mathbf{Y} = \mathbf{R}_{\mathbf{Z}}\mathbf{X}\boldsymbol{\beta} + \mathbf{R}_{\mathbf{Z}}\mathbf{Z}\boldsymbol{\gamma} + \mathbf{R}_{\mathbf{Z}}\boldsymbol{\epsilon}

Since \mathbf{R}_{\mathbf{Z}} = \mathbf{I} - \mathbf{H}_{\mathbf{Z}}:

\mathbf{R}_{\mathbf{Z}}\mathbf{Y} = \mathbf{R}_{\mathbf{Z}}\mathbf{X}\boldsymbol{\beta} + (\mathbf{I}-\mathbf{H}_{\mathbf{Z}})\mathbf{Z}\boldsymbol{\gamma} + \mathbf{R}_{\mathbf{Z}}\boldsymbol{\epsilon}

and that \mathbf{H}_{\mathbf{Z}} = \mathbf{Z}\mathbf{Z}^{+}:

\mathbf{R}_{\mathbf{Z}}\mathbf{Y} = \mathbf{R}_{\mathbf{Z}}\mathbf{X}\boldsymbol{\beta} + (\mathbf{Z}-\mathbf{Z}\mathbf{Z}^{+}\mathbf{Z})\boldsymbol{\gamma} + \mathbf{R}_{\mathbf{Z}}\boldsymbol{\epsilon}

Since \mathbf{Z}^{+}\mathbf{Z}=\mathbf{I}:

\mathbf{R}_{\mathbf{Z}}\mathbf{Y} = \mathbf{R}_{\mathbf{Z}}\mathbf{X}\boldsymbol{\beta} + \mathbf{0}\boldsymbol{\gamma} + \mathbf{R}_{\mathbf{Z}}\boldsymbol{\epsilon}

\mathbf{R}_{\mathbf{Z}}\mathbf{Y} = \mathbf{R}_{\mathbf{Z}}\mathbf{X}\boldsymbol{\beta} + \boldsymbol{\epsilon}_{\mathbf{Z}}

where \boldsymbol{\epsilon}_{\mathbf{Z}}=  \mathbf{R}_{\mathbf{Z}}\boldsymbol{\epsilon}.

Main result

Now we are ready for the main proof. The Freedman-Lane model is:

\left(\mathbf{P}_{j}\mathbf{R}_{\mathbf{Z}}+\mathbf{H}_{\mathbf{Z}}\right) \mathbf{Y} = \mathbf{X}\boldsymbol{\beta} + \mathbf{Z}\boldsymbol{\gamma}+\boldsymbol{\epsilon}

Per Lemma 2, it can be rewritten as:

\mathbf{R}_{\mathbf{Z}}\left(\mathbf{P}_{j}\mathbf{R}_{\mathbf{Z}}+\mathbf{H}_{\mathbf{Z}}\right) \mathbf{Y} = \mathbf{R}_{\mathbf{Z}}\mathbf{X}\boldsymbol{\beta}+\boldsymbol{\epsilon}_{\mathbf{Z}}

Dropping the parenthesis:

\mathbf{R}_{\mathbf{Z}}\mathbf{P}_{j}\mathbf{R}_{\mathbf{Z}}\mathbf{Y} +\mathbf{R}_{\mathbf{Z}}\mathbf{H}_{\mathbf{Z}} \mathbf{Y} = \mathbf{R}_{\mathbf{Z}}\mathbf{X}\boldsymbol{\beta}+\boldsymbol{\epsilon}_{\mathbf{Z}}

Per Lemma 1:

\mathbf{R}_{\mathbf{Z}}\mathbf{P}_{j}\mathbf{R}_{\mathbf{Z}}\mathbf{Y} + \mathbf{0}\mathbf{Y} = \mathbf{R}_{\mathbf{Z}}\mathbf{X}\boldsymbol{\beta}+\boldsymbol{\epsilon}_{\mathbf{Z}}

\mathbf{R}_{\mathbf{Z}}\mathbf{P}_{j}\mathbf{R}_{\mathbf{Z}}\mathbf{Y}= \mathbf{R}_{\mathbf{Z}}\mathbf{X}\boldsymbol{\beta}+\boldsymbol{\epsilon}_{\mathbf{Z}}

What is left has the same form as the result of Lemma 2. Thus, reversing it, we obtain the final result:

\mathbf{P}_{j}\mathbf{R}_{\mathbf{Z}} \mathbf{Y} = \mathbf{X}\boldsymbol{\beta} + \mathbf{Z}\boldsymbol{\gamma} +\boldsymbol{\epsilon}

Hence, the hat matrix \mathbf{H}_{\mathbf{Z}} cancels out, meaning that it is not necessary. Results are the same both ways.

References

The German tank problem and the novel coronavirus.

We often think of statistics as a way to summarize large amounts of data. For example, we can collect data from thousands of subjects, and extract a single number that tells something about these subjects. The well known German tank problem shows that, in a certain way, statistics can also be used for the opposite: using incomplete data and a few reasonable assumptions (or real knowledge), statistics provides way to estimate information that offer a panoramic view of all the data. Historical problems are interesting on their own. Yet, it is not always that we see so clearly consequential historical events at the time they happen — like now.

In the Second World War, as in any other war, information could be more valuable than anything else. Intelligence reports (such as from spies) would feed the Allies with information about the industrial capacity of Nazi Germany, including details about things such as the number of tanks produced. This kind of information can have far reaching impact and not only determine the outcome of a battle, but also if a battle would even even happen or with what preparations, as the prospect of finding a militarily superior opponent is often a great deterrent.

Sometimes, German tanks, as the well known Panzer, could be captured and carefully inspected. Among the details noted were the serial number printed in various pieces, such as chassis, gearboxes, and the serial numbers of the moulds used to produce the wheels. With the serial number of even a single chassis, for example one can estimate the total number of tanks produced; knowing the serial number of a single wheel mould allows the estimation of the total number of moulds, and thus, how many wheels can be produced in a certain amount of time. But how?

If serial numbers are indeed serial, e.g., \{1, 2, 3, \ldots, N\}, growing uniformly and without gaps, and we see a tank that has a serial number S, then clearly at least S tanks must have been produced. But could we have a better guess?

Let’s start by reversing the problem: suppose we knew N. In that case, what would be the average value of the serial numbers of all N tanks? The average for uniformly distributed data like this would be M = \frac{1 + N}{2}, that is, the average of the first and last serial numbers.

Now, say we have only one sighting of a tank, and that has serial number S. Then our best guess for the average serial number is S itself, as we have no additional information. Thus, with M = S, our guess would be N = 2S - 1 (that is, reorganizing the terms of the previous equation for M). Note that, for one sighting, this formula guarantees that N is larger or equal than S, which makes sense: we cannot have an estimate for N that is smaller than the serial number S itself.

What if we had not just one, but multiple sightings? Call the number of sightings K. The mean is now M = \frac{S_1 + S_2 + \ldots + S_K}{K}, for ordered serial numbers \{S_1, S_2, \ldots, S_K\}. Clearly, we can’t use the same formula, because if M is much smaller than S_K (say, because we have seen many small serial numbers, but just a handful of larger ones), N could incorrectly be estimated as less than S_K, which makes no sense. At least S_K tanks must exist.

While incorrect for K > 1, the above formula gives invaluable insight: it shows that for such uniformly distributed data, approximately half of the tanks have serial number above M, the other half below M. Extending the idea, and still under the assumption that the serial numbers are uniform, we can conclude that the number of tanks below the lowest serial number S_1 (which is S_1 - 1) must be approximately the same as the (unknown) number of tanks above the highest serial number S_K. So, a next better estimate could be to use N = S_K + S_1 - 1.

We can still do better, though. Since we have K sightings, we can estimate what is the average interval between sightings, i.e., \frac{S_K}{K}. As it is based on all K sightings, this gives a better estimate of the spacing between the serial numbers than the single sighting S_1. The result can be added to S_K. The final estimate then becomes N = S_K + \frac{S_K}{K} - 1.

To make this concrete, say we saw tanks numbered \{47, 62, 104, 155, 159\}. Then our best guess would be N \approx 190.

At the end of the war, estimates obtained using the above method proved remarkably accurate, much more so than information provided by spies and other intelligence reports.

Let’s now see a similar example that is contemporary to us. Take the current pandemic caused by a novel coronavirus. The World Health Organization stated officially, in 14th January 2020, when there were 41 official cases reported in China, that there was no evidence for human-to-human transmission. Yet, when the first 3 cases outside China were confirmed in 16th January 2020, epidemiologists at the Imperial College London were quick to find out that that must not have been true. Rather, the real number of cases was likely well above 1700.

How did they make that estimate? The key insight was the realisation that only a small number of people in any major city travels internationally, particularly in such a short time span like that given by the time until the onset of symptoms for this kind of respiratory disease. If one can estimate prevalence among those who travelled, that would be a good approximation to the prevalence among those who live in the city, assuming that those who travel are an unbiased sample of the population.

Following this idea, we have: \frac{C_t}{N_t} \approx \frac{C_s}{N_s}, that is, the number of cases among those who travelled (C_t) divided by the total number of people who travelled (N_t) is expected to be approximately the same as the number of cases among those who stayed (C_s) divided by the total number of people who stayed (live) in the city (N_s).

The number of people served by the international airport of Wuhan is about 19 million (the size of the Wuhan metropolitan area), and the average daily number of outbound international passengers in previous years was 3301 for that time of the year (a figure publicly known, from IATA). Unfortunately, little was known outside China about the time taken between exposure to the virus and the onset of symptoms. The researchers then resorted to a proxy: the time known for the related severe respiratory disorder known as MERS, also caused by a coronavirus, which is about 10 days. Thus, we can estimate N_t= 3301 \times 10=33010 people travelling out, and N_s = 19,000,000 staying in the city. The number of known international cases was at the time C_t = 3. Hence:

C_s \approx \frac{3\times 19,000,000}{33010}\approx 1727 cases

So, using remarkably simple maths, simpler even than in our WWII German tank example, the scientists estimated that the number of actual cases in the city of Wuhan was likely far above the official figure of 41 cases. The researchers were careful to indicate that, should the probability of travelling be higher among of being exposed, the number of actual cases could be smaller. The converse is true: should travellers be wealthier (thus less likely to be exposed to a possible zoonosis), the number of actual cases would be higher.

Importantly, it is not at all likely that 1700 people would have contracted such a zoonosis from wild animals in a dense urban area like Wuhan, hence human-to-human transmission must had been occurring. Eventually the WHO confirmed human-to-human transmission on 19th January 2020. Two days later, Chinese authorities began locking down and sealing off Wuhan, thus putting into place a plan to curb the transmission.

To find out more about the original problem of the number of tanks, and also for other methods of estimation for the same problem, a good start is this article. Also invaluable, for various estimation problems related to the fast dissemination of the novel coronavirus, are all the reports by the epidemiology team at the Imperial College London, which can be found here.