Redundancy in canonical correlation analysis

In canonical correlation analysis (CCA; Hotelling, 1936), the absolute value of a correlation is not always that helpful. For example, large canonical correlations may arise simply because a large number of variables is investigated with a relatively small sample size: with so many variables, there are many opportunities for finding mixtures on both sides that are highly correlated with one another.

Motivated by this perceived difficulty in the interpretation of results, Stewart and Love (1968) proposed the computation of what has been termed a redundancy index. It works as follows.

Let \mathbf{Y}_{N \times P} and \mathbf{X}_{N \times Q} be two sets of variables over which CCA is computed. We find canonical coefficients \mathbf{A}_{P \times K} and \mathbf{B}_{Q \times K}, with K=\min(P,Q), such that the canonical variables \mathbf{U}_{N \times K}=\mathbf{Y}\mathbf{A} and \mathbf{V}_{N \times K}=\mathbf{X}\mathbf{B} have a maximal, diagonal correlation structure; this diagonal contains the ordered canonical correlations r_k.
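
For concreteness, below is a minimal sketch in Python/NumPy of one common way of computing classical CCA, via the singular value decomposition of the whitened cross-covariance matrix; the function and variable names are illustrative only (this is not the redundancy.m script mentioned in the update below).

  import numpy as np

  def cca(Y, X):
      # Classical CCA via SVD of the whitened cross-covariance.
      # Y is N x P, X is N x Q, both assumed full column rank.
      # Returns A (P x K), B (Q x K) and the canonical correlations r (K,).
      Y = Y - Y.mean(axis=0)                        # centre both sets
      X = X - X.mean(axis=0)
      N = Y.shape[0]
      Syy = Y.T @ Y / (N - 1)                       # within-set covariances
      Sxx = X.T @ X / (N - 1)
      Syx = Y.T @ X / (N - 1)                       # between-set covariance
      Wy = np.linalg.inv(np.linalg.cholesky(Syy)).T # whitening transforms
      Wx = np.linalg.inv(np.linalg.cholesky(Sxx)).T
      L, r, Mt = np.linalg.svd(Wy.T @ Syx @ Wx)     # SVD of whitened cross-covariance
      K = min(Y.shape[1], X.shape[1])
      A = Wy @ L[:, :K]                             # canonical coefficients, left set
      B = Wx @ Mt.T[:, :K]                          # canonical coefficients, right set
      return A, B, r[:K]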

Once CCA has been computed, we can find the correlations between the original variables and the canonical variables. Let \mathbf{\tilde{A}}_{P \times K}=\text{corr}(\mathbf{Y},\mathbf{U}) and \mathbf{\tilde{B}}_{Q \times K}=\text{corr}(\mathbf{X},\mathbf{V}) be such correlations, which are termed canonical loadings or structure coefficients. Now compute the mean square of each of the columns of \mathbf{\tilde{A}} and \mathbf{\tilde{B}}; these represent the variance extracted by the corresponding canonical variable. That is:

  • Variance extracted by canonical variable \mathbf{u}_{k}: \upsilon_k = \frac{1}{P}\sum_{p=1}^{P}\tilde{a}_{pk}^{2}
  • Variance extracted by canonical variable \mathbf{v}_{k}: \nu_k = \frac{1}{Q}\sum_{q=1}^{Q}\tilde{b}_{qk}^{2}

These quantities represent the mean variance extracted from the original variables by each of the canonical variables (on each side).
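
Continuing the sketch above, and assuming the same variable names, the loadings and the variance extracted by each canonical variable could be computed along these lines:

  import numpy as np

  def variance_extracted(Y, X, A, B):
      # Canonical loadings (structure coefficients) and the mean variance
      # extracted by each canonical variable, following the formulas above.
      Yc = Y - Y.mean(axis=0)
      Xc = X - X.mean(axis=0)
      U = Yc @ A                                    # canonical variables, N x K
      V = Xc @ B
      P, Q = Y.shape[1], X.shape[1]
      At = np.corrcoef(Yc, U, rowvar=False)[:P, P:] # corr(Y, U), P x K
      Bt = np.corrcoef(Xc, V, rowvar=False)[:Q, Q:] # corr(X, V), Q x K
      upsilon = (At ** 2).mean(axis=0)              # variance extracted by each u_k
      nu      = (Bt ** 2).mean(axis=0)              # variance extracted by each v_k
      return At, Bt, upsilon, nu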

Now compute the proportion of variance of one canonical variable (say, \mathbf{u}_{k}) explained by the corresponding canonical variable on the other side (say, \mathbf{v}_{k}). This is given simply by r_k^2, the usual coefficient of determination.

The redundancy index for each canonical variable is then the product of \upsilon_k and r_k^2 for the left side of the CCA, and the product of \nu_k and r_k^2 for the right side. That is, the index is not symmetric. It measures the proportion of variance in one of the two sets of variables that is explained by the correlation between the k-th pair of canonical variables.
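
In code, still assuming the outputs of the sketches above (r from cca, upsilon and nu from variance_extracted), this amounts to:

  r2 = r ** 2            # squared canonical correlations
  red_y = upsilon * r2   # redundancy of each canonical variable, left set (Y)
  red_x = nu * r2        # redundancy of each canonical variable, right set (X)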

The sum of the redundancies over all K canonical variables on one side or the other forms a global redundancy metric, which indicates the proportion of the variance on a given side that is explained by the other side.
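
With the per-variable redundancies from the previous snippet, the global metric is simply their sum:

  global_red_y = red_y.sum()   # proportion of the variance in Y explained by X
  global_red_x = red_x.sum()   # proportion of the variance in X explained by Y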

This global redundancy can be scaled to unity, such that the redundancies for each of the canonical variables on a given side can be interpreted as proportions of the total redundancy.
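
Under the same assumed names, this rescaling is just a division by the respective total:

  prop_red_y = red_y / global_red_y   # proportion of total redundancy, left set
  prop_red_x = red_x / global_red_x   # proportion of total redundancy, right set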

If you follow the original paper by Stewart and Love (1968), \upsilon_k and \nu_k correspond to column III of their Table 2, the redundancy of each canonical variable on each side corresponds to column IV, and the proportion of total redundancy is in column V.

Another reference on the same topic that is worth reading is Miller (1981), in which the author argues that redundancy sits somewhere between CCA itself (fully symmetric) and multiple regression (fully asymmetric).

Reference

Update

  • 28.Jun.2020: A script that computes the redundancy indices is available here: redundancy.m (work with Thomas Wassenaar, University of Oxford).
