Three HCP utilities

If you are working with data from the Human Connectome Project (HCP), perhaps these three small Octave/MATLAB utilities may be of some use:

  • hcp2blocks.m: Takes the restricted file with information about kinship and zygosity and produces a multi-level exchangeability blocks file that can be used with PALM for permutation inference. It is fully described here.
  • hcp2solar.m: Takes restricted and unrestricted files to produce a pedigree file that can be used with SOLAR for heritability and genome-wide association analyses.
  • picktraits.m: Takes either restricted or unrestricted files, a list of traits and a list of subject IDs to produce tables with selected traits for the selected subjects. These can be used to, e.g., produce design matrices for subsequent analysis.

These functions need to parse relatively large CSV files, which is somewhat inefficient in MATLAB and Octave. Still, since these commands usually have to be executed only once for a particular analysis, a 1-2 minute wait seems acceptable.

If downloaded directly from the above links, remember also to download the prerequisites: strcsvread.m and strcsvwrite.m. Alternatively, clone the full repository from GitHub. The link is this. Other tools may be added in the future.

A fourth utility

For the HCP-S1200 release (March/2017), zygosity information is provided in the fields ZygositySR (self-reported zygosity) and ZygosityGT (zygosity determined by genetic methods for select subjects). If needed, these two fields can be merged into a new field named simply Zygosity. To do so, use a fourth utility, command mergezyg.

Variance components in genetic analyses

Pedigree-based analyses allow investigation of genetic and environmental influences on anatomy, physiology, and behaviour.

Methods based on components of variance have been used extensively to assess genetic influences and identify loci associated with various traits quantifying aspects of anatomy, physiology, and behaviour, in both normal and pathological conditions. In an earlier post, indices of genetic resemblance between relatives were presented, and in the last post, the kinship matrix was defined. In this post, these topics are used to present a basic model that allows partitioning of the phenotypic variance into sources of variation that can be ascribed to genetic, environmental, and other factors.

A simple model

Consider the model:

\mathbf{Y} = \mathbf{X}\mathbf{B} + \boldsymbol{\Upsilon}

where, for S subjects, T traits, P covariates and K variance components, \mathbf{Y}_{S \times T} are the observed trait values for each subject, \mathbf{X}_{S \times P} is a matrix of covariates, \mathbf{B}_{P \times T} is a matrix of unknown covariates’ weights, and \boldsymbol{\Upsilon}_{S \times T} are the residuals after the covariates have been taken into account.

The elements of each column t of \boldsymbol{\Upsilon} are assumed to follow a multivariate normal distribution \mathcal{N}\left(0;\mathbf{S}\right), where \mathbf{S} is the between-subject covariance matrix. The elements of each row s of \boldsymbol{\Upsilon} are assumed to follow a normal distribution \mathcal{N}\left(0;\mathbf{R}\right), where \mathbf{R} is the between-trait covariance matrix. Both \mathbf{R} and \mathbf{S} are seen as the sum of K variance components, i.e. \mathbf{R} = \sum_{k} \mathbf{R}_{k} and \mathbf{S} = \sum_{k} \mathbf{S}_{k}. For a discussion on these equalities, see Eisenhart (1947) [see references at the end].

An equivalent model

The same model can be written in an alternative way. Let \mathbf{y}_{S \cdot T \times 1} be the stacked vector of traits, \mathbf{\tilde{X}}_{S \cdot T \times P \cdot T} = \mathbf{X} \otimes \mathbf{I}_{T \times T} is the matrix of covariates, \boldsymbol{\beta}_{P \cdot T \times 1} the vector with the covariates’ weights, \boldsymbol{\upsilon}_{S \cdot T \times 1} the residuals after the covariates have been taken into account, and \otimes represent the Kronecker product. The model can then be written as:

\mathbf{y} = \mathbf{\tilde{X}}\boldsymbol{\beta} + \boldsymbol{\upsilon}

The stacked residuals \boldsymbol{\upsilon} is assumed to follow a multivariate normal distribution \mathcal{N}\left(0;\boldsymbol{\Omega}\right), where \boldsymbol{\Omega} can be seen as the sum of K variance components:

\boldsymbol{\Omega} = \sum_{k} \mathbf{R}_k \otimes \mathbf{S}_k

The \boldsymbol{\Omega} here is the same as in Almasy and Blangero (1998). \mathbf{S}_k can be modelled as correlation matrices. The associated scalars are absorbed into the (to be estimated) \mathbf{R}_k. \mathbf{R} is the phenotypic covariance matrix between the residuals of the traits:

\mathbf{R} = \left[  \begin{array}{ccc}  \mathsf{Var}(\upsilon_1) & \cdots & \mathsf{Cov}(\upsilon_1,\upsilon_T) \\  \vdots & \ddots & \vdots \\  \mathsf{Cov}(\upsilon_T,\upsilon_1) & \cdots & \mathsf{Var}(\upsilon_T)  \end{array}\right]

whereas each \mathbf{R}_k are the share of these covariances attributable to the k-th component:

\mathbf{R}_k = \left[  \begin{array}{ccccc}  \mathsf{Var}_k(\upsilon_1) & \cdots & \mathsf{Cov}_k(\upsilon_1,\upsilon_T) \\  \vdots & \ddots & \vdots \\  \mathsf{Cov}_k(\upsilon_T,\upsilon_1) & \cdots & \mathsf{Var}_k(\upsilon_T)  \end{array}\right]

\mathbf{R} can be converted to a between-trait phenotypic correlation matrix \mathbf{\mathring{R}} as:

\mathbf{\mathring{R}} = \left[  \begin{array}{ccc}  \frac{\mathsf{Var}(\upsilon_1)}{\mathsf{Var}(\upsilon_1)} & \cdots &  \frac{\mathsf{Cov}(\upsilon_1,\upsilon_T)}{\left(\mathsf{Var}(\upsilon_1)\mathsf{Var}(\upsilon_T)\right)^{1/2}} \\  \vdots & \ddots & \vdots \\  \frac{\mathsf{Cov}(\upsilon_1,\upsilon_T)}{\left(\mathsf{Var}(\upsilon_1)\mathsf{Var}(\upsilon_T)\right)^{1/2}} & \cdots &  \frac{\mathsf{Var}(\upsilon_T)}{\mathsf{Var}(\upsilon_T)}  \end{array}\right]

The above phenotypic correlation matrix has unit diagonal and can still be fractioned into their K components:

\mathbf{\mathring{R}}_k = \left[  \begin{array}{ccc}  \frac{\mathsf{Var}_k(\upsilon_1)}{\mathsf{Var}(\upsilon_1)} & \cdots &  \frac{\mathsf{Cov}_k(\upsilon_1,\upsilon_T)}{\left(\mathsf{Var}(\upsilon_1)\mathsf{Var}(\upsilon_T)\right)^{1/2}} \\  \vdots & \ddots & \vdots \\  \frac{\mathsf{Cov}_k(\upsilon_T,\upsilon_1)}{\left(\mathsf{Var}(\upsilon_T)\mathsf{Var}(\upsilon_1)\right)^{1/2}} & \cdots &  \frac{\mathsf{Var}_k(\upsilon_T)}{\mathsf{Var}(\upsilon_T)}  \end{array}\right]

The relationship \mathbf{\mathring{R}} = \sum_k \mathbf{\mathring{R}}_k holds. The diagonal elements of \mathbf{\mathring{R}}_k may receive particular names, e.g., heritability, environmentability, dominance effects, shared enviromental effects, etc., depending on what is modelled in the corresponding \mathbf{S}_k. However, the off-diagonal elements of \mathbf{\mathring{R}}_k are not the \rho_k that correspond, e.g. to the genetic or environmental correlation. These off-diagonal elements are instead the signed \text{\textsc{erv}} when \mathbf{S}_k=2\cdot\boldsymbol{\Phi}, or their \text{\textsc{erv}}_k-equivalent for other variance components (see below). In this particular case, they can also be called “bivariate heritabilities” (Falconer and MacKay, 1996). A matrix \mathbf{\breve{R}}_k that contains these correlations \rho_k, which are the fraction of the variance attributable to the k-th component that is shared between pairs of traits is given by:

\mathbf{\breve{R}}_k = \left[  \begin{array}{ccc}  \frac{\mathsf{Var}_k(\upsilon_1)}{\mathsf{Var}_k(\upsilon_1)} & \cdots &  \frac{\mathsf{Cov}_k(\upsilon_1,\upsilon_T)}{\left(\mathsf{Var}_k(\upsilon_1)\mathsf{Var}_k(\upsilon_T)\right)^{1/2}} \\  \vdots & \ddots & \vdots \\  \frac{\mathsf{Cov}_k(\upsilon_T,\upsilon_1)}{\left(\mathsf{Var}_k(\upsilon_T)\mathsf{Var}_k(\upsilon_1)\right)^{1/2}} & \cdots &  \frac{\mathsf{Var}_k(\upsilon_T)}{\mathsf{Var}_k(\upsilon_T)}  \end{array}\right]

As for the phenotypic correlation matrix, each \mathbf{\breve{R}}_k has unit diagonal.

The most common case

A particular case is when \mathbf{S}_1 = 2\cdot\boldsymbol{\Phi}, the coefficient of familial relationship between subjects, and \mathbf{S}_2 = \mathbf{I}_{S \times S}. In this case, the T diagonal elements of \mathbf{\mathring{R}}_1 represent the heritability (h_t^2) for each trait t. The diagonal of \mathbf{\mathring{R}}_2 contains 1-h_t^2, the environmentability. The off-diagonal elements of \mathbf{\mathring{R}}_1 contain the signed \text{\textsc{erv}} (see below). The genetic correlations, \rho_g are the off-diagonal elements of \mathbf{\breve{R}}_1, whereas the off-diagonal elements of \mathbf{\breve{R}}_2 are \rho_e, the environmental correlations between traits. In this particular case, the components of \mathbf{R}, i.e., \mathbf{R}_k are equivalent to \mathbf{G} and \mathbf{E} covariance matrices as in Almasy et al (1997).

Relationship with the ERV

To see how the off-diagonal elements of \mathbf{\mathring{R}}_k are the signed Endophenotypic Ranking Values for each of the k-th variance component, \text{\textsc{erv}}_k (Glahn et al, 2011), note that for a pair of traits i and j:

\mathring{R}_{kij} = \frac{\mathsf{Cov}_k(\upsilon_i,\upsilon_j)}{\left(\mathsf{Var}(\upsilon_i)\mathsf{Var}(\upsilon_j)\right)^{1/2}}

Multiplying both numerator and denominator by \left(\mathsf{Var}_k(\upsilon_i)\mathsf{Var}_k(\upsilon_j)\right)^{1/2} and rearranging the terms gives:

\mathring{R}_{kij} = \frac{\mathsf{Cov}_k(\upsilon_i,\upsilon_j)}{\left(\mathsf{Var}_k(\upsilon_i)\mathsf{Var}_k(\upsilon_j)\right)^{1/2}}  \left(\frac{\mathsf{Var}_k(\upsilon_i)}{\mathsf{Var}(\upsilon_i)}\frac{\mathsf{Var}_k(\upsilon_j)}{\mathsf{Var}(\upsilon_j)}\right)^{1/2}

When \mathbf{S}_k = 2\cdot\boldsymbol{\Phi}, the above reduces to \mathring{R}_{kij} = \rho_k \sqrt{h^2_i h^2_j}, which is the signed version of \text{\textsc{erv}}=\left|\rho_g\sqrt{h_i^2h_j^2}\right| when k is the genetic component.


\mathbf{R} and \mathbf{R}_k are covariance matrices and so, are positive-definite, whereas the correlation matrices \mathbf{\mathring{R}}, \mathbf{\mathring{R}}_k and \mathbf{\breve{R}}_k are positive-semidefinite. A hybrid matrix that does not have to be positive-definite or semidefinite is:

\mathbf{\check{R}}_k = \mathbf{I} \odot \mathbf{\mathring{R}}_k + \left(\mathbf{J}-\mathbf{I}\right) \odot \mathbf{\breve{R}}_k

where \mathbf{J} is a matrix of ones, \mathbf{I} is the identity, both of size T \times T, and \odot is the Hadamard product. An example of such matrix of practical use is to show concisely the heritabilities for each trait in the diagonal and the genetic correlations in the off-diagonal.


Algorithmic advantages can be obtained from the positive-definiteness of \mathbf{\mathring{R}}_k. The Cauchy–Schwarz theorem (Cauchy, 1821; Schwarz, 1888) states that:

\mathring{R}_{kij} \leqslant \sqrt{\mathring{R}_{kii}\mathring{R}_{kjj}}

Hence, the bounds for the off-diagonal elements can be known from the diagonal elements, which, by their turn, are estimated in a simpler, univariate model.

The Cauchy-Schwarz inequality imposes limits on the off-diagonal values of the matrix that contains the genetic covariances (or bivariate heritabilities).

Parameter estimation

Under the multivariate normal assumption, the parameters can be estimated maximising the following loglikelihood function:

\mathcal{L}\left(\mathbf{R}_k,\boldsymbol{\beta}\Big|\mathbf{y},\mathbf{\tilde{X}}\right) = -\frac{1}{2} \left(N \ln 2\pi + \ln \left|\boldsymbol{\Omega}\right| + \left(\mathbf{y}-\mathbf{\tilde{X}}\boldsymbol{\beta}\right)'\boldsymbol{\Omega}\left(\mathbf{y}-\mathbf{\tilde{X}}\boldsymbol{\beta}\right)\right)

where N=S \cdot T is the number of observations on the stacked vector \mathbf{y}. Unbiased estimates for \boldsymbol{\beta}, although inefficient and inappropriate for hypothesis testing, can be obtained with ordinary least squares (OLS).

Parametric inference

Hypothesis testing can be performed with the likelihood ratio test (LRT), i.e., the test statistic is produced by subtracting from the loglikelihood of the model in which all the parameters are free to vary (\mathcal{L}_1), the loglikelihood of a model in which the parameters being tested are constrained to zero, the null model (\mathcal{L}_0). The statistic is given by \lambda = 2\left(\mathcal{L}_1-\mathcal{L}_0\right) (Wilks, 1938), which here is asymptotically distributed as a 50:50 mixture of a \chi^2_0 and \chi^2_{\text{df}} distributions, where df is the number of parameters being tested and free to vary in the unconstrained model (Self and Liang, 1987). From this distribution the p-values can be obtained.


The photograph at the top (elephants) is by Anja Osenberg and was generously released into public domain.

Understanding the kinship matrix

Coefficients to assess the genetic resemblance between individuals were presented in the last post. Among these, the coefficient of kinship, \phi, is probably the most interesting. It gives a probabilistic estimate that a random gene from a given subject i is identical by descent (ibd) to a gene in the same locus from a subject j. For N subjects, these probabilities can be assembled in a N \times N matrix termed kinship matrix, usually represented as \mathbf{\Phi}, that has elements \phi_{ij}, and that can be used to model the covariance between individuals in quantitative genetics.

Consider the pedigree in the figure below, consisted of 14 subjects:

The corresponding kinship matrix, already multiplied by two to indicate expected covariances between subjects (i.e., 2\cdot\mathbf{\Phi}), is:

Note that the diagonal elements can have values above unity, given the consanguineous mating in the family (between s09 and s12, indicated by the double line in the pedigree diagram).

In the next post, details on how the kinship matrix can be used investigate heritabilities, genetic correlations, and to perform association studies will be presented.

Genetic resemblance between relatives

How similar?

The degree of relationship between two related individuals can be estimated by the probability that a gene in one subject is identical by descent to the corresponding gene (i.e., in the same locus) in the other. Two genes are said to be identical by descent (ibd) if both are copies of the same ancestral gene. Genes that are not ibd may still be identical through separate mutations, and be therefore identical by state (ibs), though these will not be considered in what follows.

The coefficients below were introduced by Jacquard in 1970, in a book originally published in French, and translated to English in 1974. A similar content appeared in an article by the same author in the journal Biometrics in 1972 (see the references at the end).

Coefficients of identity

Consider a particular autosomal gene G. Each individual has two copies, one from paternal, another from maternal origin; these can be indicated as G_i^P and G_i^M for individual i. There are 15 exactly distinct ways (states) in which the G can be identical or not identical between two individuals, as shown in the figure below.

To each of these states S_{1, \ldots , 15}, a respective probability \delta_{1, \ldots , 15} can be assigned; these are called coefficients of identity by descent. These probabilities can be calculated at every generation following very elementary rules. For most problems, however, the distinction between paternal and maternal origin of a gene is irrelevant, and some of the above states are equivalent to others. If these are condensed, we can retain 9 distinct ways, shown in the figure below:

As before, to each of these states \Sigma_{1, \ldots , 9}, a respective probability \Delta_{1, \ldots , 9} can be assigned; these are called condensed coefficients of identity by descent, and relate to the former as:

  • \Delta_1 = \delta_1
  • \Delta_2 = \delta_6
  • \Delta_3 = \delta_2 + \delta_3
  • \Delta_4 = \delta_7
  • \Delta_5 = \delta_4 + \delta_5
  • \Delta_6 = \delta_8
  • \Delta_7 = \delta_9 + \delta_{12}
  • \Delta_8 = \delta_{10} + \delta_{11} + \delta_{13} + \delta_{14}
  • \Delta_9 = \delta_{15}

A similar method was proposed by Cotterman (1940), in his highly influential but only much later published doctoral thesis. The \Delta_9, \Delta_8 and \Delta_7 correspond to his coefficients k_0, k_1 and k_2.

Coefficient of kinship

The above refer to probabilities of finding particular genes as identical among subjects. However, a different coefficient can be defined for random genes: the probability that a random gene from subject i is identical with a gene at the same locus from subject j is the coefficient of kinship, and can be represented as \phi_{ij}:

  • \phi_{ij} = \Delta_1 + \frac{1}{2}(\Delta_3 + \Delta_5 + \Delta_7) + \frac{1}{4}\Delta_8

If i and j are in fact the same individual, then \phi_{ii} is the kinship of a subject with himself. Two genes taken from the same individual can either be the same gene (probability \frac{1}{2} of being the same) or be the genes inherited from father and mother, in which case the probability is given by the coefficient of kinship between the parents. In other words, \phi_{ii} = \frac{1}{2} + \frac{1}{2}\phi_{\text{FM}}. If both parents are unrelated, \phi_{\text{FM}}=0, such that the kinship of a subject with himself is \phi_{ii} = \frac{1}{2}.

The value of \phi_{ij} can be determined from the number of generations up to a common ancestor k. A random gene from individual i can be identical to a random gene from individual j in the same locus if both comes from the common ancestor k, an event that can happen if either (1) both are copies of the gene in k, or (2) if they are copies of different genes in k, but k is inbred; this has probability \frac{1}{2}f_k (see below about the coefficient of inbreeding, f). Thus, if there are m generations between i and k, and n generations between j and k, the coefficient of kinship can be computed as \phi_{ij} = \left(\frac{1}{2}\right)^{m+n+1}(1+f_k). If i and j can have more than one common ancestor, then there are more than one line of descent possible, and the kinship is determined by integrating over all such possible K common ancestors:

  • \phi_{ij} = \sum_{k=1}^K \left(\frac{1}{2}\right)^{m_k+n_k+1}(1+f_k)

For a set of subjects, the pairwise coefficients of kinship \phi_{ij} can be arranged in a square matrix \boldsymbol{\Phi}, and used to model the covariance between subjects as 2\cdot\boldsymbol{\Phi} (see here).

Coefficient of inbreeding

The coefficient of inbreeding f of a given subject i is the coefficient of kinship between their parents. While the above coefficients provide information about pairs of individuals, the coefficient of inbreeding gives information about a particular subject. Yet, f_i can be computed from the coefficients of identity:

  • f_{i} = \Delta_1 + \Delta_2 + \Delta_3 + \Delta_4
  • f_{j} = \Delta_1 + \Delta_2 + \Delta_5 + \Delta_6

Note that all these coefficients are based on probabilities, but it is now possible to identify the actual presence of a particular gene using marker data. Also note that while the illustrations above suggest application to livestock, the same applies to studies of human populations.

Some particular cases

The computation of the above coefficients can be done using algorithms, and are done automatically by software that allow analyses of pedigree data, such as solar. Some common particular cases are shown below:

Relationship \Delta_1 \Delta_2 \Delta_3 \Delta_4 \Delta_5 \Delta_6 \Delta_7 \Delta_8 \Delta_9 \phi_{ij}
Self 0 0 0 0 0 0 1 0 0 \frac{1}{2}
Parent-offspring 0 0 0 0 0 0 0 1 0 \frac{1}{4}
Half sibs 0 0 0 0 0 0 0 \frac{1}{2} \frac{1}{2} \frac{1}{8}
Full sibs/dizygotic twins 0 0 0 0 0 0 \frac{1}{4} \frac{1}{2} \frac{1}{4} \frac{1}{4}
Monozygotic twins 0 0 0 0 0 0 1 0 0 \frac{1}{2}
First cousins 0 0 0 0 0 0 0 \frac{1}{4} \frac{3}{4} \frac{1}{16}
Double first cousins 0 0 0 0 0 0 \frac{1}{16} \frac{6}{16} \frac{9}{16} \frac{1}{8}
Second cousins 0 0 0 0 0 0 0 \frac{1}{16} \frac{15}{16} \frac{1}{64}
Uncle-nephew 0 0 0 0 0 0 0 \frac{1}{2} \frac{1}{2} \frac{1}{8}
Offspring of sib-matings \frac{1}{16} \frac{1}{32} \frac{1}{8} \frac{1}{32} \frac{1}{8} \frac{1}{32} \frac{7}{32} \frac{5}{16} \frac{1}{16} \frac{3}{8}


  • Cotterman C. A calculus for statistico-genetics. 1940. PhD Thesis. Ohio State University.
  • Jacquard, A. Structures génétiques des populations. Masson, Paris, France, 1970, later translated and republished as Jacquard, A. The genetic structure of populations. Springer, Heidelberg, 1974.
  • Jacquard A. Genetic information given by a relative. Biometrics. 1972;28(4):1101-1114.

The photograph at the top (sheep) is in public domain.

Inverse normal transformation in SOLAR

SOLAR software can, at the discretion of the user, apply a rank-based inverse-normal transformation to the data using the command inormal. This transformation is the one suggested by Van der Waerden (1952) and is given by:

\tilde y_i = \Phi^{-1}\left\{\dfrac{r_i}{n+1}\right\}

where \tilde y_i is the transformed value for observation i, \Phi^{-1}\left\{\cdot\right\} is the probit function, r_i is the ordinary rank of the i-th case among n observations.

This transformation is a particular case of the family of transformations discussed in the paper by Beasley et al. (2009). The family can be represented as:

\tilde y_i = \Phi^{-1}\left\{\dfrac{r_i+c}{n-2c+1}\right\}

where c is a constant and the remaining variables are as above. The value of c varies for different proposed methods. Blom (1958) suggests c=3/8, Tukey (1962) suggests c=1/3, Bliss (1967) suggests c=1/2 and, as just decribed, Van der Waerden suggests c=0.

Interesting enough, the Q-Q plots produced by Octave use the Bliss (1967) transformation.

An Octave/matlab function to perform these transformations in arbitrary data is here: inormal.m (note that this function does not require or use SOLAR).


Version history

  • 23.Jul.2011: First public release.
  • 19.Jun.2014: Added ability to deal with ties, as well as NaNs in the data.