Gametic genealogy

A gametic genealogy is a convenient mathematical formalization of the genealogy of a population from the perspective of gametes. Mathematically, it is a quadruple $(\mathsf{Gam}, \mathsf{Mate}, \mathsf{Par}, \mathsf{Fert})$ with components

  • $\mathsf{Gam}$, the set of underlying gametes,

  • $\mathsf{Mate}$, the set of zygotes formed by the fusion of egg gametes and sperm gametes,

  • $\mathsf{Par}$, a mapping from child gametes to parent zygotes, and

  • $\mathsf{Fert}$, a mapping from zygotes to fertilization time.

For convenience, given a gametic genealogy,

  • $\mathsf{Gam}_0$ denotes the set of egg gametes,

  • $\mathsf{Gam}_1$ denotes the set of sperm gametes, and

  • $\mathsf{Mate}_*$ denotes the mapping from gametes to the zygotes they formed during fertilization.

Formally, a gametic genealogy must satisfy the following conditions.

  1. $\mathsf{Gam}_0 \cup \mathsf{Gam}_1 = \mathsf{Gam}$ and $\mathsf{Gam}_0 \cap \mathsf{Gam}_1 = \emptyset$.

  2. $\mathsf{Mate}\subset \mathsf{Gam}_0 \times \mathsf{Gam}_1$ and forms a one-to-one mapping between $\mathsf{Gam}_0$ and $\mathsf{Gam}_1$.

  3. $\mathsf{Par}$ is a function $C \mapsto \mathsf{Mate}$, where $C$ is a subset of $\mathsf{Gam}$ representing child gametes.

  4. $\mathsf{Fert}$ is a function $\mathsf{Mate}\mapsto \mathbb{R}$ such that for all child gametes $g \in \operatorname{dom}\mathsf{Par}$,
     $$\mathsf{Fert}(\mathsf{Mate}_*(g)) > \mathsf{Fert}(\mathsf{Par}(g)) \text{ .}$$
     Here $\operatorname{dom}\mathsf{Par}$ denotes the domain of $\mathsf{Par}$, that is, the set of child gametes.
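For finite sets, the four conditions can be checked mechanically. The following is a minimal sketch, not part of the formalism itself: the function name `is_gametic_genealogy`, the representation of zygotes as `(egg, sperm)` tuples, and the use of dictionaries for $\mathsf{Par}$ and $\mathsf{Fert}$ are all assumptions made for illustration.

```python
def is_gametic_genealogy(gam0, gam1, mate, par, fert):
    """Check conditions 1-4 on finite sets (Gam is taken to be gam0 | gam1)."""
    # Condition 1: eggs and sperm are disjoint.
    if gam0 & gam1:
        return False
    # Condition 2: Mate is a subset of Gam0 x Gam1 pairing every egg with
    # exactly one sperm and vice versa.
    if not all(e in gam0 and s in gam1 for e, s in mate):
        return False
    if not (len(mate) == len(gam0) == len(gam1)
            and {e for e, _ in mate} == gam0
            and {s for _, s in mate} == gam1):
        return False
    # Mate_*: the zygote each gamete went on to form.
    mate_star = {g: z for z in mate for g in z}
    # Condition 3: Par maps child gametes (a subset of Gam) to zygotes.
    if not all(g in mate_star and z in mate for g, z in par.items()):
        return False
    # Condition 4: Fert is defined on every zygote, and each child gamete
    # takes part in a fertilization strictly later than its parent's.
    if set(fert) != set(mate):
        return False
    return all(fert[mate_star[g]] > fert[par[g]] for g in par)


# Hypothetical two-generation example: zygote z0 produces both gametes of z1.
z0 = ("e0", "s0")
z1 = ("e1", "s1")
valid = is_gametic_genealogy(
    {"e0", "e1"}, {"s0", "s1"}, {z0, z1},
    par={"e1": z0, "s1": z0}, fert={z0: 0.0, z1: 1.0},
)
# Same genealogy, but the child zygote is not fertilized later than its parent.
invalid = is_gametic_genealogy(
    {"e0", "e1"}, {"s0", "s1"}, {z0, z1},
    par={"e1": z0, "s1": z0}, fert={z0: 1.0, z1: 1.0},
)
```

In this toy example `valid` is true and `invalid` is false, because the second call violates the strict inequality in condition 4.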

Gametic lineage space

A gametic lineage space is a mathematical formalism representing the lines of transmission of genetic information via gametes of a population over time. It is a triple $(\mathsf{Loc}, G, \mathsf{Lin})$ where

  • $\mathsf{Loc}$ is the set of all genomic locations,

  • $G$ is a gametic genealogy $(\mathsf{Gam}, \mathsf{Mate}, \mathsf{Par}, \mathsf{Fert})$, and

  • $\mathsf{Lin}$ is a function $\mathsf{Loc}\times \mathsf{Gam}\mapsto 2^\mathsf{Gam}$ mapping a genomic position in a gamete to the set of gametes that transmitted genetic information to that position in that gamete.

For every location $\ell \in \mathsf{Loc}$ and gamete $g \in \mathsf{Gam}$, $\mathsf{Lin}(\ell, g)$ is the lineage ending at gamete $g$ via locus $\ell$, and it must satisfy the condition
$$\mathsf{Lin}(\ell, g) = \{g\} \cup \mathsf{Lin}(\ell, \mathsf{Par}(g)_i) \text{ for either $i=0$ or $i=1$}$$
when $g \in \operatorname{dom}\mathsf{Par}$; otherwise $\mathsf{Lin}(\ell, g) = \{g\}$.

$\mathsf{Par}(g)_0$ and $\mathsf{Par}(g)_1$ are the maternal and paternal gametes, respectively, whose fusion formed the parent zygote of $g$.
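The recursive condition on $\mathsf{Lin}$ can be unrolled into a loop that walks from a gamete back to a founder. This is a minimal sketch under stated assumptions: the table `choice`, which records for each locus and child gamete whether the maternal ($i=0$) or paternal ($i=1$) gamete transmitted the material, is hypothetical and not part of the definition above, as are the function name `lineage` and the dictionary representation of $\mathsf{Par}$.

```python
def lineage(loc, g, par, choice):
    """Return Lin(loc, g) as a set, following one parental gamete per step."""
    lin = {g}
    # Each step moves to a gamete that took part in a strictly earlier
    # fertilization (condition 4 of a gametic genealogy), so for a finite
    # genealogy the walk ends at a gamete outside dom Par.
    while g in par:
        g = par[g][choice[(loc, g)]]
        lin.add(g)
    return lin


# Hypothetical example: gametes e1 and s1 are both produced by zygote (e0, s0).
par = {"e1": ("e0", "s0"), "s1": ("e0", "s0")}
choice = {
    (0, "e1"): 0, (1, "e1"): 1,  # e1 recombines between locus 0 and locus 1
    (0, "s1"): 1, (1, "s1"): 1,  # s1 is copied entirely from s0
}
lin_0_e1 = lineage(0, "e1", par, choice)  # {"e1", "e0"}
lin_1_e1 = lineage(1, "e1", par, choice)  # {"e1", "s0"}
```

Founder gametes, which are not in $\operatorname{dom}\mathsf{Par}$, yield the singleton lineage, matching the base case of the condition.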

Stochastic gametic lineage space

A stochastic gametic lineage space is a gametic lineage space extended to model a random gametic lineage.

From a stochastic gametic lineage space, a time-indexed family of probability distributions $\{ P_t \}_{t \in I}$ is induced.

TO DO: Need to rework space formalism to clarify over what is the $\sigma$-algebra: issues #42.

For convenience we define the set of all lineages that contain a gamete $g \in \mathsf{Gam}$ as
$$B(g) := \{ \mathsf{Lin}(\ell, s) : g \in \mathsf{Lin}(\ell,s), \ell \in \mathsf{Loc}, s \in \mathsf{Gam}\} \text{ .}$$
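For a finite lineage table, $B(g)$ can be computed by filtering the range of $\mathsf{Lin}$. The sketch below assumes $\mathsf{Lin}$ is stored as a dictionary keyed by `(location, gamete)` pairs; the name `lineages_containing` and the example table are illustrative only.

```python
def lineages_containing(g, lin):
    """Return B(g): every lineage in the range of Lin that contains g."""
    # frozenset makes each lineage hashable, so identical lineages reached
    # from different (location, gamete) pairs collapse to one element.
    return {frozenset(lineage) for lineage in lin.values() if g in lineage}


# Hypothetical lineage table over locations {0, 1} and four gametes.
lin = {
    (0, "e0"): {"e0"}, (1, "e0"): {"e0"},
    (0, "s0"): {"s0"}, (1, "s0"): {"s0"},
    (0, "e1"): {"e1", "e0"}, (1, "e1"): {"e1", "s0"},
    (0, "s1"): {"s1", "s0"}, (1, "s1"): {"s1", "s0"},
}
b_e0 = lineages_containing("e0", lin)
```

Here `b_e0` holds two lineages: the singleton `{"e0"}` and the lineage `{"e0", "e1"}` that reaches `e1` through location 0.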

Formally, a stochastic gametic lineage space is a quintuple $(G, I, \{ S_t \}_{t \in I}, \mathcal{F}, \mu)$ where

  • $G$ is a gametic lineage space $(\mathsf{Loc}, (\mathsf{Gam}, \mathsf{Mate}, \mathsf{Par}, \mathsf{Fert}), \mathsf{Lin})$,

  • $I$ is an index set of points in time with $\operatorname{rng}\mathsf{Fert}\subset I$,

  • $\{S_t\}_{t \in I}$ is a time-indexed collection of sets of living zygotes,

  • $\mathcal{F}$ is a $\sigma$-algebra ($\sigma$-field) over $\operatorname{rng}\mathsf{Lin}$, and

  • $\mu$ is a measure on $\mathcal{F}$

which satisfy the following conditions

  • $B(g) \in \mathcal{F}$ for all $g \in \mathsf{Gam}$, and

  • $\mu(S_t)$ is defined and finite for all $t \in I$.

Every stochastic gametic lineage space induces a time-indexed family $\{P_t\}_{t \in I}$ of probability spaces on the $\sigma$-algebra $\mathcal{F}$. Each $P_t$ defines the probability of lineages that end in a zygote alive at time $t$.

TO DO: Need to clarify relationship between $\mathcal{F}$ and $\mathsf{Loc}$ and $\mathsf{Mate}$ for when they are uncountable.

An embedded Ancestral Recombination Graph

An ancestral recombination graph [1] [2] [3] of a sampled population is embedded in a gametic lineage space. We formally show the exact embedding using the gARG formalism [4].

We start by defining the genetic legacy of a gamete $g \in \mathsf{Gam}$ for sample population $S \subseteq \mathsf{Gam}$ to be
$$\mathsf{Leg}(g, S) := \{ (\ell, d) \in \mathsf{Loc}\times S : g \in \mathsf{Lin}(\ell, d) \} \text{ .}$$
This genetic legacy is the genetic material that survives in the sample population $S$ originally copied from ancestral gamete $g$ (with or without mutations).

Genetic legacy for a sample population $S$ induces the following equivalence relation over pairs of gametes $g_1$ and $g_2$ in $\mathsf{Gam}$:
$$g_1 \simeq_S g_2 \ := \ \mathsf{Leg}(g_1, S) = \mathsf{Leg}(g_2, S) \text{ .}$$
We denote the resulting equivalence class containing $g \in \mathsf{Gam}$ as
$${[g]}_S \ := \ \{ g' : \mathsf{Leg}(g', S) = \mathsf{Leg}(g, S) \} \text{ .}$$

Under this equivalence relation, gametes are considered equivalent if they have the same genetic legacy for the sample population $S$.
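To illustrate how genetic legacy groups gametes, the following sketch computes $\mathsf{Leg}$ and the resulting equivalence classes for a small finite example. The gamete names, the lineage table `LIN`, and the function names are all hypothetical; the example is chosen so that an ancestor (`ea`) and its non-recombinant descendant (`e0`) share the same legacy.

```python
from itertools import product

LOC = {0, 1}
GAM = {"ea", "sa", "e0", "s0", "e1"}

# Hypothetical lineage table: ea -> e0 at both loci (no recombination),
# then e1 inherits locus 0 from e0 and locus 1 from s0.
LIN = {(l, g): {g} for l in LOC for g in ("ea", "sa", "s0")}
LIN.update({
    (0, "e0"): {"e0", "ea"}, (1, "e0"): {"e0", "ea"},
    (0, "e1"): {"e1", "e0", "ea"}, (1, "e1"): {"e1", "s0"},
})


def leg(g, sample):
    """Leg(g, S): the (location, sampled gamete) pairs whose lineage contains g."""
    return frozenset((l, d) for l, d in product(LOC, sample) if g in LIN[(l, d)])


def equivalence_classes(sample):
    """Partition GAM into classes of gametes with identical legacy."""
    classes = {}
    for g in GAM:
        classes.setdefault(leg(g, sample), set()).add(g)
    return {frozenset(c) for c in classes.values()}


classes = equivalence_classes({"e1"})
```

With sample `{"e1"}`, the gametes `ea` and `e0` both have legacy `{(0, "e1")}` and so fall into one class, while `sa` has an empty legacy and forms a class of its own.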

A convenient choice for an embedded gARG [4] is to set the gARG nodes (vertices) to be the equivalence classes:
$$\mathsf{Nodes}(S) := \{ {[g]}_S : g \in \mathsf{Gam}\} \text{ .}$$

The (unannotated) graph edges of the gARG are chosen as child-parent node pairs $(C, P) \in \mathsf{Nodes}(S)^2$ where
$$\mathsf{Par}(g)_i \in P \text{ for some $g \in C$ and some $i \in \{0, 1\}$ .}$$

In the gARG, annotations are added for each graph edge (pair of child and parent nodes). This annotation is the set of locations through which genetic information has been copied from parent to child. In the following interpretation, the only locations of interest are those for which genetic information has been transmitted into the sample population $S$. With this interpretation, the annotation for edge $(C,P)$ is
$$\{ \ell \in \mathsf{Loc}: C \cup P \subseteq \mathsf{Lin}(\ell, g) \text{ for some $g \in S$} \} \text{ .}$$
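The node, edge, and annotation definitions can be applied directly to a finite example. The sketch below is only an illustration of the definitions as stated, not the gARG implementation of [4]; the tables `PAR` and `LIN`, the sample `S`, and all function names are assumptions.

```python
from itertools import product

LOC = {0, 1}
GAM = {"e0", "s0", "e1", "s1"}
S = {"e1", "s1"}                                    # sampled gametes
PAR = {"e1": ("e0", "s0"), "s1": ("e0", "s0")}      # child -> (egg, sperm)
LIN = {
    (0, "e0"): {"e0"}, (1, "e0"): {"e0"},
    (0, "s0"): {"s0"}, (1, "s0"): {"s0"},
    (0, "e1"): {"e1", "e0"}, (1, "e1"): {"e1", "s0"},
    (0, "s1"): {"s1", "s0"}, (1, "s1"): {"s1", "s0"},
}


def leg(g, sample):
    """Leg(g, S) as a frozenset of (location, sampled gamete) pairs."""
    return frozenset((l, d) for l, d in product(LOC, sample) if g in LIN[(l, d)])


def node_of(g, sample):
    """The equivalence class [g]_S, used as a gARG node."""
    return frozenset(h for h in GAM if leg(h, sample) == leg(g, sample))


def edges(sample):
    """All (child node, parent node) pairs linked by Par(g)_0 or Par(g)_1."""
    return {(node_of(g, sample), node_of(p, sample))
            for g, zygote in PAR.items() for p in zygote}


def annotation(child, parent, sample):
    """Locations at which some sampled lineage passes through both nodes."""
    return {l for l in LOC
            if any(child | parent <= LIN[(l, g)] for g in sample)}


E = edges(S)
ann = {(tuple(sorted(c)), tuple(sorted(p))): annotation(c, p, S) for c, p in E}
```

In this example every class is a singleton, so there are four edges. The edge from `{"s1"}` to `{"e0"}` receives an empty annotation, since `s1` inherits nothing from `e0` at either location.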

Acknowledgements

Thanks to Daria Shipilina and Nick Barton for sharing their preprint [5] and discussing the conjecture in edition 0.1 of this document relating to their preprint.

Changes from edition 0.1

  • added section about embedded ARG

  • removed conjecture relating to [5]

References

1. Griffiths RC, Marjoram P. An Ancestral Recombination Graph. In: Friedman A, Miller W, Donnelly P, Tavaré S, editors. Progress in Population Genetics and Human Evolution. New York, NY: Springer New York; 1997. pp. 257–270. doi:10.1007/978-1-4757-2609-1_16
2. Hein J, Schierup MH, Wiuf C. Gene genealogies, variation and evolution: A primer in coalescent theory. Oxford; New York: Oxford University Press; 2005.
3. Wakeley J. Coalescent theory: An introduction. Greenwood Village, Colo: Roberts & Co. Publishers; 2009.
4. Wong Y, Ignatieva A, Koskela J, Gorjanc G, Wohns AW, Kelleher J. A general and efficient representation of ancestral recombination graphs. https://archive.softwareheritage.org/swh:1:rev:7df4f1995028cc676a6c1b231e8d7a024666b5fc; 2022.
5. Shipilina D, Stankowski S, Pal A, Chan YF, Barton N. On the origin and structure of haplotype blocks. Preprints; 2022 Feb. doi:10.22541/au.164425910.09070763/v1