Notation

  • "$\operatorname{dom} f$" denotes the domain of function $f$

  • "$⟦ x \in S ⟧$" is Iverson bracket notation for the indicator function of $S$, namely $$⟦ x \in S ⟧ = \begin{cases} 1 & \text{if } x \in S \\ 0 & \text{otherwise.} \end{cases}$$

Fertilization Function

A haploid lineage process is formally defined in terms of a given fixed fertilization function, which we denote with the symbol $\mathrm{Fert}$. This function maps possible occurrences of fertilization to points in time.

From a given fertilization function $\mathrm{Fert}$, we define three convenient symbols:

  • $\mathrm{Tim}$: the range of $\mathrm{Fert}$ (a set of real numbers representing points in time),

  • $\mathrm{Dip}$: the domain of $\mathrm{Fert}$ (a set of possible diploid organisms), and

  • $\mathrm{Hap}$: the set $\mathrm{Dip}\times \{0, 1\}$ representing fertilizing gametes.

For a diploid $d \in \mathrm{Dip}$, the members $(d, 0)$ and $(d, 1)$ of $\mathrm{Hap}$ index the fertilizing egg and sperm gametes, respectively.

For convenience, we map $\mathrm{Hap}$ to fertilization times with: $\mathrm{Fert}_H((d, s)) := \mathrm{Fert}(d)$ for all $d \in \mathrm{Dip}$ and $s \in \{0, 1\}$.

We denote the following inverse images as: $$\begin{aligned} \mathrm{Dip}_t & := \{ d \in \mathrm{Dip}: \mathrm{Fert}(d) = t \} \\ \mathrm{Hap}_t & := \{ d \in \mathrm{Hap}: \mathrm{Fert}_H(d) = t \} \\ \mathrm{Dip}_{<t} & := \{ d \in \mathrm{Dip}: \mathrm{Fert}(d) < t \} \text{ .} \end{aligned}$$

A technical requirement on any given $\mathrm{Fert}$ in this document is that $\mathrm{Dip}_t$ must be countable for every $t \in \mathrm{Tim}$.

An example of a valid fertilization function is $\mathrm{Fert}: \mathbf{N}^2 \to \mathbf{N}$ where $\mathrm{Fert}((t, n)) = t$.
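As a minimal sketch of the definitions in this section, the example fertilization function can be written out in Python on a finite truncation of $\mathbf{N}^2$. The bounds `T_MAX` and `N_MAX` and all function names are assumptions made for illustration; the definitions themselves do not require finiteness.

```python
from itertools import product

# Hypothetical finite truncation of N^2, for illustration only.
T_MAX, N_MAX = 3, 2
DIP = set(product(range(T_MAX), range(N_MAX)))  # diploids d = (t, n)


def fert(d):
    """Fert((t, n)) = t: the fertilization time of diploid d."""
    t, _n = d
    return t


TIM = {fert(d) for d in DIP}                 # range of Fert
HAP = {(d, s) for d in DIP for s in (0, 1)}  # Dip x {0, 1}


def fert_h(h):
    """Fert_H((d, s)) = Fert(d)."""
    d, _s = h
    return fert(d)


def dip_at(t):
    """Dip_t: diploids fertilized exactly at time t."""
    return {d for d in DIP if fert(d) == t}


def hap_at(t):
    """Hap_t: fertilizing gametes at time t."""
    return {h for h in HAP if fert_h(h) == t}


def dip_before(t):
    """Dip_{<t}: diploids fertilized strictly before time t."""
    return {d for d in DIP if fert(d) < t}
```

In this truncation every $\mathrm{Dip}_t$ is finite, so the countability requirement holds trivially.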

Haploid lineage process

Given

  • a fertilization function $\mathrm{Fert}$,

  • a set $\mathrm{Loc}$ of genomic locations,

  • and a probability space $(\Omega, \mathcal{F}, \operatorname{\mathbb{P}})$,

a haploid lineage process is a stochastic process defined by a time-indexed family of pairs of random variables $$\left\{ (\mathrm{Par}_t, \mathrm{Pat}_t) \right\}_{t \in \mathrm{Tim}} \text{ .}$$

Each $\mathrm{Par}_t$ is a random function from a subset of $\mathrm{Hap}_t$ to $\mathrm{Dip}_{<t}$. $\mathrm{Par}_t(h)$ represents the diploid Parent that produced the haploid gamete $h$.

Each $\mathrm{Pat}_t$ is a random function from the domain of $\mathrm{Par}_t$ to subsets of $\mathrm{Loc}$. $\mathrm{Pat}_t(h)$ represents the genome locations replicated into the gamete $h$ from the Paternal haploid genome of the parent $\mathrm{Par}_t(h)$ (inherited from the father of the parent producing the gamete).

For convenience we define: $$\begin{aligned} \mathrm{Par}(h) & := \mathrm{Par}_t(h) \\ \mathrm{Pat}(h) & := \mathrm{Pat}_t(h) \end{aligned}$$ where $t = \mathrm{Fert}_H(h)$.
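To make the objects $\mathrm{Par}$ and $\mathrm{Pat}$ concrete, the following sketch draws one outcome of a toy haploid lineage process. The sampling distribution is purely an assumption for illustration (uniform choice of parent among all earlier diploids, and a single crossover point); the definition above does not prescribe any particular distribution. The names `sample_outcome`, `GENERATIONS`, `SIZE` and `LOC` are hypothetical.

```python
import random

# Hypothetical finite setting: 3 generations of 2 diploids, 4 locations.
GENERATIONS, SIZE = 3, 2
LOC = range(4)
DIP = [(t, n) for t in range(GENERATIONS) for n in range(SIZE)]
HAP = [(d, s) for d in DIP for s in (0, 1)]


def sample_outcome(rng):
    """Draw one outcome (par, pat) as dicts keyed by haploid h = (d, s)."""
    par, pat = {}, {}
    for h in HAP:
        (t, _n), _s = h
        earlier = [d for d in DIP if d[0] < t]  # Dip_{<t}
        if not earlier:
            continue  # founder gamete: h stays outside dom Par
        par[h] = rng.choice(earlier)
        # Assumed recombination model: a single crossover point.
        cut = rng.randint(0, len(LOC))  # both endpoints are included
        left = frozenset(loc for loc in LOC if loc < cut)
        # Pat(h): locations copied from the parent's paternal haploid genome.
        pat[h] = left if rng.random() < 0.5 else frozenset(LOC) - left
    return par, pat


par, pat = sample_outcome(random.Random(0))
```

Here `par` and `pat` share the same key set, mirroring the requirement that $\mathrm{Pat}_t$ is defined on the domain of $\mathrm{Par}_t$.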

Haploid lineages

A haploid lineage process induces a random function $\mathrm{Lin}$ which maps a genomic location in a descendant haploid genome to a haploid lineage. A haploid lineage consists of all the haploids transmitting genetic information via a genomic location to a descendant haploid. For every genomic location $\ell \in \mathrm{Loc}$ and haploid $h \in \mathrm{Hap}$, its haploid lineage is $$\mathrm{Lin}(\ell, h) := \bigcup_{i} \{ a_i \}$$ where $a_i$ is defined inductively as follows: $a_0 := h$ and for integers $i \ge 0$, $$a_{i+1} := \begin{cases} \big( \mathrm{Par}(a_i) , ⟦ \ell \in \mathrm{Pat}(a_i) ⟧ \big) & \text{if } a_i \in \operatorname{dom}\mathrm{Par}\\ a_i & \text{otherwise.} \end{cases}$$
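The inductive definition can be sketched as a short loop over one fixed outcome. The pedigree below is hand-written and hypothetical (founders A and B, then C, then D), and the loop terminates only because each haploid here has finitely many ancestors.

```python
# Hypothetical outcome: founders A and B at time 0, C at time 1, D at time 2.
# A haploid is (diploid, s) with s = 0 for the egg and s = 1 for the sperm.
PAR = {("C", 0): "A", ("C", 1): "B", ("D", 0): "C", ("D", 1): "B"}
PAT = {("C", 0): {0}, ("C", 1): {0, 1}, ("D", 0): {1}, ("D", 1): set()}


def lineage(loc, h):
    """Lin(loc, h): follow a_0 = h, a_1, a_2, ... until leaving dom Par."""
    lin = {h}
    while h in PAR:
        # Iverson bracket: 1 if loc was copied from the paternal genome.
        h = (PAR[h], int(loc in PAT[h]))
        lin.add(h)
    return lin
```

For example, `lineage(0, ("D", 0))` passes through the egg of C and ends at the sperm gamete of founder A, while `lineage(1, ("D", 0))` passes through the sperm of C and ends at the sperm gamete of founder B.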

An embedded Ancestral Recombination Graph

An ancestral recombination graph [1] [2] [3] of a sampled population is embedded in any outcome of any haploid lineage process. We formally show the exact embedding using the gARG formalism [4].

We start by defining the genetic legacy of an ancestral haploid $h \in \mathrm{Hap}$ for sample population $S \subseteq \mathrm{Hap}$ to be $$\mathrm{Leg}(h, S) := \{ (\ell, d) \in \mathrm{Loc}\times S : h \in \mathrm{Lin}(\ell, d) \} \text{ .}$$

This genetic legacy is the genetic material, originally copied from ancestral haploid $h$ (with or without mutations), that survives in the sample population $S$.
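A minimal sketch of the genetic legacy, computed by brute force over a small hand-written outcome. The pedigree, the two-location `LOC`, and the sample `S` are hypothetical values chosen for illustration.

```python
# Hypothetical outcome: founders A and B, then C, then D (the sampled diploid).
PAR = {("C", 0): "A", ("C", 1): "B", ("D", 0): "C", ("D", 1): "B"}
PAT = {("C", 0): {0}, ("C", 1): {0, 1}, ("D", 0): {1}, ("D", 1): set()}
LOC = (0, 1)
S = [("D", 0), ("D", 1)]  # sample population: both gametes of D


def lineage(loc, h):
    """Lin(loc, h) for the fixed outcome above."""
    lin = {h}
    while h in PAR:
        h = (PAR[h], int(loc in PAT[h]))
        lin.add(h)
    return lin


def legacy(h, sample):
    """Leg(h, S): pairs (loc, d) with d in S whose lineage at loc contains h."""
    return frozenset(
        (loc, d) for loc in LOC for d in sample if h in lineage(loc, d)
    )
```

In this outcome the egg gamete of founder A has an empty legacy, since none of its genetic material reaches the sample.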

Genetic legacy for a sample population $S$ induces the following equivalence relation over pairs of haploids $h_1$ and $h_2$ in $\mathrm{Hap}$: $$h_1 \simeq_S h_2 \ := \ \mathrm{Leg}(h_1, S) = \mathrm{Leg}(h_2, S) \text{ .}$$ We denote the resulting equivalence class containing $h \in \mathrm{Hap}$ as $${[h]}_S \ := \ \{ h' : \mathrm{Leg}(h', S) = \mathrm{Leg}(h, S) \} \text{ .}$$

Under this equivalence relation, haploids are considered equivalent if they have the same genetic legacy for the sample population $S$.

A convenient choice for an embedded gARG [4] is to set the gARG nodes (vertices) to be the equivalence classes: $$\mathrm{Nodes}(S) := \{ {[h]}_S : h \in \mathrm{Hap}\} \text{ .}$$

The (unannotated) graph edges of the gARG are chosen as child-parent node pairs $(C, P) \in \mathrm{Nodes}(S)^2$ where $$(\mathrm{Par}(h), i) \in P \text{ for some $h \in C$ and some $i \in \{0, 1\}$ .}$$

In the gARG, annotations are added for each graph edge (pair of child and parent nodes). This annotation is the set of locations through which genetic information has been copied from parent to child. In the following interpretation, the only locations of interest are those for which genetic information has been transmitted into the sample population $S$. With this interpretation, the annotation for edge $(C,P)$ is $$\{ \ell \in \mathrm{Loc}: C \subseteq \mathrm{Lin}(\ell, h) \text{ and } P \subseteq \mathrm{Lin}(\ell, h) \text{ for some $h \in S$} \} \text{ .}$$
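The construction of nodes, edges and annotations can be sketched by brute force on a small hand-written outcome. The pedigree, `LOC`, `S` and all function names are hypothetical, and the code follows the definitions above literally rather than any published gARG implementation.

```python
from collections import defaultdict

# Hypothetical outcome: founders A and B, then C, then D (the sampled diploid).
PAR = {("C", 0): "A", ("C", 1): "B", ("D", 0): "C", ("D", 1): "B"}
PAT = {("C", 0): {0}, ("C", 1): {0, 1}, ("D", 0): {1}, ("D", 1): set()}
LOC = (0, 1)
HAP = [(d, s) for d in "ABCD" for s in (0, 1)]
S = [("D", 0), ("D", 1)]


def lineage(loc, h):
    """Lin(loc, h) for the fixed outcome above."""
    lin = {h}
    while h in PAR:
        h = (PAR[h], int(loc in PAT[h]))
        lin.add(h)
    return lin


def legacy(h, sample):
    """Leg(h, S) as a hashable frozenset of (location, sample haploid)."""
    return frozenset(
        (loc, d) for loc in LOC for d in sample if h in lineage(loc, d)
    )


def nodes(sample):
    """Map each haploid h to its equivalence class [h]_S."""
    classes = defaultdict(set)
    for h in HAP:
        classes[legacy(h, sample)].add(h)
    return {h: frozenset(m) for m in classes.values() for h in m}


def edges(node_of):
    """Child-parent pairs (C, P) with (Par(h), i) in P for some h in C."""
    return {(node_of[h], node_of[(PAR[h], i)]) for h in PAR for i in (0, 1)}


def annotation(child, parent, sample):
    """Locations at which some sample lineage contains both C and P."""
    return {
        loc for loc in LOC
        if any(child <= lineage(loc, h) and parent <= lineage(loc, h)
               for h in sample)
    }


NODE_OF = nodes(S)
EDGES = edges(NODE_OF)
```

Read literally, the edge condition also admits pairs with $C = P$ (when a gamete passes on exactly the legacy it received) and pairs whose annotation is empty. This sketch keeps both kinds rather than guess at a filtering rule.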

Acknowledgements

Thanks to Daria Shipilina and Nick Barton for sharing their preprint [5] and discussing the conjecture in edition 0.1 of this document relating to their preprint.

Changes from edition 0.1

  • added section about embedded ARG

  • removed conjecture relating to [5]

References

1. Griffiths RC, Marjoram P. An Ancestral Recombination Graph. In: Friedman A, Miller W, Donnelly P, Tavaré S, editors. Progress in Population Genetics and Human Evolution. New York, NY: Springer New York; 1997. pp. 257–270.
2. Hein J, Schierup MH, Wiuf C. Gene genealogies, variation and evolution: A primer in coalescent theory. Oxford; New York: Oxford University Press; 2005.
3. Wakeley J. Coalescent theory: An introduction. Greenwood Village, Colo: Roberts & Co. Publishers; 2009.
4. Wong Y, Ignatieva A, Koskela J, Gorjanc G, Wohns AW, Kelleher J. A general and efficient representation of ancestral recombination graphs. https://archive.softwareheritage.org/swh:1:rev:7df4f1995028cc676a6c1b231e8d7a024666b5fc; 2022.
5. Shipilina D, Stankowski S, Pal A, Chan YF, Barton N. On the origin and structure of haplotype blocks. Preprints; 2022 Feb. doi:10.22541/au.164425910.09070763/v1