Gametic genealogy
A gametic genealogy is a convenient mathematical formalism of the genealogy
of a population from the perspective of gametes. Mathematically, it is a quadruple
(Gam,Mate,Par,Fert)
with components
Gam, the set of underlying gametes,
Mate, the set of zygotes formed by the fusion of egg gametes and sperm gametes,
Par, a mapping from child gametes to parent zygotes, and
Fert, a mapping from zygotes to fertilization time.
For convenience, given a gametic genealogy,
Gam0 denotes the set of egg gametes,
Gam1 denotes the set of sperm gametes, and
Mate∗ denotes the mapping from gametes to the zygotes they formed during
fertilization.
Formally, a gametic genealogy must satisfy the following conditions.
Gam0∪Gam1=Gam and Gam0∩Gam1=∅.
Mate⊂Gam0×Gam1 and
forms a one-to-one mapping between Gam0 and Gam1.
Par is a function C→Mate, where C is a subset of Gam representing
child gametes.
Fert is a function Mate→ℝ such that for all child gametes
g∈domPar,
Fert(Mate∗(g))>Fert(Par(g)).
domPar denotes the domain of Par, that is, the set of child gametes.
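As an illustration only, the conditions above can be checked mechanically on a finite genealogy. The sketch below is not part of the formalism: the class name `GameticGenealogy`, its field names, and the reading of "one-to-one mapping" as a bijection between Gam0 and Gam1 are assumptions made for this example, and gametes are represented by plain strings.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GameticGenealogy:
    # Hypothetical finite encoding of (Gam, Mate, Par, Fert).
    eggs: frozenset   # Gam0
    sperm: frozenset  # Gam1
    mate: frozenset   # Mate, a set of (egg, sperm) pairs (zygotes)
    par: dict         # Par, child gamete -> parent zygote
    fert: dict        # Fert, zygote -> fertilization time

    def mate_star(self):
        """Mate*: map each gamete to the zygote it formed."""
        return {g: z for z in self.mate for g in z}

    def is_valid(self):
        """Check the listed conditions on a finite genealogy."""
        gam = self.eggs | self.sperm
        if self.eggs & self.sperm:
            return False  # Gam0 and Gam1 must be disjoint
        # Assumed reading: Mate pairs every egg with exactly one sperm
        # and every sperm with exactly one egg.
        if {e for e, _ in self.mate} != self.eggs:
            return False
        if {s for _, s in self.mate} != self.sperm:
            return False
        if not len(self.mate) == len(self.eggs) == len(self.sperm):
            return False
        # Par maps a subset of Gam into Mate.
        if not set(self.par) <= gam:
            return False
        if not set(self.par.values()) <= self.mate:
            return False
        # Fert is defined on exactly the zygotes in Mate.
        if set(self.fert) != self.mate:
            return False
        # A child gamete fertilizes strictly after its parent zygote formed.
        ms = self.mate_star()
        return all(self.fert[ms[g]] > self.fert[z] for g, z in self.par.items())


# Toy example: zygote z0 produces egg e1, which later forms zygote z1.
z0, z1 = ("e0", "s0"), ("e1", "s1")
ok = GameticGenealogy(
    eggs=frozenset({"e0", "e1"}),
    sperm=frozenset({"s0", "s1"}),
    mate=frozenset({z0, z1}),
    par={"e1": z0},
    fert={z0: 0.0, z1: 1.0},
)
# Same genealogy with the fertilization times swapped, which breaks
# the ordering condition on Fert.
bad = GameticGenealogy(ok.eggs, ok.sperm, ok.mate, ok.par, {z0: 1.0, z1: 0.0})
print(ok.is_valid(), bad.is_valid())
```

Representing a zygote as an (egg, sperm) tuple keeps Mate a literal subset of Gam0×Gam1, so Mate∗ falls out of a single comprehension.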
Gametic lineage space
A gametic lineage space is a mathematical formalism representing the lines of
transmission of genetic information via gametes of a population over time.
It is a triplet
(Loc,G,Lin)
where
Loc is the set of all genomic locations,
G is a gametic genealogy (Gam,Mate,Par,Fert), and
Lin is a function Loc×Gam→2^Gam mapping a genomic position
in a gamete to the set of gametes that transmitted genetic information to that
position in that gamete.
For every location ℓ∈Loc and gamete g∈Gam, Lin(ℓ,g) is
the lineage ending at gamete g via locus ℓ and it must satisfy the
condition
Lin(ℓ,g)={g}∪Lin(ℓ,Par(g)_i) for either i=0 or i=1
when g∈domPar, otherwise Lin(ℓ,g)={g}.
Par(g)_0 and Par(g)_1 are the maternal and paternal gametes,
respectively, that fused to form the parent zygote of g.
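The recursive condition on Lin can be unrolled into a simple walk up the genealogy. The following is a minimal sketch under stated assumptions: `par` is a dictionary encoding Par with zygotes as (egg, sperm) tuples, and `choice` is a hypothetical function returning, for each location and child gamete, which parental gamete (0 for maternal, 1 for paternal) transmitted that location. Neither name comes from the formalism itself.

```python
def lineage(loc, g, par, choice):
    """Return Lin(loc, g) as a frozenset of gametes.

    par:    dict, child gamete -> parent zygote as an (egg, sperm) tuple
    choice: hypothetical function (loc, child gamete) -> 0 or 1
    """
    lin = {g}
    while g in par:
        # Step to the maternal (0) or paternal (1) gamete of the parent
        # zygote; the right-hand side uses the current child g.
        g = par[g][choice(loc, g)]
        lin.add(g)
    return frozenset(lin)


# Toy pedigree: e1 comes from zygote (e0, s0), e2 from zygote (e1, s1).
par = {"e1": ("e0", "s0"), "e2": ("e1", "s1")}


def choice(loc, g):
    # Assumed toy rule: a single crossover at position 0.5 in every
    # meiosis, maternal material to the left, paternal to the right.
    return 0 if loc < 0.5 else 1


print(sorted(lineage(0.25, "e2", par, choice)))
print(sorted(lineage(0.75, "e2", par, choice)))
```

On a finite genealogy the loop terminates because the condition on Fert forces fertilization times to decrease strictly at every step up the lineage.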
Stochastic gametic lineage space
A stochastic gametic lineage space is a gametic lineage space extended to model a
random gametic lineage.
From a stochastic gametic lineage space, a time-indexed family of probability
distributions {P_t}_{t∈I} is induced.
TO DO:
Need to rework the space formalism to clarify what the σ-algebra is defined over
(see issues #42).
For convenience we define the set of all lineages that contain a gamete g∈Gam as
B(g):={Lin(ℓ,s):g∈Lin(ℓ,s),ℓ∈Loc,s∈Gam} .
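For a finite lineage space, B(g) can be computed by filtering the values of Lin. The sketch below assumes Lin is stored as a dictionary `lin` from (location, gamete) pairs to frozensets of gametes; that encoding and the function name `lineages_containing` are illustrative choices, not part of the definition.

```python
def lineages_containing(g, lin):
    """B(g): the set of lineages Lin(loc, s) that contain gamete g.

    lin: dict, (location, gamete) -> frozenset of gametes (assumed encoding)
    """
    return {lineage for lineage in lin.values() if g in lineage}


# Toy table with two loci (0 and 1) and lineages ending at e1 and e2.
lin = {
    (0, "e1"): frozenset({"e1", "e0"}),
    (1, "e1"): frozenset({"e1", "s0"}),
    (0, "e2"): frozenset({"e2", "e1", "e0"}),
    (1, "e2"): frozenset({"e2", "e1", "s0"}),
}
b_e0 = lineages_containing("e0", lin)
```

Because the result is a set of sets, two (location, gamete) pairs with identical lineages contribute a single element to B(g).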
Formally, a stochastic gametic lineage space is a quintuple
(G,I,{S_t}_{t∈I},F,μ)
where
G is a gametic lineage space (Loc,(Gam,Mate,Par,Fert),Lin),
I is an index set of points in time with rngFert⊂I,
{S_t}_{t∈I} is a time-indexed collection of sets of living zygotes,
F is a σ-algebra (sigma-field) over rngLin, and
μ is a measure on F
which satisfy the following conditions:
B(g)∈F for all g∈Gam, and
μ(S_t) is defined and finite for all t∈I.
Every stochastic gametic lineage space induces a time-indexed family
{P_t}_{t∈I} of probability spaces on the σ-algebra F.
This defines the probability of lineages that end in a zygote alive at time t.
TO DO:
Need to clarify relationship between F and Loc and Mate for when they are
uncountable.
An embedded Ancestral Recombination Graph
An ancestral recombination graph [1] [2] [3]
of a sampled population is embedded in a gametic lineage space.
We formally show the exact embedding using the gARG formalism [4].
We start by defining the genetic legacy of a gamete g∈Gam for sample population
S⊆Gam to be
Leg(g,S):={(ℓ,d)∈Loc×S:g∈Lin(ℓ,d)} .
This genetic legacy is the genetic material that survives in the sample
population S originally copied from ancestral gamete g (with or without
mutations).
Genetic legacy for a sample population S induces the following equivalence
relation over pairs of gametes g_1 and g_2 in Gam:
g_1 ≃_S g_2 := Leg(g_1,S)=Leg(g_2,S) .
We denote the resulting equivalence class containing g∈Gam as
[g]_S := {g′:Leg(g′,S)=Leg(g,S)} .
Under this equivalence relation, gametes are considered equivalent if
they have the same genetic legacy for the sample population S.
A convenient choice for an embedded gARG [4] is to set the gARG
nodes (vertices) to be the equivalence classes:
Nodes(S):={[g]S:g∈Gam} .
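For a finite example, the genetic legacy and the resulting node set can be computed directly from the definitions. This is a sketch under assumptions: Lin is stored as a dictionary `lin` keyed by (location, sample gamete), Loc is a finite tuple `locs`, and the function names `legacy` and `nodes` are chosen here for illustration.

```python
from collections import defaultdict


def legacy(g, sample, lin, locs):
    """Leg(g, S): pairs (loc, d) with d in S whose lineage contains g."""
    return frozenset(
        (loc, d) for loc in locs for d in sample if g in lin[(loc, d)]
    )


def nodes(gametes, sample, lin, locs):
    """Nodes(S): group gametes into classes of equal genetic legacy."""
    classes = defaultdict(set)
    for g in gametes:
        classes[legacy(g, sample, lin, locs)].add(g)
    return {frozenset(c) for c in classes.values()}


# Toy data: two loci, one sampled gamete e2. Locus 0 traces back through
# e1 to e0, locus 1 traces back through e1 to s0.
gametes = {"e0", "s0", "e1", "s1", "e2"}
sample = {"e2"}
locs = (0, 1)
lin = {
    (0, "e2"): frozenset({"e2", "e1", "e0"}),
    (1, "e2"): frozenset({"e2", "e1", "s0"}),
}
node_set = nodes(gametes, sample, lin, locs)
```

In this toy case e1 and e2 fall into the same node because both carry the whole sampled genome, while s1 sits alone in the class of gametes with empty legacy.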
The (unannotated) graph edges of the gARG are chosen as child-parent node pairs
(C,P)∈Nodes(S)^2 where
Par(g)_i∈P for some g∈C and some i∈{0,1} .
In the gARG, annotations are added for each graph edge (pair of child and parent nodes).
This annotation is the set of locations through which genetic information has been
copied from parent to child. In the following interpretation, the only locations of
interest are those for which genetic information has been transmitted into
the sample population S.
With this interpretation, the annotation for edge (C,P) is
{ℓ∈Loc:C∪P⊆Lin(ℓ,g) for some g∈S} .
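The edge and annotation definitions can likewise be sketched for a finite example. The code below assumes nodes are given as frozensets of gametes, Par is a dictionary `par` from child gametes to (egg, sperm) tuples, and Lin is a dictionary `lin` keyed by (location, sample gamete). The function name `garg_edges` is hypothetical. One further assumption goes beyond the text: pairs with C=P, which arise when a gamete and its parental gamete share a class, are skipped rather than emitted as self-loops.

```python
def garg_edges(nodes, par, lin, sample, locs):
    """Map each edge (C, P) to its annotation, a frozenset of locations."""
    node_of = {g: n for n in nodes for g in n}
    edges = {}
    for child in nodes:
        for g in child:
            # par.get(g, ()) yields the maternal and paternal gametes of
            # the parent zygote, or nothing if g has no recorded parent.
            for p in par.get(g, ()):
                parent = node_of[p]
                if parent == child:
                    continue  # assumption: no self-loops
                edges[(child, parent)] = frozenset(
                    loc for loc in locs
                    if any((child | parent) <= lin[(loc, d)] for d in sample)
                )
    return edges


# Toy data: e1 comes from zygote (e0, s0), e2 from zygote (e1, s1).
C = frozenset({"e1", "e2"})
E0, S0, S1 = frozenset({"e0"}), frozenset({"s0"}), frozenset({"s1"})
par = {"e1": ("e0", "s0"), "e2": ("e1", "s1")}
lin = {
    (0, "e2"): frozenset({"e2", "e1", "e0"}),
    (1, "e2"): frozenset({"e2", "e1", "s0"}),
}
edges = garg_edges({C, E0, S0, S1}, par, lin, sample={"e2"}, locs=(0, 1))
```

Edges whose annotation comes out empty, such as the one to the node containing s1 here, carry no material into the sample and could be filtered out in a later step.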
Acknowledgements
Thanks to Daria Shipilina and Nick Barton for sharing their preprint [5]
and discussing the conjecture in edition 0.1 of this document relating to their
preprint.
Changes from edition 0.1
References
1. Griffiths RC, Marjoram P (1997) An Ancestral Recombination Graph. In: Friedman A, Miller W, Donnelly P, Tavaré S (eds) Progress in Population Genetics and Human Evolution. Springer New York, New York, NY, pp 257–270
2. Hein J, Schierup MH, Wiuf C (2005) Gene genealogies, variation and evolution: A primer in coalescent theory. Oxford University Press, Oxford; New York
3. Wakeley J (2009) Coalescent theory: An introduction. Roberts & Co. Publishers, Greenwood Village, Colo
4. Wong Y, Ignatieva A, Koskela J, et al (2022) A general and efficient representation of ancestral recombination graphs. https://archive.softwareheritage.org/swh:1:rev:7df4f1995028cc676a6c1b231e8d7a024666b5fc