## Gametic genealogy

A gametic genealogy is a convenient mathematical formalism of the genealogy of a population from the perspective of gametes. Mathematically, it is a quadruple $(\mathsf{Gam}, \mathsf{Mate}, \mathsf{Par}, \mathsf{Fert})$ with components

• $\mathsf{Gam}$, the set of underlying gametes,

• $\mathsf{Mate}$, the set of zygotes formed by the fusion of egg gametes and sperm gametes,

• $\mathsf{Par}$, a mapping from child gametes to parent zygotes, and

• $\mathsf{Fert}$, a mapping from zygotes to fertilization time.

For convenience, given a gametic genealogy,

• $\mathsf{Gam}_0$ denotes the set of egg gametes,

• $\mathsf{Gam}_1$ denotes the set of sperm gametes, and

• $\mathsf{Mate}_*$ denotes the mapping from gametes to the zygotes they formed during fertilization.

Formally, a gametic genealogy must satisfy the following conditions.

1. $\mathsf{Gam}_0 \cup \mathsf{Gam}_1 = \mathsf{Gam}$ and $\mathsf{Gam}_0 \cap \mathsf{Gam}_1 = \emptyset$.

2. $\mathsf{Mate}\subset \mathsf{Gam}_0 \times \mathsf{Gam}_1$ and $\mathsf{Mate}$ forms a one-to-one correspondence between $\mathsf{Gam}_0$ and $\mathsf{Gam}_1$, so every gamete belongs to exactly one zygote.

3. $\mathsf{Par}$ is a function $C \to \mathsf{Mate}$, where $C$ is a subset of $\mathsf{Gam}$ representing child gametes.

4. $\mathsf{Fert}$ is a function $\mathsf{Mate}\to \mathbb{R}$ such that for all child gametes $g \in \operatorname{dom}\mathsf{Par}$, $\mathsf{Fert}(\mathsf{Mate}_*(g)) > \mathsf{Fert}(\mathsf{Par}(g)) \text{ ,}$ where $\operatorname{dom}\mathsf{Par}$ denotes the domain of $\mathsf{Par}$, that is, the set of child gametes. In other words, the zygote that a child gamete helps form is fertilized strictly later than the parent zygote that produced the gamete.
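The four conditions can be checked mechanically for a finite data set. The following is a minimal sketch, assuming gametes are represented as strings, zygotes as `(egg, sperm)` tuples, and $\mathsf{Par}$ and $\mathsf{Fert}$ as dictionaries; the function name `is_gametic_genealogy` and the toy data are hypothetical and chosen only for illustration.

```python
def is_gametic_genealogy(gam0, gam1, mate, par, fert):
    """Check conditions 1-4 for finite data; Gam is taken to be gam0 | gam1."""
    gam = gam0 | gam1
    # Condition 1: eggs and sperm are disjoint (their union is Gam by construction).
    if gam0 & gam1:
        return False
    # Condition 2: Mate is a set of (egg, sperm) pairs that pairs every egg
    # with exactly one sperm and every sperm with exactly one egg.
    if not all(e in gam0 and s in gam1 for e, s in mate):
        return False
    if len(mate) != len(gam0) or len(mate) != len(gam1):
        return False
    if {e for e, _ in mate} != gam0 or {s for _, s in mate} != gam1:
        return False
    # Mate_*: each gamete maps to the zygote it helped form.
    mate_star = {g: z for z in mate for g in z}
    # Condition 3: Par maps child gametes (a subset of Gam) to zygotes in Mate.
    if not all(g in gam and z in mate for g, z in par.items()):
        return False
    # Condition 4: Fert is defined on all of Mate, and every child gamete is
    # fertilized strictly later than its parent zygote was.
    if set(fert) != mate:
        return False
    return all(fert[mate_star[g]] > fert[z] for g, z in par.items())


# Toy data (assumed): two founder zygotes and one zygote formed by their gametes.
gam0 = {"e0", "e1", "e2"}
gam1 = {"s0", "s1", "s2"}
mate = {("e0", "s0"), ("e1", "s1"), ("e2", "s2")}
par = {"e2": ("e0", "s0"), "s2": ("e1", "s1")}
fert = {("e0", "s0"): 0.0, ("e1", "s1"): 0.0, ("e2", "s2"): 1.0}

print(is_gametic_genealogy(gam0, gam1, mate, par, fert))
```

Setting the fertilization time of `("e2", "s2")` to `0.0` would violate condition 4, because the child gametes `e2` and `s2` would then be fertilized no later than their parent zygotes.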

## Gametic lineage space

A gametic lineage space is a mathematical formalism representing the lines of transmission of genetic information via gametes of a population over time. It is a triplet $(\mathsf{Loc}, G, \mathsf{Lin})$ where

• $\mathsf{Loc}$ is the set of all genomic locations,

• $G$ is a gametic genealogy $(\mathsf{Gam}, \mathsf{Mate}, \mathsf{Par}, \mathsf{Fert})$, and

• $\mathsf{Lin}$ is a function $\mathsf{Loc}\times \mathsf{Gam}\to 2^\mathsf{Gam}$ mapping a genomic position in a gamete to the set of gametes that transmitted genetic information to that position in that gamete.

For every location $\ell \in \mathsf{Loc}$ and gamete $g \in \mathsf{Gam}$, $\mathsf{Lin}(\ell, g)$ is the lineage ending at gamete $g$ via locus $\ell$, and it must satisfy the condition $\mathsf{Lin}(\ell, g) = \{g\} \cup \mathsf{Lin}(\ell, \mathsf{Par}(g)_i) \text{ for either } i=0 \text{ or } i=1$ when $g \in \operatorname{dom}\mathsf{Par}$; otherwise $\mathsf{Lin}(\ell, g) = \{g\}$.

$\mathsf{Par}(g)_0$ and $\mathsf{Par}(g)_1$ are the egg (maternal) and sperm (paternal) gametes, respectively, that fused to form $\mathsf{Par}(g)$, the parent zygote of $g$.
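The recursive condition on $\mathsf{Lin}$ can be unrolled into a loop once the index $i$ is known for every locus and child gamete. The sketch below assumes a dictionary `source` that records this choice; it is hypothetical bookkeeping introduced for illustration and is not part of the formalism, which only requires that some valid $i$ exists.

```python
def lineage(loc, g, par, source):
    """Return Lin(loc, g) as a set of gametes.

    par:    dict mapping each child gamete to its parent zygote (egg, sperm).
    source: dict mapping (loc, child gamete) to 0 or 1, the index i of the
            parental gamete Par(g)_i that transmitted the locus (assumed input).
    """
    lin = {g}
    while g in par:  # gametes outside dom Par end the lineage
        g = par[g][source[(loc, g)]]
        lin.add(g)
    return lin


# Toy data (assumed): e2 is recombinant, s2 is copied entirely from e1.
par = {"e2": ("e0", "s0"), "s2": ("e1", "s1")}
source = {(0, "e2"): 0, (1, "e2"): 1, (0, "s2"): 0, (1, "s2"): 0}

print(lineage(0, "e2", par, source))  # locus 0 of e2 traces back to e0
print(lineage(1, "e2", par, source))  # locus 1 of e2 traces back to s0
```

The loop terminates for any finite genealogy satisfying condition 4, since each step moves to a gamete whose zygote was fertilized strictly earlier.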

## An Embedded Ancestral Recombination Graph

An ancestral recombination graph of a sampled population is embedded in a gametic lineage space. We formally show the exact embedding using the gARG formalism.

We start by defining the genetic legacy of a gamete $g \in \mathsf{Gam}$ for sample population $S \subseteq \mathsf{Gam}$ to be $\mathsf{Leg}(g, S) := \{ (\ell, d) \in \mathsf{Loc}\times S : g \in \mathsf{Lin}(\ell, d) \} \text{ .}$ This genetic legacy is the genetic material that survives in the sample population $S$ originally copied from ancestral gamete $g$ (with or without mutations).
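As an illustration of the definition, the following sketch computes $\mathsf{Leg}(g, S)$ from a finite table of lineages. The dictionary `lin`, keyed by `(locus, gamete)`, and the toy values in it are assumptions made for this example; only the entries for sampled gametes are needed.

```python
def legacy(g, sample, lin):
    """Return Leg(g, S): the (locus, sampled gamete) pairs whose lineage contains g."""
    return {
        (loc, d)
        for (loc, d), gametes in lin.items()
        if d in sample and g in gametes
    }


# Toy lineages (assumed) for two loci and two sampled gametes.
sample = {"e2", "s2"}
lin = {
    (0, "e2"): {"e2", "e0"},
    (1, "e2"): {"e2", "s0"},
    (0, "s2"): {"s2", "e1"},
    (1, "s2"): {"s2", "e1"},
}

print(legacy("e0", sample, lin))  # e0 survives only at locus 0 of e2
print(legacy("s1", sample, lin))  # s1 left nothing in the sample
```

In this toy data, `e1` and `s2` have the same legacy because `s2` inherited both loci from `e1`.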

QUESTIONS FOR FEEDBACK: Would "gametic legacy" be a more useful wording than "genetic legacy"? Would some word other than "legacy" be more clear?

Genetic legacy for a sample population $S$ induces the following equivalence relation over pairs of gametes $g_1$ and $g_2$ in $\mathsf{Gam}$: $g_1 \simeq_S g_2 \ := \ \mathsf{Leg}(g_1, S) = \mathsf{Leg}(g_2, S) \text{ .}$ We denote the resulting equivalence class containing $g \in \mathsf{Gam}$ as ${[g]}_S \ := \ \{ g' : \mathsf{Leg}(g', S) = \mathsf{Leg}(g, S) \} \text{ .}$

Under this equivalence relation, gametes are considered equivalent if they have the same genetic legacy for the sample population $S$.

A convenient choice for an embedded gARG is to set the gARG nodes (vertices) to be the equivalence classes: $\mathsf{Nodes}(S) := \{ {[g]}_S : g \in \mathsf{Gam}\} \text{ .}$

The (unannotated) graph edges of the gARG are chosen as child-parent node pairs $(C, P) \in \mathsf{Nodes}(S)^2$ where $\mathsf{Par}(g)_i \in P \text{ for some } g \in C \text{ and some } i \in \{0, 1\} \text{ .}$

In the gARG, annotations are added for each graph edge (pair of child and parent nodes). This annotation is the set of locations through which genetic information has been copied from parent to child. In the following interpretation, the only locations of interest are those for which genetic information has been transmitted into the sample population $S$. With this interpretation, the annotation for edge $(C,P)$ is $\{ \ell \in \mathsf{Loc}: C \cup P \subseteq \mathsf{Lin}(\ell, g) \text{ for some } g \in S \} \text{ .}$
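The node, edge, and annotation definitions can be applied literally to a small finite example. The sketch below assumes the same hypothetical toy data as a dictionary-based representation (gametes as strings, `par` mapping child gametes to `(egg, sperm)` zygotes, `lin` keyed by `(locus, gamete)`); all names are illustrative and nothing here is claimed to be a reference implementation of the gARG formalism.

```python
# Toy data (assumed): six gametes, two loci, sample S = {e2, s2}.
gam = {"e0", "s0", "e1", "s1", "e2", "s2"}
locs = {0, 1}
sample = {"e2", "s2"}
par = {"e2": ("e0", "s0"), "s2": ("e1", "s1")}
lin = {
    (0, "e2"): {"e2", "e0"},
    (1, "e2"): {"e2", "s0"},
    (0, "s2"): {"s2", "e1"},
    (1, "s2"): {"s2", "e1"},
}


def legacy(g):
    """Leg(g, S) as a frozenset of (locus, sampled gamete) pairs."""
    return frozenset((loc, d) for loc in locs for d in sample if g in lin[(loc, d)])


def node_of(g):
    """The equivalence class [g]_S: all gametes with the same legacy as g."""
    return frozenset(h for h in gam if legacy(h) == legacy(g))


# Nodes(S): one node per equivalence class.
nodes = {node_of(g) for g in gam}

# Edges: (C, P) such that Par(g)_i is in P for some g in C and some i.
edges = {(node_of(g), node_of(p)) for g, zygote in par.items() for p in zygote}


def annotation(child, parent):
    """Loci whose lineage into some sampled gamete contains all of C and P."""
    return {loc for loc in locs for d in sample if (child | parent) <= lin[(loc, d)]}


annotated = {edge: annotation(*edge) for edge in edges}
for (child, parent), loci in annotated.items():
    print(sorted(child), "->", sorted(parent), sorted(loci))
```

In this toy example the definitions, applied as written, produce a self-loop on the class containing `e1` and `s2`, and an edge to the class of `s1` with an empty annotation. The text above does not say whether such edges should be filtered, so the sketch keeps them.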

## Acknowledgements

Thanks to Daria Shipilina and Nick Barton for sharing their preprint and discussing the conjecture in edition 0.1 of this document relating to their preprint.