Fs4CjTor07Ssbb69Qcggp5yiBA0/0.4
Abstract
STAGE: WORKING DRAFT
DOCUMENT TYPE: Formal Mathematical Definition
OBJECTIVES:
- Formal mathematical definition of the stochastic process used by a statistical estimator of admixture timing that is under development.
- Precise mathematical definitions for technical discussions relating to ancestral recombination graphs.
Notation
"$\operatorname{dom} f$" denotes the domain of function $f$
"$[P]$" is Iverson bracket notation for the indicator function of a statement $P$, namely $[P] = 1$ if $P$ is true and $[P] = 0$ otherwise
Fertilization Function
A haploid lineage process is formally defined in terms of a given fixed fertilization function, which we denote with the symbol . This function maps possible occurrences of fertilization to points in time.
From a given fertilization function , we define three convenient symbols:
: the range of (a set of real numbers representing points in time),
: the domain of (a set of possible diploid organisms), and
: the set representing fertilizing gametes.
For a diploid , the members and of index the fertilizing egg and sperm gametes, respectively.
For convenience, we map to fertilization times with: for all and .
We denote the following inverse images as:
A technical requirement on any given in this document is that must be countable for every .
An example of a valid fertilization function is where .
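As a concrete illustration, the objects in this section can be sketched for a small finite pedigree. This is only a sketch under stated assumptions: the organism names, the dictionary `fertilization_time`, the index convention `EGG = 0` and `SPERM = 1`, and the reading of the inverse images as preimages of single time points are all hypothetical choices made for the example and do not come from the definition itself. Because every set here is finite, the countability requirement holds trivially.

```python
from collections import defaultdict

# Hypothetical toy fertilization function: diploid organism -> fertilization time.
fertilization_time = {"mom": 0.0, "dad": 0.0, "kid": 1.0}

EGG, SPERM = 0, 1  # assumed index convention for the two fertilizing gametes

# The set of fertilizing gametes: one egg and one sperm per diploid organism.
gametes = {(d, i) for d in fertilization_time for i in (EGG, SPERM)}

def gamete_time(gamete):
    """Fertilization time of a gamete, taken from the organism it fertilizes."""
    organism, _ = gamete
    return fertilization_time[organism]

# Inverse images (assumed here to be preimages of single time points):
# the organisms and the gametes associated with each fertilization time.
organisms_at = defaultdict(set)
gametes_at = defaultdict(set)
for d, t in fertilization_time.items():
    organisms_at[t].add(d)
for g in gametes:
    gametes_at[gamete_time(g)].add(g)
```

Representing a gamete as an `(organism, index)` pair keeps the link between a diploid and its two fertilizing gametes explicit, which is convenient when tracing ancestry.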
Haploid lineage process
Given
a fertilization function ,
a set of genomic locations,
and probability space ,
a Haploid lineage process is a stochastic process defined by a time-indexed family of two random variables
Each is a random function from a subset of to . represents the diploid Parent that produced the haploid gamete .
Each is a random function from the domain of to subsets of . represents the genome locations replicated into the gamete from the Paternal haploid genome of the parent (inherited from the father of the parent producing the gamete).
For convenience we define: where .
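To make the two random functions more tangible, the following sketch draws a single toy outcome of such a process. It is not the process defined in this document: the function `sample_outcome`, the uniform choice of a parent among organisms fertilized strictly earlier, the single-crossover rule, and the collapsing of the time index into one pair of dictionaries are all assumptions made for illustration. Sex is ignored, so an egg is not forced to come from a female parent.

```python
import random

EGG, SPERM = 0, 1  # assumed index convention for the two fertilizing gametes

def sample_outcome(fertilization_time, num_locations, seed=0):
    """Draw one toy outcome: a parent map and a paternal-location map."""
    rng = random.Random(seed)
    parent, paternal = {}, {}
    for organism, t in fertilization_time.items():
        earlier = sorted(d for d, u in fertilization_time.items() if u < t)
        if not earlier:
            continue  # founders: their gametes stay outside the parent map's domain
        for side in (EGG, SPERM):
            gamete = (organism, side)
            parent[gamete] = rng.choice(earlier)
            # Single crossover (assumed): one side of the crossover point is
            # copied from the parent's paternal genome, the rest from its
            # maternal genome.
            crossover = rng.randint(0, num_locations)
            left = set(range(crossover))
            right = set(range(crossover, num_locations))
            paternal[gamete] = left if rng.random() < 0.5 else right
    return parent, paternal

times = {"mom": 0.0, "dad": 0.0, "kid": 1.0, "baby": 2.0}
parent, paternal = sample_outcome(times, num_locations=4)
```

The paternal map is built on exactly the same keys as the parent map, mirroring the statement that the second random function is defined on the domain of the first.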
Haploid lineages
A haploid lineage process induces a random function which maps a genomic location in a descendant haploid genome to a haploid lineage. A haploid lineage consists of all the haploids transmitting genetic information via a genomic location to a descendant haploid. For every genomic location and haploid , its haploid lineage is where is defined inductively as follows: and for integers ,
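The inductive construction can be sketched as a simple loop over one fixed outcome. The dictionaries `parent` and `paternal`, the pedigree, and the index convention `EGG = 0`, `SPERM = 1` are hypothetical. The sketch assumes that a location in the paternal set of a gamete was copied from the sperm gamete that fertilized its parent, and otherwise from the egg gamete. It returns the lineage as an ordered list; the haploid lineage as a set is the set of its elements.

```python
EGG, SPERM = 0, 1  # assumed index convention for the two fertilizing gametes

# Hypothetical outcome: gamete -> diploid parent that produced it.
parent = {("kid", EGG): "mom", ("kid", SPERM): "dad", ("baby", EGG): "kid"}

# Hypothetical outcome: gamete -> locations copied from the parent's paternal genome.
paternal = {
    ("kid", EGG): {0, 1},
    ("kid", SPERM): {0, 1, 2, 3},
    ("baby", EGG): {1, 2},
}

def haploid_lineage(location, haploid):
    """Chain of haploids that transmitted `location` down to `haploid`."""
    lineage = [haploid]
    # Terminates because parents are fertilized before their offspring.
    while haploid in parent:
        side = SPERM if location in paternal[haploid] else EGG
        haploid = (parent[haploid], side)
        lineage.append(haploid)
    return lineage

print(haploid_lineage(0, ("baby", EGG)))  # [('baby', 0), ('kid', 0), ('mom', 1)]
print(haploid_lineage(1, ("baby", EGG)))  # [('baby', 0), ('kid', 1), ('dad', 1)]
```

Different locations of the same haploid can follow different lineages, which is how recombination shows up in this representation.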
An embedded Ancestral Recombination Graph
An ancestral recombination graph [1] [2] [3] of a sampled population is embedded in any outcome of any haploid lineage process. We formally show the exact embedding using the gARG formalism [4].
We start by defining the genetic legacy of an ancestral haploid for sample population to be
The genetic legacy is the genetic material, originally copied from the ancestral haploid, that survives in the sample population (with or without mutations).
Genetic legacy for a sample population induces the following equivalence relationship over pairs of haploids and in : We denote the resulting equivalence class containing as
Under this equivalence relation, two haploids are equivalent exactly when they have the same genetic legacy for the sample population.
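A minimal sketch of genetic legacy and the resulting equivalence classes follows. It assumes that the legacy of a haploid can be represented as the set of (location, sample haploid) pairs whose haploid lineage passes through it; that representation, the toy pedigree, and all names in the code are hypothetical.

```python
from collections import defaultdict

EGG, SPERM = 0, 1  # assumed index convention for the two fertilizing gametes
LOCATIONS = range(4)

# Hypothetical outcome of a haploid lineage process (toy pedigree).
parent = {("kid", EGG): "mom", ("kid", SPERM): "dad", ("baby", EGG): "kid"}
paternal = {
    ("kid", EGG): {0, 1},
    ("kid", SPERM): {0, 1, 2, 3},
    ("baby", EGG): {1, 2},
}
haploids = [(d, i) for d in ("mom", "dad", "kid", "baby") for i in (EGG, SPERM)]
sample = [("baby", EGG)]

def haploid_lineage(location, haploid):
    """Chain of haploids that transmitted `location` down to `haploid`."""
    lineage = [haploid]
    while haploid in parent:
        side = SPERM if location in paternal[haploid] else EGG
        haploid = (parent[haploid], side)
        lineage.append(haploid)
    return lineage

# Genetic legacy: the (location, sample haploid) pairs inherited from each haploid.
legacy = {h: set() for h in haploids}
for s in sample:
    for x in LOCATIONS:
        for ancestor in haploid_lineage(x, s):
            legacy[ancestor].add((x, s))

# Equivalence classes: haploids grouped by identical genetic legacy.
classes = defaultdict(set)
for h, pairs in legacy.items():
    classes[frozenset(pairs)].add(h)
```

In this toy outcome the sperm gametes of `kid` and `dad` share the legacy at locations 1 and 2, so they fall into one class, while haploids with no surviving material all share the empty legacy.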
A convenient choice for an embedded gARG [4] is to set the gARG nodes (vertices) to be the equivalence classes:
The (unannotated) graph edges of the gARG are chosen as child-parent node pairs where
In the gARG, annotations are added for each graph edge (pair of child and parent nodes). This annotation is the set of locations through which genetic information has been copied from parent to child. In the following interpretation, the only locations of interest are those for which genetic information has been transmitted into the sample population . With this interpretation, the annotation for edge is
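The edge annotations can be illustrated with the following sketch. It rests on several assumptions that are not taken from the formal definition: a node is identified by the shared genetic legacy of its equivalence class, an edge joins two distinct nodes whenever a haploid in the child node received a location directly from a haploid in the parent node along the lineage of some sample haploid, and the annotation collects those locations. The pedigree and all names are hypothetical.

```python
from collections import defaultdict

EGG, SPERM = 0, 1  # assumed index convention for the two fertilizing gametes
LOCATIONS = range(4)

# Hypothetical outcome of a haploid lineage process (toy pedigree).
parent = {("kid", EGG): "mom", ("kid", SPERM): "dad", ("baby", EGG): "kid"}
paternal = {
    ("kid", EGG): {0, 1},
    ("kid", SPERM): {0, 1, 2, 3},
    ("baby", EGG): {1, 2},
}
sample = [("baby", EGG)]

def haploid_lineage(location, haploid):
    """Chain of haploids that transmitted `location` down to `haploid`."""
    lineage = [haploid]
    while haploid in parent:
        side = SPERM if location in paternal[haploid] else EGG
        haploid = (parent[haploid], side)
        lineage.append(haploid)
    return lineage

lineages = {(x, s): haploid_lineage(x, s) for s in sample for x in LOCATIONS}

# Genetic legacy of every haploid that appears in some sampled lineage.
legacy = defaultdict(set)
for (x, s), lineage in lineages.items():
    for ancestor in lineage:
        legacy[ancestor].add((x, s))

def node(haploid):
    """A node (equivalence class), identified here by its shared legacy."""
    return frozenset(legacy[haploid])

# Annotated edges: (child node, parent node) -> transmitted locations.
annotation = defaultdict(set)
for (x, s), lineage in lineages.items():
    for child, par in zip(lineage, lineage[1:]):
        if node(child) != node(par):  # steps inside one class add no edge
            annotation[(node(child), node(par))].add(x)

def locations_of(node_key):
    """Readable label for a node: the sorted locations in its legacy."""
    return tuple(sorted({x for x, _ in node_key}))

readable = {
    (locations_of(child), locations_of(par)): sorted(xs)
    for (child, par), xs in annotation.items()
}
```

With a single sample haploid, the location labels identify each node uniquely; with several sample haploids the full legacy sets would be needed as labels.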
Acknowledgements
Thanks to Daria Shipilina and Nick Barton for sharing their preprint [5] and discussing the conjecture in edition 0.1 of this document relating to their preprint.
Changes from edition 0.1
added section about embedded ARG
removed conjecture relating to [5]