Diversity analyses

Elementary classes

Filter

class egglib::Filter

Holds lists of valid (exploitable and missing) data codes.

This class holds two lists of integer values: one for data codes that should be treated as valid and exploitable, and one for data codes that should be treated as valid but missing. If the list of exploitable data is empty, all data are considered to be exploitable (even those listed as missing).

Header: <egglib-cpp/Filter.hpp>

Public Functions

egglib::Filter::Filter()

Constructor.

egglib::Filter::Filter(const Filter &src)

Copy constructor.

virtual egglib::Filter::~Filter()

Destructor.

void egglib::Filter::add_exploitable(int code)

Add an exploitable data code.

By default (that is, if no exploitable data codes have been entered), all data values are considered to be exploitable, even if they appear in missing. If one or more data have been set as exploitable, only those will be considered to be exploitable.

void egglib::Filter::add_exploitable_range(int first, int last)

Add a range of exploitable data codes.

Like add_exploitable(int), but adds all values between first and last (both included). Exploitable codes are not required to have a synonym.

void egglib::Filter::add_exploitable_with_alias(int code, int alias)

Add an exploitable data code with a synonym.

Similar to add_exploitable(int), except that a second parameter specifies a synonym for the main code. If a data item matching alias is passed to is_exploitable(), it will be replaced by code and the function will return true. For a large range of contiguous codes, do not call this method within a loop; use the more efficient method add_exploitable_range(int, int) instead.

void egglib::Filter::add_missing(int code)

Add a missing data code.

void egglib::Filter::add_missing_with_alias(int code, int alias)

Add a missing data code with a synonym.

Similar to add_missing(int), except that a second parameter is passed to specify a synonym for the main code. If a data item matching alias is passed to is_missing(int), it will be replaced by code and the function will return true.

int egglib::Filter::check(int code, bool &flag)
const

Check data value.

Return: (i) the allelic value if it is exploitable as is; (ii) the main value if the provided value is a synonym for an exploitable value; (iii) MISSINGDATA if it is one of the missing data codes or synonyms; (iv) MISSINGDATA if the value is invalid (and in addition the flag argument is set to true).
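
The four return cases can be illustrated with a small standalone sketch. It does not use EggLib: MiniFilter, its members, make_dna_filter() and the MISSINGDATA_SKETCH constant are hypothetical stand-ins, and the value chosen for the missing-data marker is an assumption. The sketch also assumes that at least one exploitable code has been entered (the empty-list default is ignored) and that main codes are matched before synonyms.

```cpp
#include <cassert>
#include <map>
#include <set>

// Hypothetical stand-in for the MISSINGDATA constant (value assumed).
const int MISSINGDATA_SKETCH = -1;

// Minimal illustration of the documented lookup rules; not the egglib class.
struct MiniFilter {
    std::set<int> exploitable;         // codes usable as is
    std::map<int, int> expl_alias;     // synonym -> main exploitable code
    std::set<int> missing;             // codes treated as missing
    std::map<int, int> missing_alias;  // synonym -> main missing code

    void add_exploitable_with_alias(int code, int alias) {
        assert(code != alias);  // sanity check of this sketch only
        exploitable.insert(code);
        expl_alias[alias] = code;
    }

    void add_missing_with_alias(int code, int alias) {
        assert(code != alias);
        missing.insert(code);
        missing_alias[alias] = code;
    }

    // Mirrors the four documented outcomes of check().
    int check(int code, bool &flag) const {
        if (exploitable.count(code)) return code;            // (i) exploitable as is
        std::map<int, int>::const_iterator it = expl_alias.find(code);
        if (it != expl_alias.end()) return it->second;       // (ii) synonym -> main code
        if (missing.count(code) || missing_alias.count(code))
            return MISSINGDATA_SKETCH;                       // (iii) missing code or synonym
        flag = true;                                         // (iv) invalid value
        return MISSINGDATA_SKETCH;
    }
};

// Upper-case nucleotides are exploitable, lower-case letters are synonyms,
// and 'N' (synonym 'n') is missing.
inline MiniFilter make_dna_filter() {
    MiniFilter f;
    const char upper[] = {'A', 'C', 'G', 'T'};
    const char lower[] = {'a', 'c', 'g', 't'};
    for (int i = 0; i < 4; ++i) f.add_exploitable_with_alias(upper[i], lower[i]);
    f.add_missing_with_alias('N', 'n');
    return f;
}
```

With this filter, check('g', flag) yields 'G' and leaves flag untouched, whereas check('?', flag) yields the missing-data marker and sets flag to true.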

void egglib::Filter::clear()

Reset instance.

The instance is restored to its newly created state, and allocated memory is actually released.

bool egglib::Filter::is_exploitable(int &code)
const

Check if a data value is exploitable.

If the list of exploitable data is empty, this method always returns true. Otherwise, it checks whether the value matches one of the values in the exploitable list. If the value does not match any exploitable code but matches one of the synonyms entered using add_exploitable_with_alias(int, int), the method returns true and modifies the passed value (the synonym is replaced by the reference exploitable code). In all other cases, the value is never modified.

bool egglib::Filter::is_missing(int &code)
const

Check if a data value is missing.

Returns true if the code matches one of the missing data codes. If it instead matches one of the synonyms, returns true and replaces the value with the corresponding main code. Otherwise, returns false.

Filter &egglib::Filter::operator=(const Filter &src)

Copy assignment operator.

void egglib::Filter::reserve(unsigned int num_expl, unsigned int num_expl_ranges, unsigned int num_missing, unsigned int num_synonyms, unsigned int num_missing_synonyms)

Reserve memory.

The method pre-allocates data arrays in order to speed up subsequent loading operations (up to the numbers passed). The instance is not formally changed by this method, and calling it before setting valid or missing data codes is not required.

Parameters
  • num_expl - expected number of exploitable data codes.
  • num_expl_ranges - expected number of ranges of exploitable data codes.
  • num_missing - expected number of missing data codes.
  • num_synonyms - expected number of synonyms (it is not required that all exploitable codes have a synonym).
  • num_missing_synonyms - expected number of synonyms for missing data (it is not required that all missing codes have a synonym).

Structure

class egglib::StructureHolder

Manage hierarchical group structure.

Public Functions

StructureCluster *egglib::StructureHolder::add_cluster(unsigned int label)

Add a cluster with no samples in it.

StructureIndiv *egglib::StructureHolder::add_individual_ingroup(unsigned int label, StructureCluster *cluster, StructurePopulation *population)

Add an ingroup individual with no samples in it.

StructureIndiv *egglib::StructureHolder::add_individual_outgroup(unsigned int label)

Add an outgroup individual with no samples in it.

void egglib::StructureHolder::add_pop_filter(unsigned int lbl)

Add a population label to filter.

If at least one label is passed, only the populations with the passed labels are processed. By default, all populations are included. Use reset_filter() to reset to default.

StructurePopulation *egglib::StructureHolder::add_population(unsigned int label, StructureCluster *cluster)

Add a population with no samples in it.

void egglib::StructureHolder::add_sample_ingroup(unsigned int sam_idx, StructureCluster *cluster, StructurePopulation *population, StructureIndiv *indiv)

Add one ingroup sample.

void egglib::StructureHolder::add_sample_outgroup(unsigned int sam_idx, StructureIndiv *indiv)

Add one outgroup sample.

void egglib::StructureHolder::check_ploidy(unsigned int value)

Ensure ploidy is consistent and optionally equal to passed value.

Automatically called by get_structure(). Needs to be called explicitly if process_ingroup() and/or process_outgroup() is used.

Value must be >0.

void egglib::StructureHolder::copy(const StructureHolder &source)

Copy data from source object.

const StructureCluster &egglib::StructureHolder::get_cluster(unsigned int idx)
const

Get a cluster.

const StructureIndiv &egglib::StructureHolder::get_indiv_ingroup(unsigned int idx)
const

Get an ingroup individual.

const StructureIndiv &egglib::StructureHolder::get_indiv_outgroup(unsigned int idx)
const

Get an outgroup individual.

unsigned int egglib::StructureHolder::get_no()
const

Get number of outgroup samples.

unsigned int egglib::StructureHolder::get_no_req()
const

Get required number of outgroup samples.

unsigned int egglib::StructureHolder::get_ns()
const

Get number of ingroup samples.

unsigned int egglib::StructureHolder::get_ns_req()
const

Get required number of ingroup samples.

unsigned int egglib::StructureHolder::get_ploidy()
const

Get ploidy.

Default is UNKNOWN.

unsigned int egglib::StructureHolder::get_pop_index(unsigned int)
const

Index of the population containing this sample (default: MISSING).

const StructurePopulation &egglib::StructureHolder::get_population(unsigned int idx)
const

Get a population.

void egglib::StructureHolder::get_structure(DataHolder &data, unsigned int lvl_clust, unsigned int lvl_pop, unsigned int lvl_indiv, unsigned int ploidy, bool skip_outgroup)

Process labels from a DataHolder.

Use UNKNOWN for any level to skip it (but skipping individuals is not the same as skipping clusters or populations). The population filter must be set separately.

unsigned int egglib::StructureHolder::num_clust()
const

Number of clusters.

unsigned int egglib::StructureHolder::num_indiv_ingroup()
const

Number of ingroup individuals (total).

unsigned int egglib::StructureHolder::num_indiv_outgroup()
const

Number of outgroup individuals.

unsigned int egglib::StructureHolder::num_pop()
const

Number of populations (total).

void egglib::StructureHolder::process_ingroup(unsigned int idx, unsigned int lbl_clust, unsigned int lbl_pop, unsigned int lbl_indiv)

Process one ingroup sample.

void egglib::StructureHolder::process_outgroup(unsigned int idx, unsigned int lbl_indiv)

Process one outgroup sample.

void egglib::StructureHolder::reserve_filter(unsigned int howmany)

Pre-allocate the filter table.

void egglib::StructureHolder::reset()

Reset to defaults.

void egglib::StructureHolder::reset_filter()

Reset the population filter only.

class egglib::StructureCluster

Manage a cluster.

Public Functions

StructureIndiv *egglib::StructureCluster::add_indiv(StructurePopulation *pop, unsigned int label)

Add and create an individual.

StructurePopulation *egglib::StructureCluster::add_pop(unsigned int label)

Add and create a population.

void egglib::StructureCluster::add_sample()

Add a sample.

const StructureIndiv &egglib::StructureCluster::get_indiv(unsigned int idx)
const

Get an individual.

unsigned int egglib::StructureCluster::get_label()
const

Get label.

const StructurePopulation &egglib::StructureCluster::get_population(unsigned int idx)
const

Get a population.

unsigned int egglib::StructureCluster::num_indiv()
const

Number of individuals (total for this cluster).

unsigned int egglib::StructureCluster::num_pop()
const

Number of populations.

void egglib::StructureCluster::reset(StructureHolder *holder, unsigned int label)

Restore defaults.

class egglib::StructurePopulation

Manage a population.

Public Functions

StructureIndiv *egglib::StructurePopulation::add_indiv(unsigned int label)

Add and create an individual.

void egglib::StructurePopulation::add_sample()

Add a sample.

StructureCluster *egglib::StructurePopulation::get_cluster()

Get containing cluster.

const StructureIndiv &egglib::StructurePopulation::get_indiv(unsigned int idx)
const

Get an individual.

unsigned int egglib::StructurePopulation::get_label()
const

Get label.

unsigned int egglib::StructurePopulation::num_indiv()
const

Number of individuals.

void egglib::StructurePopulation::reset(StructureHolder *holder, StructureCluster *cluster, unsigned int label)

Restore defaults.

class egglib::StructureIndiv

Manage an individual.

Public Functions

void egglib::StructureIndiv::add_sample(unsigned int index)

Add a sample.

StructureCluster *egglib::StructureIndiv::get_cluster()

Get containing cluster (NULL if outgroup).

unsigned int egglib::StructureIndiv::get_label()
const

Get label.

StructurePopulation *egglib::StructureIndiv::get_population()

Get containing population (NULL if outgroup).

unsigned int egglib::StructureIndiv::get_sample(unsigned int idx)
const

Get a sample.

unsigned int egglib::StructureIndiv::num_samples()
const

Number of samples.

void egglib::StructureIndiv::reset(StructureHolder *holder, StructureCluster *cluster, StructurePopulation *population, unsigned int label)

Restore defaults.

Classes analysing frequencies

FreqHolder

class egglib::FreqHolder

Class holding frequencies for all compartments for a site.

Possible uses of this class:

  • Process a site with structure stored in a StructureHolder:
    • setup_structure(structure, ploidy, flag) and keep structure available
    • process_site()
  • Process a site without structure or with a manual structure:
    • setup_raw(nc, np, no, ploidy, flag)
    • setup_pop(i, cluster, relative, ns) for all populations
    • process_site() assuming all individuals are consecutive
  • Enter frequencies manually:
  • Process data from a VCF parser:

Public Functions

egglib::FreqHolder::FreqHolder()

Constructor.

egglib::FreqHolder::~FreqHolder()

Destructor.

int egglib::FreqHolder::allele(unsigned int)
const

Get an allele value.

unsigned int egglib::FreqHolder::cluster_index(unsigned int)
const

Get the index of the cluster of a given population.

const FreqSet &egglib::FreqHolder::frq_cluster(unsigned int)
const

Get frequencies in a cluster.

const FreqSet &egglib::FreqHolder::frq_ingroup()
const

Get frequencies in whole ingroup.

const FreqSet &egglib::FreqHolder::frq_outgroup()
const

Get frequencies in outgroup.

const FreqSet &egglib::FreqHolder::frq_population(unsigned int)
const

Get frequencies in a population.

const unsigned int *egglib::FreqHolder::genotype(unsigned int)
const

Get a genotype (as array of allele indexes) (none if haploid)

bool egglib::FreqHolder::genotype_het(unsigned int)
const

True if genotype is heterozygote.

unsigned int egglib::FreqHolder::genotype_item(unsigned int, unsigned int)
const

Get part of a genotype.

unsigned int egglib::FreqHolder::num_alleles()
const

Number of alleles.

unsigned int egglib::FreqHolder::num_clusters()
const

Get number of clusters.

unsigned int egglib::FreqHolder::num_genotypes()
const

Number of genotypes with non-null frequency (0 if haploid)

unsigned int egglib::FreqHolder::num_populations()
const

Get number of populations.

unsigned int egglib::FreqHolder::ploidy()
const

Ploidy.

void egglib::FreqHolder::process_site(const SiteHolder &site)

Compute frequencies (structure cannot have individual level)

void egglib::FreqHolder::process_vcf(const VcfParser &vcf)

Get frequencies from VCF.

void egglib::FreqHolder::set_genotype_item(unsigned int i, unsigned int j, unsigned int a)

Set part of a genotype.

void egglib::FreqHolder::set_nall(unsigned int na, unsigned int ng)

Set the numbers of alleles and genotypes. Call this before loading frequencies manually.

void egglib::FreqHolder::setup_pop(unsigned int i, unsigned int clu_idx, unsigned int rel_idx, unsigned int ns)

Set up one population. Call this after setup_raw(), once for each population.

void egglib::FreqHolder::setup_raw(unsigned int nc, unsigned int np, unsigned int no, unsigned int ploidy)

Setup manual structure.

void egglib::FreqHolder::setup_structure(const StructureHolder *structure, unsigned int ploidy)

Set up based on provided structure (no individual level)

FreqSet

class egglib::FreqSet

Class holding frequencies in a given compartment.

Public Functions

egglib::FreqSet::FreqSet()

Constructor (all empty)

egglib::FreqSet::~FreqSet()

Destructor.

void egglib::FreqSet::add_genotypes(unsigned int num)

Add genotypes.

unsigned int egglib::FreqSet::frq_all(unsigned int)
const

Get an allele frequency.

unsigned int egglib::FreqSet::frq_gen(unsigned int)
const

Get a genotype frequency.

unsigned int egglib::FreqSet::frq_het(unsigned int)
const

Frequency of heterozygotes having >= 1 copy of the allele.

void egglib::FreqSet::incr_allele(unsigned int all_idx, unsigned int num)

Increment frequency of a given allele.

void egglib::FreqSet::incr_genotype(unsigned int gen_idx, unsigned int num)

Increment frequency of a given genotype.

unsigned int egglib::FreqSet::nieff()
const

Total frequency (number of individuals) (0 if haploid)

unsigned int egglib::FreqSet::nseff()
const

Total frequency (number of samples)

unsigned int egglib::FreqSet::num_alleles()
const

Number of alleles (equal to user-provided value)

unsigned int egglib::FreqSet::num_alleles_eff()
const

Number of alleles with non-null frequency.

unsigned int egglib::FreqSet::num_genotypes()
const

Number of genotypes (user-provided)

unsigned int egglib::FreqSet::num_genotypes_eff()
const

Number of genotypes with non-null frequency.

void egglib::FreqSet::reset(unsigned int)

Set number of alleles (set nsam/ngen to 0)

void egglib::FreqSet::setup()

Setup.

void egglib::FreqSet::tell_het(unsigned int i, unsigned int a)

Tell the class that genotype i is heterozygous for allele a. This method can be called several times. Do not change frequencies after that.

unsigned int egglib::FreqSet::tot_het()
const

Total frequency of heterozygotes.

Site-level operations

SiteHolder

class egglib::SiteHolder

Holds data for a site for diversity analysis.

Usage of this class: first set the ploidy. Then load data from an alignment, from a VCF, or individual by individual. Before loading individuals manually, their number must be set in advance so that the indexes exist. If not all samples are set manually with load_ing() or load_otg(), the remaining alleles must be explicitly set to the default value. Note: the instance is not reset unless this is explicitly requested, so data from successive loads accumulate.

Public Functions

egglib::SiteHolder::SiteHolder()

Constructor (ploidy = 1)

egglib::SiteHolder::SiteHolder(unsigned int ploidy)

Constructor.

virtual egglib::SiteHolder::~SiteHolder()

Destructor.

void egglib::SiteHolder::add_ing(unsigned int num)

Add ingroup individuals.

void egglib::SiteHolder::add_otg(unsigned int num)

Add outgroup individuals.

int egglib::SiteHolder::get_allele(unsigned int)
const

Get an allele (MISSINGDATA for MISSING)

unsigned int egglib::SiteHolder::get_i(unsigned int idv, unsigned int chrom)
const

Get allele index for ingroup (MISSING for missing data)

unsigned int egglib::SiteHolder::get_missing()
const

Number of missing alleles found in the last processed data.

unsigned int egglib::SiteHolder::get_missing_ing()
const

Total number of missing alleles in ingroup.

unsigned int egglib::SiteHolder::get_missing_otg()
const

Total number of missing alleles in outgroup.

unsigned int egglib::SiteHolder::get_nall()
const

Number of alleles.

unsigned int egglib::SiteHolder::get_nall_ing()
const

Number of alleles in ingroup only.

unsigned int egglib::SiteHolder::get_ning()
const

Get number of ingroup individuals.

unsigned int egglib::SiteHolder::get_nout()
const

Get number of outgroup individuals.

unsigned int egglib::SiteHolder::get_o(unsigned int idv, unsigned int chrom)
const

Get allele index for outgroup (MISSING for missing data)

const unsigned int *egglib::SiteHolder::get_pi(unsigned int idv)
const

Same as get_i(), as a pointer.

unsigned int egglib::SiteHolder::get_ploidy()
const

Get ploidy.

const unsigned int *egglib::SiteHolder::get_po(unsigned int idv)
const

Same as get_o(), as a pointer.

unsigned int egglib::SiteHolder::get_straight_i(unsigned int sam)
const

Get allele index for ingroup (MISSING for missing data)

unsigned int egglib::SiteHolder::get_straight_o(unsigned int sam)
const

Get allele index for outgroup (MISSING for missing data)

unsigned int egglib::SiteHolder::get_tot_missing()
const

Total number of missing alleles.

void egglib::SiteHolder::load_ing(unsigned int idv, unsigned int chrom, int allele)

Analyze an ingroup allele (MISSINGDATA for missing data)

void egglib::SiteHolder::load_otg(unsigned int idv, unsigned int chrom, int allele)

Analyze an outgroup allele (MISSINGDATA for missing data)

bool egglib::SiteHolder::process_align(const DataHolder &data, unsigned int idx, const StructureHolder *struc, const Filter &filtr, unsigned int max_missing, bool consider_outgroup_missing)

Process an alignment.

Does not reset instance! Ploidy must be defined before!

Return
true if the number of missing data was not exceeded.
Parameters
  • data - an alignment.
  • idx - index of the site to process.
  • struc - the structure to use (NULL can be passed to process all samples as haploid individuals).
  • filtr - the allele filter.
  • max_missing - maximum number of missing alleles. If there are more missing data, processing stops and the method returns false. In that case, the instance should absolutely not be used further.
  • consider_outgroup_missing - if true, missing data of the outgroup also count towards max_missing; otherwise, only missing data of the ingroup are counted.

bool egglib::SiteHolder::process_vcf(VcfParser &data, unsigned int start, unsigned int stop, unsigned int max_missing)

Import allelic data and compute frequencies from VCF data.

Beware: this method does not reset the instance.

Return
A boolean specifying whether processing was completed.
Parameters
  • data - a VcfParser reference containing data and having the GT format field filled.
  • start - index of the first sample to consider.
  • stop - index of the last sample to consider.
  • max_missing - maximum number of missing alleles. If this number is exceeded, processing is stopped and get_missing() returns max_missing + 1. Only missing data in this data set, and in the ingroup, are considered.

void egglib::SiteHolder::reset(unsigned int ploidy)

Reset all to defaults.

void egglib::SiteHolder::set_allele(unsigned int, int a)
const

Set an allele value.

void egglib::SiteHolder::set_i(unsigned int idv, unsigned int chrom, unsigned int all)

Set allele index for ingroup (MISSING for missing data)

void egglib::SiteHolder::set_nall(unsigned int all, unsigned int ing)

Set the number of alleles (it is up to the caller to keep everything consistent)

void egglib::SiteHolder::set_o(unsigned int idv, unsigned int chrom, unsigned int all)

Set allele index for outgroup (MISSING for missing data)

SiteGeno

class egglib::SiteGeno

SiteHolder subclass to transform regular data to genotypic.

Public Functions

bool egglib::SiteGeno::homoz(unsigned int genotype)
const

Tell if a genotype is homozygous.

void egglib::SiteGeno::process(const SiteHolder &src)

Reset and get data from a site.

SiteDiversity

class egglib::SiteDiversity

Diversity analyses at the level of a site

Computes standard diversity indexes for a unique site or marker.

process() and average() return a composite flag.

Statistics:

  • If fstats_diplo is called
    • npop_eff2 (pops with >= 1 indiv)
  • If fstats_haplo is called
    • npop_eff3 (pops with >= 1 sample)
  • If fstats_hier is called
    • nclu
    • nclu_eff (>= 1 pops each with >= 1 indiv)
    • npop_eff2 (same as for fstats_diplo)
  • flag&1 (always on for process()):
    • ns
  • flag&2:
    • npop
    • Aglob
    • Aing
    • Stot
    • pairdiff
    • He
    • R
    • npop_eff1 (pops with >= 2 samples)
    • He[pop] (for pops with >= 2 samples)
    • pairdiff_pop[pop1][pop2] (for pops with >= 2 samples and pop2 != pop1)
  • flag&4:
    • thetaIAM
    • thetaSMM
  • flag&8:
    • Ho
    • Hi
  • flag&16:
    • Sder
    • der
  • flag&2048:
    • Aout
  • flag&32:
    • n
    • d
  • flag&64:
    • a
    • b
    • c
  • flag&128:
    • a0
    • b1
    • b2
    • c0
  • flag&256:
    • JostD
  • flag&2048:
    • Hst
  • flag&4096:
    • Gst
  • flag&8192:
    • Gste
  • flag&512: ns, Aglob, Aing, Aout, Stot, Sder, and der are actually integers
  • flag&1024: site is polymorphic / there is at least one polymorphic site
    • MAF
    • MAF_pop
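
Since process() and average() return a composite flag, availability of a group of statistics can be tested with a bitwise AND. The sketch below is only an illustration: the constant names and the has_flag() helper are hypothetical and not part of egglib; only the bit values come from the list above.

```cpp
#include <cassert>

// Bit values taken from the list above; the names are hypothetical.
const unsigned int FLAG_NS = 1;             // flag&1: ns
const unsigned int FLAG_DIVERSITY = 2;      // flag&2: npop, Aglob, Aing, ..., He, R
const unsigned int FLAG_POLYMORPHIC = 1024; // flag&1024: polymorphic site(s), MAF

// True if the given bit is set in the composite flag.
inline bool has_flag(unsigned int flag, unsigned int bit) {
    assert(bit != 0);
    return (flag & bit) != 0;
}
```

For example, He would only be read after checking has_flag(flag, FLAG_DIVERSITY) on the value returned by process().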

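The F-statistics and related indexes are derived from these components as follows.

From a, b, and c (fstats_diplo):

  • Fit = 1 - c/(a+b+c)
  • Fst = a/(a+b+c)
  • Fis = 1 - c/(b+c)

From n and d (fstats_haplo):

  • Fst = n/d

From a0, b1, b2, and c0 (fstats_hier):

  • Fit = 1 - c0/(a0+b2+b1+c0)
  • Fst = (a0+b2)/(a0+b2+b1+c0)
  • Fct = a0/(a0+b2+b1+c0)
  • Fis = 1 - c0/(b1+c0)

From hstats:

  • Hst = 1 - Hs / He
  • Gst = 1 - Hs / Httilde
  • Gste = 1 - Hse / Hte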
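
As a rough illustration, the ratios for the diploid and haploid cases can be evaluated as in the sketch below. The FStats struct and both function names are hypothetical and not part of egglib; the components a, b, c, n and d are assumed to have been obtained from the SiteDiversity accessors of the same name.

```cpp
#include <cassert>

// Hypothetical container for the three ratios; not an egglib type.
struct FStats {
    double Fit;
    double Fst;
    double Fis;
};

// Combine the components a, b and c using the formulas given above.
// The caller must ensure that the denominators are non-zero.
inline FStats fstats_from_components(double a, double b, double c) {
    const double total = a + b + c;
    assert(total != 0.0 && (b + c) != 0.0);
    FStats f;
    f.Fit = 1.0 - c / total;
    f.Fst = a / total;
    f.Fis = 1.0 - c / (b + c);
    return f;
}

// Haploid case: Fst = n/d.
inline double fst_haplo(double n, double d) {
    assert(d != 0.0);
    return n / d;
}
```

When averaging over sites, a usual approach is to sum the components first and take the ratio of the sums, which is presumably why the components, rather than the ratios, are exposed; this is an interpretation, not a statement from the reference.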

Requires: stats()

Public Functions

egglib::SiteDiversity::SiteDiversity()

Constructor.

virtual egglib::SiteDiversity::~SiteDiversity()

Destructor.

double egglib::SiteDiversity::a()
const

Computed by fstats_diplo()

double egglib::SiteDiversity::a0()
const

Computed by fstats_hier()

double egglib::SiteDiversity::Aglob()
const

Number of alleles (including outgroup-specific alleles) (stats)

double egglib::SiteDiversity::Aing()
const

Number of alleles excluding outgroup-specific alleles (stats)

double egglib::SiteDiversity::Aout()
const

Number of different alleles in the outgroup (stats)

unsigned int egglib::SiteDiversity::average()

Compute the average of all stats (except those per pop)

double egglib::SiteDiversity::b()
const

Computed by fstats_diplo()

double egglib::SiteDiversity::b1()
const

Computed by fstats_hier()

double egglib::SiteDiversity::b2()
const

Computed by fstats_hier()

double egglib::SiteDiversity::c()
const

Computed by fstats_diplo()

double egglib::SiteDiversity::c0()
const

Computed by fstats_hier()

double egglib::SiteDiversity::d()
const

Computed by fstats_haplo()

double egglib::SiteDiversity::D()
const

Computed by hstats()

double egglib::SiteDiversity::derived(unsigned int)
const

Number of derived alleles (stats+outgroup)

Derived allele frequency (stats+outgroup)

unsigned int egglib::SiteDiversity::flag()
const

Get flag value.

int egglib::SiteDiversity::global_allele(unsigned int)
const

Get one of the global alleles.

double egglib::SiteDiversity::Gst()
const

Computed by hstats()

double egglib::SiteDiversity::Gste()
const

Computed by hstats()

double egglib::SiteDiversity::He()
const

Unbiased heterozygosity (averaged if relevant) (stats)

double egglib::SiteDiversity::He_pop(unsigned int pop)
const

Unbiased heterozygosity for a population (stats)
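
For orientation, a common unbiased estimator of heterozygosity is n/(n-1) * (1 - sum of squared allele frequencies), where n is the number of samples. The sketch below implements that textbook estimator from allele counts. That egglib computes He exactly this way is an assumption, and the function name unbiased_he is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Unbiased heterozygosity from allele counts: n/(n-1) * (1 - sum p_i^2).
// Requires at least two samples, in line with the ">= 2 samples" condition.
inline double unbiased_he(const std::vector<unsigned int> &counts) {
    unsigned int n = 0;
    for (std::size_t i = 0; i < counts.size(); ++i) n += counts[i];
    assert(n >= 2);
    double sum_sq = 0.0;
    for (std::size_t i = 0; i < counts.size(); ++i) {
        const double p = static_cast<double>(counts[i]) / n;
        sum_sq += p * p;
    }
    return (1.0 - sum_sq) * n / (n - 1.0);
}
```

The n/(n-1) correction is undefined for a single sample, which is consistent with statistics being reported only for populations with at least two samples.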

double egglib::SiteDiversity::Hi()
const

Average number of differences between individuals (stats)

double egglib::SiteDiversity::Ho()
const

Frequency of heterozygotes (stats)

double egglib::SiteDiversity::Hst()
const

Computed by hstats()

unsigned int egglib::SiteDiversity::k()
const

Number of populations (stats)

double egglib::SiteDiversity::MAF()
const

Frequency of second most frequent allele.

double egglib::SiteDiversity::MAF_pop(unsigned int)
const

MAF for a population.
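
The definition of MAF given here (relative frequency of the second most frequent allele) can be sketched from a vector of allele counts as follows. The function name is hypothetical, and the handling of ties and of monomorphic sites (result 0) is an assumption of this sketch, not documented egglib behaviour.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Relative frequency of the second most frequent allele.
// With tied top counts, the second allele has the same count as the first.
inline double second_allele_frequency(const std::vector<unsigned int> &counts) {
    unsigned int total = 0, first = 0, second = 0;
    for (std::size_t i = 0; i < counts.size(); ++i) {
        const unsigned int c = counts[i];
        total += c;
        if (c > first) {
            second = first;  // previous maximum becomes the runner-up
            first = c;
        } else if (c > second) {
            second = c;
        }
    }
    assert(total > 0);
    return static_cast<double>(second) / total;
}
```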

double egglib::SiteDiversity::n()
const

Computed by fstats_haplo()

unsigned int egglib::SiteDiversity::nclu_eff()
const

Number of clusters with >= 1 pop with >= 1 indiv (fstats_hier)

unsigned int egglib::SiteDiversity::npop_eff1()
const

Number of populations with >= 2 samples (stats)

unsigned int egglib::SiteDiversity::npop_eff2()
const

Number of populations with >= 1 indiv (fstats_diplo + fstats_hier)

unsigned int egglib::SiteDiversity::npop_eff3()
const

Number of populations with >= 1 sample (fstats_haplo)

double egglib::SiteDiversity::ns()
const

Number of analyzed samples (stats)

unsigned int egglib::SiteDiversity::nsites1()
const

For average ns.

unsigned int egglib::SiteDiversity::nsites10()
const

For average Hst.

unsigned int egglib::SiteDiversity::nsites11()
const

For average Gst.

unsigned int egglib::SiteDiversity::nsites12()
const

For average Gste.

unsigned int egglib::SiteDiversity::nsites2()
const

For average Aglob, Aing, Atot, Stot, pairdiff, He, thetaIAM, and thetaSMM.

unsigned int egglib::SiteDiversity::nsites3()
const

For average thetaIAM and thetaSMM.

unsigned int egglib::SiteDiversity::nsites4()
const

For average Ho and Hi.

unsigned int egglib::SiteDiversity::nsites5()
const

For average derived and Sd.

unsigned int egglib::SiteDiversity::nsites6()
const

For average n and d.

unsigned int egglib::SiteDiversity::nsites7()
const

For average a, b, and c.

unsigned int egglib::SiteDiversity::nsites8()
const

For average c0, b1, b2, a0.

unsigned int egglib::SiteDiversity::nsites9()
const

For average D.

bool egglib::SiteDiversity::orientable()
const

True if the site is orientable (stats)

double egglib::SiteDiversity::pairdiff()
const

Average number of pairwise differences (stats)

double egglib::SiteDiversity::pairdiff_inter(unsigned int pop1, unsigned int pop2)
const

Average number of differences between a pair of populations (stats)

unsigned int egglib::SiteDiversity::process(const FreqHolder &frq)

Compute toggled statistics.

double egglib::SiteDiversity::R()
const

Allelic richness (stats)

void egglib::SiteDiversity::reset()

Reset stats sums to 0 (keep toggled flags)

double egglib::SiteDiversity::S()
const

Number of alleles at frequency one (singletons) (stats)

double egglib::SiteDiversity::Sd()
const

Number of derived singletons (stats) (requires outgroup)

double egglib::SiteDiversity::thetaIAM()
const

Requires stats()

double egglib::SiteDiversity::thetaSMM()
const

Requires stats()

void egglib::SiteDiversity::toggle_fstats_diplo()

Toggle F-statistics for diploid data (a, b, and c).

void egglib::SiteDiversity::toggle_fstats_haplo()

Toggle F-statistics for haploid data (n and d).

void egglib::SiteDiversity::toggle_fstats_hier()

Toggle hierarchical F-statistics (a0, b1, b2, and c0).

void egglib::SiteDiversity::toggle_hstats()

Toggle H-statistics.

void egglib::SiteDiversity::toggle_off()

Set all flags to off.

CodingSite

class egglib::CodingSite

Holds data for a coding site (as a triplet of sites)

This class assists the detection of polymorphism at codon sites, although diversity analyses themselves have to be performed using SiteDiversity. The class performs analyses through the process() method, which automatically resets all previously stored data.

Header: <egglib-cpp/CodingSite.hpp>

Public Functions

egglib::CodingSite::CodingSite()

Constructor.

egglib::CodingSite::~CodingSite()

Destructor.

const SiteHolder &egglib::CodingSite::aminoacids()
const

Get access to amino acid data.

Requires that a codon site has been analyzed using process().

This instance contains integer data representing amino acids (‘*’ for stop codons). The GeneticCode instance passed to process() determines the translation.

const SiteHolder &egglib::CodingSite::codons()
const

Get access to merged codon data.

Requires that a codon site has been analyzed using process().

This instance contains codon alleles. See GeneticCode for information about encoding of codons.

bool egglib::CodingSite::mutated(unsigned int codon_allele1, unsigned int codon_allele2, unsigned int pos)
const

Check if two given codon alleles differ at a given position.

Tells, for two given codon alleles (see alleles()), if they differ (at least) at the specified position. Possible values of pos are 0, 1 or 2.

Requires that a codon site has been analyzed using process().

The indexes must both be < nall() but they may be passed in any order.

unsigned int egglib::CodingSite::ndiff(unsigned int codon_allele1, unsigned int codon_allele2)
const

Number of nucleotide differences between two given codon alleles.

Gives, for two given codon alleles (see alleles()), the number of nucleotide differences between them. Possible values are 1, 2 or 3.

Requires that a codon site has been analyzed using process().

The indexes must both be < nall() but they may be passed in any order.
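
The comparisons performed by mutated() and ndiff() can be pictured with the standalone sketch below. It works on explicit base triplets rather than on allele indexes, and the Codon struct and both function names are hypothetical; it is not the egglib implementation.

```cpp
#include <cassert>

// Hypothetical representation of a codon allele as three bases.
struct Codon {
    char base[3];
};

// True if the two codons differ at position pos (0, 1 or 2).
inline bool codons_differ_at(const Codon &c1, const Codon &c2, unsigned int pos) {
    assert(pos < 3);
    return c1.base[pos] != c2.base[pos];
}

// Number of positions at which the two codons differ (order of arguments is irrelevant).
inline unsigned int count_differences(const Codon &c1, const Codon &c2) {
    unsigned int n = 0;
    for (unsigned int pos = 0; pos < 3; ++pos) {
        if (codons_differ_at(c1, c2, pos)) ++n;
    }
    return n;
}
```

Two distinct codon alleles always differ at one position at least, which is why 0 is not listed among the possible values of ndiff().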

unsigned int egglib::CodingSite::ni()
const

Number of ingroup indiv.

unsigned int egglib::CodingSite::no()
const

Number of outgroup indiv.

bool egglib::CodingSite::NS(unsigned int codon_allele1, unsigned int codon_allele2)
const

True if the two given codon alleles encode different aminoacids.

Requires that a codon site has been analyzed using process().

The indexes must both be < nall() but they may be passed in any order.

unsigned int egglib::CodingSite::nseff()
const

Number of analyzed samples.

Requires that a codon site has been analyzed using process().

Value bound by 0 and ns(). Depends on the number of missing data (and stop codons, if skipstop was set to true).

unsigned int egglib::CodingSite::nseffo()
const

Analyzed samples for outgroup.

double egglib::CodingSite::NSsites()
const

Estimated number of nonsynonymous sites.

Requires that a codon site has been analyzed using process().

unsigned int egglib::CodingSite::nstop()
const

Number of stop codons met during processing of codon site.

Requires that a codon site has been analyzed using process().

Not affected by the skipstop option.

unsigned int egglib::CodingSite::pl()
const

Ploidy.

bool egglib::CodingSite::process(const SiteHolder &site1, const SiteHolder &site2, const SiteHolder &site3, const GeneticCode &code, bool skipstop, unsigned int max_missing)

Analyzes a codon site.

The three codon positions must be loaded as SiteHolder instances containing nucleotides encoded as integer values. All values other than A, C, G and T (case-independent) are treated as missing data. The three sites must have the same number of samples and the same number of populations (with matching assignment of samples to populations). Upon processing, the class generates and holds a SiteHolder instance (available as codons()) containing the data from the three sites merged, and another (available as aminoacids()) with the same data translated.

Return
A boolean indicating whether analysis was completed. If False, data contained in the object should not be used since stored objects will not have been filled. Even ni() and nieff() will be invalid.
Note
It is not allowed to use egglib::UNKNOWN for any of A, C, G and T argument.
Parameters
  • site1 -

    first nucleotide position of the codon.

  • site2 -

    second nucleotide position of the codon.

  • site3 -

    third nucleotide position of the codon.

  • code -

    GeneticCode instance representing the code to be used for treating this codon.

  • skipstop -

    if true, stop codons are treated as missing data and skipped. If set to true, potential mutations to stop codons are not taken into account when estimating the number of non-synonymous sites. Warning (this may be counter-intuitive). It actually assumes that stop codons are not biologically plausible and considers them as missing data. On the other hand, if skipstop is false, it takes stop codons as if they were valid amino acids.

  • max_missing -

    maximum number of missing data to allow (including stop codons if skipstop is true).
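
The merging step described for process() can be illustrated with a small standalone sketch. It does not use egglib: base_index, merge_codon_site and MISSING_CODON are hypothetical names, nucleotides are given as characters rather than integer codes, and the codon numbering 16*b1 + 4*b2 + b3 is chosen only for illustration. Translation and the skipstop option are left out.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical marker for a missing codon (not an egglib constant).
const int MISSING_CODON = -1;

// Map a nucleotide to 0..3 (case-independent); anything else is missing.
int base_index(char c) {
    switch (std::toupper(static_cast<unsigned char>(c))) {
        case 'A': return 0;
        case 'C': return 1;
        case 'G': return 2;
        case 'T': return 3;
        default:  return -1;
    }
}

// Merge three aligned codon positions (one character per sample) into
// codon codes 0..63, or MISSING_CODON if any position is missing.
// Returns false if more than max_missing samples are missing.
bool merge_codon_site(const std::string& s1, const std::string& s2,
                      const std::string& s3, unsigned int max_missing,
                      std::vector<int>& codons) {
    assert(s1.size() == s2.size() && s2.size() == s3.size());
    codons.clear();
    unsigned int missing = 0;
    for (std::size_t i = 0; i < s1.size(); ++i) {
        const int a = base_index(s1[i]);
        const int b = base_index(s2[i]);
        const int c = base_index(s3[i]);
        if (a < 0 || b < 0 || c < 0) {
            codons.push_back(MISSING_CODON);
            ++missing;
        } else {
            codons.push_back(16 * a + 4 * b + c);
        }
    }
    return missing <= max_missing;
}
```

With the sites "Aa", "TN" and "Gg", the first sample yields the codon ATG and the second is missing, so the call succeeds only if max_missing is at least 1.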

double egglib::CodingSite::Ssites()
const

Estimated number of synonymous sites.

Requires that a codon site has been analyzed using process().

AlleleStatus

class

Classify alleles and site for frequencies with several populations.

Statistics:

  • Sp: population-specific alleles
  • Spd: population-specific derived alleles
  • ShP: number of alleles segregating in at least one pair of populations
  • ShA: number of alleles at non-zero frequency in at least one pair of populations
  • FxA: number of alleles fixed in at least one population
  • FxD: number of fixed differences (two different alleles fixed in a pair of populations)

The user must ensure that all passed sites are polymorphic. The user should also probably exclude populations with low sample sizes if they are interested in the number of fixed alleles (populations with no samples are automatically skipped).

The statistics are computed for each site. Sums over multiple sites are available as Sp_T and Sp_T1 (and similarly for the other statistics). T1 is such that each site is counted only once for any statistic.
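
As an illustration of two of these counts for a single site, the sketch below classifies alleles from a table of per-population allele counts. It is independent of egglib: SiteClassification and classify_site are hypothetical names, and the reading of Sp (allele present in exactly one population) and FxA (allele making up the whole sample of at least one population) is an assumption based on the descriptions above.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical result type: counts for one site.
struct SiteClassification {
    unsigned int sp;   // alleles present in exactly one population
    unsigned int fxa;  // alleles fixed in at least one population
};

// counts[p][a] is the number of copies of allele a in population p.
// All populations must list the same number of alleles.
SiteClassification classify_site(
        const std::vector<std::vector<unsigned int> >& counts) {
    SiteClassification res = {0, 0};
    if (counts.empty()) return res;
    const std::size_t num_alleles = counts[0].size();
    for (std::size_t a = 0; a < num_alleles; ++a) {
        unsigned int pops_with_allele = 0;
        bool fixed_somewhere = false;
        for (std::size_t p = 0; p < counts.size(); ++p) {
            assert(counts[p].size() == num_alleles);
            unsigned int n = 0;  // sample size of population p
            for (std::size_t b = 0; b < num_alleles; ++b) n += counts[p][b];
            if (n == 0) continue;  // populations with no samples are skipped
            if (counts[p][a] > 0) ++pops_with_allele;
            if (counts[p][a] == n) fixed_somewhere = true;
        }
        if (pops_with_allele == 1) ++res.sp;
        if (fixed_somewhere) ++res.fxa;
    }
    return res;
}
```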

Header: <egglib-cpp/AlleleStatus.hpp>

Public Functions

egglib::AlleleStatus::AlleleStatus()

Constructor.

egglib::AlleleStatus::~AlleleStatus()

Destructor.

unsigned int egglib::AlleleStatus::FxA()
const

Fixed alleles.

unsigned int egglib::AlleleStatus::FxA_T1()
const

Fixed alleles.

unsigned int egglib::AlleleStatus::FxD()
const

Fixed differences.

unsigned int egglib::AlleleStatus::FxD_T1()
const

Fixed differences.

unsigned int egglib::AlleleStatus::nsites()
const

Number of sites with valid data.

unsigned int egglib::AlleleStatus::nsites_o()
const

Number of orientable sites with valid data.

void egglib::AlleleStatus::process(const FreqHolder &freqs)

Analyze a site.

void egglib::AlleleStatus::reset()

Reset sums (but keep toggle flag)

unsigned int egglib::AlleleStatus::ShA()
const

Shared alleles.

unsigned int egglib::AlleleStatus::ShA_T1()
const

Shared alleles.

unsigned int egglib::AlleleStatus::ShP()
const

Shared polymorphisms.

unsigned int egglib::AlleleStatus::ShP_T1()
const

Shared polymorphisms.

unsigned int egglib::AlleleStatus::Sp()
const

Pop-specific alleles.

unsigned int egglib::AlleleStatus::Sp_T1()
const

Pop-specific alleles.

unsigned int egglib::AlleleStatus::Spd()
const

Pop-specific derived alleles.

unsigned int egglib::AlleleStatus::Spd_T1()
const

Pop-specific derived alleles.

void egglib::AlleleStatus::total()

Copy all sums to direct accessors.

ComputeV

class

Compute allele size variance.

Public Functions

egglib::ComputeV::ComputeV()

Constructor.

egglib::ComputeV::~ComputeV()

Destructor.

double egglib::ComputeV::average()
const

Get average V (UNDEF if no computed values)

double egglib::ComputeV::compute(const FreqSet &frq)

Compute V (UNDEF if not computable)
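
A minimal sketch of an allele size variance, assuming V is the sample variance (n-1 denominator) of allele sizes weighted by allele counts. The function name is hypothetical, NaN stands in for UNDEF, and the exact estimator used by ComputeV is not stated in this documentation.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// sizes[i] is the size of allele i and counts[i] its number of copies.
// Returns NaN (standing in for UNDEF) with fewer than two copies in total.
double allele_size_variance(const std::vector<int>& sizes,
                            const std::vector<unsigned int>& counts) {
    assert(sizes.size() == counts.size());
    unsigned int n = 0;
    double sum = 0.0;
    for (std::size_t i = 0; i < sizes.size(); ++i) {
        n += counts[i];
        sum += counts[i] * static_cast<double>(sizes[i]);
    }
    if (n < 2) return std::numeric_limits<double>::quiet_NaN();
    const double mean = sum / n;
    double ss = 0.0;  // weighted sum of squared deviations from the mean
    for (std::size_t i = 0; i < sizes.size(); ++i) {
        const double d = sizes[i] - mean;
        ss += counts[i] * d * d;
    }
    return ss / (n - 1);
}
```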

unsigned int egglib::ComputeV::num_sites()
const

Number of sites with computed V.

void egglib::ComputeV::reset()

Reset.

void egglib::ComputeV::set_allele(unsigned int, int a)

Set an allele value.

void egglib::ComputeV::setup_alleles(unsigned int)

Specify number of alleles.

void egglib::ComputeV::setup_alleles_from_site(const SiteHolder &site)

Get alleles directly from site.

Generic classes

Diversity1

class

Compute population summary statistics from allele frequencies at several sites.

Diversity1 instances cannot be copied. This class is designed to allow reuse of objects without unnecessary memory reallocation.

This class computes statistics that do not require access to a full Site instance and for which only frequencies are needed. The frequencies of all sites to be analyzed must be loaded.

Statistics:

code     | requirement                 | flag  | toggle flag
=========|=============================|=======|============
lt       | -                           | -     | -
ls       | -                           | -     | -
nsmax    | ls>0                        | 1     | -
S        | ls>0                        | 1     | -
Ss       | ls>0                        | 1     | -
eta      | ls>0                        | 1     | -
Pi       | ls>0                        | 1     | -
lso      | -                           | 4     | ori_site
nsmaxo   | lso>0                       | 8     | ori_site
So       | lso>0                       | 8     | ori_site
Sso      | lso>0                       | 8     | ori_site
etao     | lso>0                       | 8     | ori_site
lM       | lso>0                       | 8     | ori_site
pM       | lM>0                        | 16    | ori_site
nseffo   | lso>0                       | 32    | ori_div
thetaH   | lso>0                       | 32    | ori_div
thetaL   | lso>0                       | 32    | ori_div
Hns      | lso>0                       | 32    | ori_div
Hsd      | So>0 & nseffo>=3 & varZ>0   | 1024  | ori_div
E        | So>0 & nseffo>=3 & varE>0   | 2048  | ori_div
Dfl      | So>0 & nseffo>=3 & varDfl>0 | 4096  | ori_div
F        | So>0 & nseffo>=3 & varF>0   | 8192  | ori_div
nseff    | ls>0                        | 128   | basic
thetaW   | ls>0                        | 128   | basic
Dxy      | ls>0 npop=2                 | 16384 | basic
Da       | ls>0 npop=2                 | 16384 | basic
Fstar    | S>0 & ns>2                  | 256   | basic
D        | S>0 & ns>3                  | 512   | basic
Deta     | S>0 & ns>3                  | 512   | basic
Dstar    | S>0 & ns>3                  | 512   | basic
sites    | i<S                         | -     | site_lists
sites_o  | i<So                        | -     | site_lists
singl    | i<Ss                        | -     | site_lists
singl_o  | i<Sso                       | -     | site_lists

Header: <egglib-cpp/Diversity.hpp>

Public Functions

egglib::Diversity1::Diversity1()

Constructor.

egglib::Diversity1::~Diversity1()

Destructor.

unsigned int egglib::Diversity1::compute()

Compute statistics, return flag but does not reset.

double egglib::Diversity1::D()
const

Tajima’s D.
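
For reference, the standard definition of Tajima's D can be sketched as follows. This is a standalone illustration of the published formula, not egglib code: harmonic_a1 and tajima_d are hypothetical names, and how Diversity1 handles missing data or varying sample sizes is not covered.

```cpp
#include <cassert>
#include <cmath>

// Sum of 1/i for i = 1 .. n-1 (a1 in Tajima's notation).
double harmonic_a1(unsigned int n) {
    double a1 = 0.0;
    for (unsigned int i = 1; i < n; ++i) a1 += 1.0 / i;
    return a1;
}

// Tajima's D from the sample size n, the number of polymorphic sites S
// and the average number of pairwise differences pi (per gene).
double tajima_d(unsigned int n, unsigned int S, double pi) {
    // D requires S>0 and ns>3 in the statistics table of this class.
    assert(n > 3 && S > 0);
    const double a1 = harmonic_a1(n);
    double a2 = 0.0;
    for (unsigned int i = 1; i < n; ++i) a2 += 1.0 / (static_cast<double>(i) * i);
    const double nd = n;
    const double b1 = (nd + 1.0) / (3.0 * (nd - 1.0));
    const double b2 = 2.0 * (nd * nd + nd + 3.0) / (9.0 * nd * (nd - 1.0));
    const double c1 = b1 - 1.0 / a1;
    const double c2 = b2 - (nd + 2.0) / (a1 * nd) + a2 / (a1 * a1);
    const double e1 = c1 / a1;
    const double e2 = c2 / (a1 * a1 + a2);
    const double Sd = S;
    // Difference between the two theta estimators, scaled by its
    // estimated standard deviation.
    return (pi - Sd / a1) / std::sqrt(e1 * Sd + e2 * Sd * (Sd - 1.0));
}
```

D is zero when pi equals S/a1 (the estimator based on S), negative when pi is smaller and positive when it is larger.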

double egglib::Diversity1::Da()
const

Net pairwise distance for 1st pair.

double egglib::Diversity1::Deta()
const

Tajima’s D using eta instead of S.

double egglib::Diversity1::Dfl()
const

Fu and Li’s D.

double egglib::Diversity1::Dstar()
const

Fu and Li’s D*.

double egglib::Diversity1::Dxy()
const

Pairwise distance for 1st pair.

double egglib::Diversity1::E()
const

Zeng et al.’s E.

unsigned int egglib::Diversity1::eta()
const

eta

unsigned int egglib::Diversity1::etao()
const

eta for orientable sites

double egglib::Diversity1::F()
const

Fu and Li’s F.

double egglib::Diversity1::Fstar()
const

Fu and Li’s F*.

double egglib::Diversity1::Hns()
const

Unstandardized Fay and Wu’s H.

double egglib::Diversity1::Hsd()
const

Fay and Wu’s H standardized by Zeng et al.

void egglib::Diversity1::load(const FreqHolder &freqs, const SiteDiversity &div, unsigned int position)

Analyze a site.

unsigned int egglib::Diversity1::ls()
const

Number of loaded sites (with >=2 valid data)

unsigned int egglib::Diversity1::lso()
const

Number of loaded orientable sites (with valid data)

unsigned int egglib::Diversity1::lt()
const

Number of loaded sites (total)

unsigned int egglib::Diversity1::nM()
const

Sites available for MFDM test.

double egglib::Diversity1::nseff()
const

Average number of used samples.

double egglib::Diversity1::nseffo()
const

Average number of used samples for orientable sites.

unsigned int egglib::Diversity1::nsingld()
const

Number of derived singletons.

unsigned int egglib::Diversity1::nsmax()
const

Largest number of used samples.

unsigned int egglib::Diversity1::nsmaxo()
const

Largest number of used samples for orientable sites.

double egglib::Diversity1::Pi()
const

Sum of He.
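
A sketch of what "sum of He" means, assuming He is the usual unbiased expected heterozygosity computed per site from allele counts. The names site_he and sum_he are hypothetical and the code does not use egglib.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Unbiased expected heterozygosity for one site, from allele counts:
// He = n/(n-1) * (1 - sum of squared relative frequencies).
double site_he(const std::vector<unsigned int>& counts) {
    unsigned int n = 0;
    for (std::size_t i = 0; i < counts.size(); ++i) n += counts[i];
    assert(n >= 2);  // at least two valid samples are needed
    double sum_sq = 0.0;
    for (std::size_t i = 0; i < counts.size(); ++i) {
        const double p = static_cast<double>(counts[i]) / n;
        sum_sq += p * p;
    }
    return (1.0 - sum_sq) * n / (n - 1.0);
}

// Sum of He over all sites (one vector of allele counts per site).
double sum_he(const std::vector<std::vector<unsigned int> >& sites) {
    double total = 0.0;
    for (std::size_t s = 0; s < sites.size(); ++s) total += site_he(sites[s]);
    return total;
}
```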

double egglib::Diversity1::pM()
const

Li’s MFDM test p value (large positive value by default)

void egglib::Diversity1::reset_stats()

Reset counters to 0.

unsigned int egglib::Diversity1::S()
const

Number of polymorphic sites.

void egglib::Diversity1::set_option_multiple(bool b)

Set multiple option (default: False)

void egglib::Diversity1::set_option_ns_set(unsigned int)

Set maximum number of samples, for H and co. (default: UNKNOWN)

unsigned int egglib::Diversity1::singl(unsigned int)
const

Get position of site with a singleton.

unsigned int egglib::Diversity1::singl_o(unsigned int)
const

Get position of site with an orientable singleton.

unsigned int egglib::Diversity1::site(unsigned int)
const

Get position of polymorphic site.

unsigned int egglib::Diversity1::site_o(unsigned int)
const

Get position of polymorphic orientable site.

unsigned int egglib::Diversity1::So()
const

Number of polymorphic orientable sites.

unsigned int egglib::Diversity1::Ss()
const

Number of polymorphic sites with =1 singleton.

unsigned int egglib::Diversity1::Sso()
const

Number of polymorphic orientable sites with =1 singleton.

double egglib::Diversity1::thetaH()
const

ThetaH estimator.

double egglib::Diversity1::thetaL()
const

ThetaL estimator.

double egglib::Diversity1::thetaPi()
const

thetaPi estimator (using orientable sites)

double egglib::Diversity1::thetaW()
const

Theta estimator based on S.

void egglib::Diversity1::toggle_basic()

Activate basic per-gene.

void egglib::Diversity1::toggle_off()

Cancel all flags.

void egglib::Diversity1::toggle_ori_div()

Activate per-gene oriented.

void egglib::Diversity1::toggle_ori_site()

Activate per-site oriented.

void egglib::Diversity1::toggle_site_lists()

Activate lists of site positions.

Diversity2

class

Compute population summary statistics from an array of sites.

Diversity2 instances cannot be copied. This class is designed to allow reuse of objects without unnecessary memory reallocation.

This class computes statistics that require access to the full Site instance (and therefore to the individual allele of each individual). Sites with missing data are ignored when computing Wall’s statistics.

Meaning of flag:

  • flag&1 an error occurred (+ one of 2, 4, 8, 16, 32)
  • flag&2 error: less than 2 samples (including missing)
  • flag&4 error: inconsistent number of samples
  • flag&8 error: inconsistent ploidy
  • flag&16 error: inconsistent frequency holder (sample size not checked)
  • flag&32 error: provided SiteDiversity does not have proper data
  • flag&64 at least 1 polymorphic site with at least 2 non-missing samples
  • flag&128 at least 1 polymorphic, orientable site with at least 2 non-missing samples
  • flag&256 computed R2, R3, R4, and Ch
  • flag&512 computed R2E, R3E, R4E, and ChE
  • flag&1024 computed B and Q (at least 2 sites with no missing data)
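
A flag value can be decoded with bitwise tests, as in this sketch. The bit values are copied from the list above; the helper names are hypothetical and not part of egglib, and the flag is assumed to be the value returned by compute().

```cpp
#include <cassert>
#include <string>
#include <vector>

// Return one message per error bit set in the flag (empty if no error).
std::vector<std::string> describe_errors(unsigned int flag) {
    std::vector<std::string> messages;
    if ((flag & 1u) == 0) return messages;  // no error occurred
    // The error bit is documented to come with one of 2, 4, 8, 16, 32.
    assert((flag & 62u) != 0);
    if (flag & 2u)  messages.push_back("less than 2 samples (including missing)");
    if (flag & 4u)  messages.push_back("inconsistent number of samples");
    if (flag & 8u)  messages.push_back("inconsistent ploidy");
    if (flag & 16u) messages.push_back("inconsistent frequency holder");
    if (flag & 32u) messages.push_back("provided SiteDiversity does not have proper data");
    return messages;
}

// True if R2, R3, R4 and Ch were computed.
bool has_singleton_stats(unsigned int flag) { return (flag & 256u) != 0; }

// True if B and Q were computed.
bool has_partition_stats(unsigned int flag) { return (flag & 1024u) != 0; }
```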

Header: <egglib-cpp/Diversity.hpp>

Public Functions

egglib::Diversity2::Diversity2()

Constructor.

egglib::Diversity2::~Diversity2()

Destructor.

double egglib::Diversity2::B()
const

Wall’s statistic.

double egglib::Diversity2::Ch()
const

Ramos-Onsins and Rozas’s statistic.

double egglib::Diversity2::ChE()
const

Ramos-Onsins and Rozas’s statistic.

unsigned int egglib::Diversity2::compute()

Compute singletons and/or partitions stats, return flag.

double egglib::Diversity2::k()
const

Average number of differences.

double egglib::Diversity2::ko()
const

Average number of differences at orientable sites.

void egglib::Diversity2::load(const SiteHolder &site, const SiteDiversity &div, const FreqHolder &frq)

Load site with its matching structure (requires basic stats)

unsigned int egglib::Diversity2::num_clear()
const

Number of sites with 0 missing data (Wall stats)

unsigned int egglib::Diversity2::num_orientable()
const

Number of orientable sites.

unsigned int egglib::Diversity2::num_sequences()
const

Number of sequences.

unsigned int egglib::Diversity2::num_sites()
const

Number of loaded sites (only polymorphic)

double egglib::Diversity2::Q()
const

Wall’s statistic.

double egglib::Diversity2::R2()
const

Ramos-Onsins and Rozas’s statistic.

double egglib::Diversity2::R2E()
const

Ramos-Onsins and Rozas’s statistic.

double egglib::Diversity2::R3()
const

Ramos-Onsins and Rozas’s statistic.

double egglib::Diversity2::R3E()
const

Ramos-Onsins and Rozas’s statistic.

double egglib::Diversity2::R4()
const

Ramos-Onsins and Rozas’s statistic.

double egglib::Diversity2::R4E()
const

Ramos-Onsins and Rozas’s statistic.

void egglib::Diversity2::reset()

Restore all variables to the default state (except toggled flags)

void egglib::Diversity2::set_option_multiple(bool b)

Toggle option for multiple alleles.

void egglib::Diversity2::toggle_off()

Cancel flags.

void egglib::Diversity2::toggle_partitions()

Activate computation of B and Q stats (must be set before load())

void egglib::Diversity2::toggle_singletons()

Activate computation of Rx/Ch RxE/ChE stats.

Other statistics

Haplotypes

class

Identifies haplotypes from a set of sites.

How to use this class:

  • 1) Set up (or reset) the instance with setup(), optionally passing a structure.
  • 2) Load all sites with load(site). Haplotypes are computed and all samples with at least one missing data are marked as missing. While loading sites, you can monitor the values of:
  • 3) Call cp_haplotypes() to finalize haplotype processing. After that you may use:
  • 4) If you wish (but you don’t have to), you can try and guess the haplotype of samples with missing data. For this you need first to call prepare_impute(). After that, you may use:
  • 5) If you impute, load all sites again with resolve(). This may change the value of:
    • map_sample()
  • 6) If you impute, you are required to call impute() (and otherwise you can’t). If you do, the following values will be updated [note: maybe impute and cp_haplotypes do exactly the same]
  • 7) Whether or not you performed 4-6, you can now call this to make a site:
  • 8) Whether or not you performed 4-6 and/or 7, you can now call cp_dist() that will let you access to the distance matrix:
  • 9) To compute stats, call cp_stats() (you must have set a structure and performed step 8). The function cp_stats() returns a flag: 0 (no stats computed), 1 (Fst computed), or 2 (both Fst and Kst computed). Then you may use:
    • Fst()
    • Kst()
  • 10) You can also call this method (requires a structure and 8 but not 9, and valid if nstot is >1):

Header: <egglib-cpp/Haplotypes.hpp>
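
The core of steps 1 to 3 (grouping samples into haplotypes and marking samples with missing data) can be sketched without egglib as follows. The names identify_haplotypes, MISSING and NO_HAPLOTYPE are hypothetical, alleles are plain integers, and imputation (steps 4 to 6) is not covered.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <vector>

const int MISSING = -1;  // hypothetical missing-data code
const unsigned int NO_HAPLOTYPE = static_cast<unsigned int>(-1);

// sites[s][i] is the allele of sample i at site s.
// Returns, for each sample, the index of its haplotype (numbered in order
// of first appearance), or NO_HAPLOTYPE if the sample has missing data.
std::vector<unsigned int> identify_haplotypes(
        const std::vector<std::vector<int> >& sites,
        unsigned int& num_haplotypes) {
    num_haplotypes = 0;
    std::vector<unsigned int> mapping;
    if (sites.empty()) return mapping;
    const std::size_t num_samples = sites[0].size();
    std::map<std::vector<int>, unsigned int> known;
    for (std::size_t i = 0; i < num_samples; ++i) {
        std::vector<int> hap;
        bool missing = false;
        for (std::size_t s = 0; s < sites.size(); ++s) {
            assert(sites[s].size() == num_samples);
            if (sites[s][i] == MISSING) missing = true;
            hap.push_back(sites[s][i]);
        }
        if (missing) {
            mapping.push_back(NO_HAPLOTYPE);
            continue;
        }
        std::map<std::vector<int>, unsigned int>::iterator it = known.find(hap);
        if (it == known.end()) {
            known[hap] = num_haplotypes;
            mapping.push_back(num_haplotypes);
            ++num_haplotypes;
        } else {
            mapping.push_back(it->second);
        }
    }
    return mapping;
}
```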

Public Functions

egglib::Haplotypes::Haplotypes()

Constructor.

egglib::Haplotypes::~Haplotypes()

Destructor.

void egglib::Haplotypes::cp_dist()

Compute distance matrix.

void egglib::Haplotypes::cp_haplotypes()

Finalize haplotype estimation.

unsigned int egglib::Haplotypes::cp_stats()

Compute differentiation stats.

unsigned int egglib::Haplotypes::dist(unsigned int, unsigned int)
const

Distance matrix entry (0<=j<i<_nt_hapl)

unsigned int egglib::Haplotypes::freq_i(unsigned int)
const

Frequency of haplotype in ingroup.

unsigned int egglib::Haplotypes::freq_o(unsigned int)
const

Frequency of haplotype in outgroup.

double egglib::Haplotypes::Fst()
const

Fst value.

void egglib::Haplotypes::get_site(SiteHolder &site)

Get haplotypic data as a site (recycle passed object)

unsigned int egglib::Haplotypes::hapl(unsigned int, unsigned int)
const

Get site j of haplotype i.

void egglib::Haplotypes::impute()

Try to guess haplotype of samples with missing data.

double egglib::Haplotypes::Kst()
const

Kst value.

void egglib::Haplotypes::load(const SiteHolder &site)

Process a site (implies call to setup())

unsigned int egglib::Haplotypes::map_sample(unsigned int)
const

Get haplotype index of a sample.

unsigned int egglib::Haplotypes::mis_idx(unsigned int)
const

Index of one of the samples with missing data.

unsigned int egglib::Haplotypes::n_ing()
const

Total number of ingroup samples.

unsigned int egglib::Haplotypes::n_mis()
const

Number of samples with missing data.

unsigned int egglib::Haplotypes::n_missing(unsigned int)
const

Number of missing data per sample.

unsigned int egglib::Haplotypes::n_otg()
const

Total number of outgroup samples.

unsigned int egglib::Haplotypes::n_pop()
const

Number of populations.

unsigned int egglib::Haplotypes::n_potential(unsigned int)
const

Number of compatible haplotypes for a sample with missing data.

unsigned int egglib::Haplotypes::n_sam()
const

Total number of samples.

unsigned int egglib::Haplotypes::n_sites()
const

Number of sites.

unsigned int egglib::Haplotypes::ne_ing()
const

Non-missing number of ingroup samples.

unsigned int egglib::Haplotypes::ne_otg()
const

Non-missing number of outgroup samples.

unsigned int egglib::Haplotypes::ne_pop()
const

Number of populations with.

unsigned int egglib::Haplotypes::ng_hapl()
const

Number of haplotypes at non-zero frequency overall.

unsigned int egglib::Haplotypes::ni_hapl()
const

Number of haplotypes at non-zero ingroup frequency.

unsigned int egglib::Haplotypes::nstot()
const

Number of samples (non-missing and not ignored by structure)

unsigned int egglib::Haplotypes::nt_hapl()
const

Total number of haplotypes (including truncated ones)

unsigned int egglib::Haplotypes::potential(unsigned int, unsigned int)
const

Index of a potential haplotype.

void egglib::Haplotypes::prepare_impute(unsigned int)

Initialize required tables for imputing.

void egglib::Haplotypes::resolve(const SiteHolder &site)

Second pass: try to resolve missing data.

void egglib::Haplotypes::setup(const StructureHolder *struc)

Setup/reset instance (with optional structure object)

double egglib::Haplotypes::Snn()
const

Snn value computed on the fly.

ParalogPi

class

Specifically designed to compute Innan 2003 statistics for a gene family.

This class computes the within- and between-paralog Pi of Innan (2003). The within-paralog Pi is the same as the standard Pi, except that it is not unbiased. The between-paralog Pi is the same as Dxy, taking the paralogs as populations, except that one pair of genes (paralogs from the same sample) is not considered.

Setup provides the two structure objects describing, respectively, the structure in paralogs and the structure in samples (two different structure objects are required because they are necessarily non-nested). It is required that the structures of interest are loaded as population levels. Cluster levels are ignored. It is required that ploidy is 1 (if you have genotypic data, skip the individual level). Required tests: ploidy of both structures and of all sites is equal to 1, and the maximum index of the paralog structure is represented in all sites (other disagreements are treated as missing data).

Public Functions

egglib::ParalogPi::ParalogPi()

Constructor (default: 0 pop)

egglib::ParalogPi::~ParalogPi()

Destructor.

void egglib::ParalogPi::load(const SiteHolder &site)

Load a site.

unsigned int egglib::ParalogPi::num_paralogs()
const

Number of copies (or: number of pops)

unsigned int egglib::ParalogPi::num_samples()
const

Number of samples (or: size of each pop)

unsigned int egglib::ParalogPi::num_sites_pair(unsigned int, unsigned int)
const

Number of analyzed sites for a pair of paralogs.

unsigned int egglib::ParalogPi::num_sites_paralog(unsigned int)
const

Number of analyzed sites for a paralog.

unsigned int egglib::ParalogPi::num_sites_tot()
const

Total number of analyzed sites.

double egglib::ParalogPi::Pib(unsigned int, unsigned int)
const

Between-paralog Pi.

double egglib::ParalogPi::Piw(unsigned int)
const

Within-paralog Pi.

void egglib::ParalogPi::reset(const StructureHolder &str_prl, const StructureHolder &str_idv)

Reset and setup structure.

Rd

class

Compute the \bar{r}_d (or rD) statistic.

Rd instances cannot be copied. The procedure is:

  • Load as many sites as needed. They are analysed on the fly. The number of samples and ploidy are expected to match (as well as the order of individuals and alleles). If there is a mismatch in ploidy and/or number of samples, the Rd value will be UNDEF.
  • Compute the Rd value (resets the instance).
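
For orientation, the sketch below computes the statistic under its usual definition (Agapow and Burt 2001): the excess of the variance of multilocus pairwise distances over the sum of per-locus variances, divided by twice the sum of sqrt(var_j * var_k) over pairs of loci. It assumes haploid data without missing data and a 0/1 distance per locus. The function names are hypothetical, NaN stands in for UNDEF, and the details of the egglib implementation may differ.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// Variance of a list of values. The choice of denominator cancels out
// in the ratio computed by rbar_d().
double variance(const std::vector<double>& x) {
    assert(!x.empty());
    double mean = 0.0;
    for (std::size_t i = 0; i < x.size(); ++i) mean += x[i];
    mean /= x.size();
    double ss = 0.0;
    for (std::size_t i = 0; i < x.size(); ++i) ss += (x[i] - mean) * (x[i] - mean);
    return ss / x.size();
}

// loci[j][i] is the allele of haploid individual i at locus j.
// Returns NaN when the statistic cannot be computed.
double rbar_d(const std::vector<std::vector<int> >& loci) {
    const double undef = std::numeric_limits<double>::quiet_NaN();
    const std::size_t num_loci = loci.size();
    if (num_loci < 2) return undef;
    const std::size_t n = loci[0].size();
    if (n < 2) return undef;
    const std::size_t num_pairs = n * (n - 1) / 2;
    std::vector<double> total(num_pairs, 0.0);  // multilocus distances
    std::vector<double> var_locus(num_loci, 0.0);
    double sum_var = 0.0;
    for (std::size_t j = 0; j < num_loci; ++j) {
        assert(loci[j].size() == n);
        std::vector<double> d;  // per-locus distances for all pairs
        for (std::size_t a = 0; a < n; ++a)
            for (std::size_t b = a + 1; b < n; ++b)
                d.push_back(loci[j][a] != loci[j][b] ? 1.0 : 0.0);
        for (std::size_t p = 0; p < num_pairs; ++p) total[p] += d[p];
        var_locus[j] = variance(d);
        sum_var += var_locus[j];
    }
    double denom = 0.0;
    for (std::size_t j = 0; j < num_loci; ++j)
        for (std::size_t k = j + 1; k < num_loci; ++k)
            denom += std::sqrt(var_locus[j] * var_locus[k]);
    denom *= 2.0;
    if (denom == 0.0) return undef;
    return (variance(total) - sum_var) / denom;
}
```

Two loci carrying identical information give a value of 1 under this definition.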

Public Functions

egglib::Rd::Rd()

Constructor.

egglib::Rd::~Rd()

Destructor.

double egglib::Rd::compute()

Compute rD (UNDEF if not computable)

void egglib::Rd::load(const SiteHolder &site)

Load a site.

unsigned int egglib::Rd::num_indiv()
const

Get number of individuals as provided to configure()

unsigned int egglib::Rd::num_loci()
const

Get number of processed loci (some loci may be skipped)

unsigned int egglib::Rd::ploidy()
const

Get ploidy as provided to configure()

void egglib::Rd::reset()

Reset to initial state.

Fu’s Fs

double egglib::Fs(unsigned int n, unsigned int K, double pi)

Compute Fu’s Fs.

This function computes Fu’s Fs statistic using haplotype statistics (that should have been computed using the Haplotypes class) and, as a theta estimator, pi provided by the Diversity1 class. The values must have been computed using the same data set.

Warning: this function is not available for values of n (number of samples) larger than MAX_N_STIRLING. K must be >= 1 and <= n.

The behaviour of the function is not defined if K < 0. The function returns UNDEF if the value cannot be computed, which can happen:

  • if n is larger than MAX_N_STIRLING;
  • if the sum of probabilities of k values >= K is too close to 0 or 1 (based on the computer’s precision);
  • if pi is 0 (no polymorphism);
  • if K > n (which is an error).

Parameters
  • n -

    number of exploited samples for determining the number of haplotypes.

  • K -

    number of haplotypes obtained with the same data.

  • pi -

    average number of pairwise differences (as theta estimator, per gene).

Header: <egglib-cpp/Fs.hpp>
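
The computation behind Fs can be sketched as follows, using the standard definition: Fs = ln(S'/(1-S')), where S' is the probability of observing at least K haplotypes in a sample of n given theta = pi, with probabilities taken from the Ewens sampling formula. This is an illustrative implementation and not the egglib code: the function names are hypothetical, NaN stands in for UNDEF, and plain double arithmetic limits it to small n.

```cpp
#include <cassert>
#include <cmath>
#include <limits>
#include <vector>

// Probability of observing exactly k haplotypes (k = 0..n) in a sample
// of size n: P(k) = |s(n,k)| theta^k / (theta (theta+1) ... (theta+n-1)).
std::vector<double> haplotype_number_probs(unsigned int n, double theta) {
    assert(n >= 1 && theta > 0.0);
    // c[k] holds the unsigned Stirling number of the first kind |s(m,k)|,
    // built row by row with |s(m+1,k)| = m |s(m,k)| + |s(m,k-1)|.
    std::vector<double> c(n + 1, 0.0);
    c[1] = 1.0;  // row m = 1
    for (unsigned int m = 1; m < n; ++m) {
        for (unsigned int k = m + 1; k >= 1; --k) {
            c[k] = m * c[k] + c[k - 1];
        }
    }
    double rising = 1.0;  // theta (theta+1) ... (theta+n-1)
    for (unsigned int i = 0; i < n; ++i) rising *= theta + i;
    std::vector<double> p(n + 1, 0.0);
    double power = 1.0;
    for (unsigned int k = 1; k <= n; ++k) {
        power *= theta;
        p[k] = c[k] * power / rising;
    }
    return p;
}

// Fu's Fs, or NaN when it cannot be computed.
double fu_fs(unsigned int n, unsigned int K, double pi) {
    const double undef = std::numeric_limits<double>::quiet_NaN();
    if (K < 1 || K > n || pi <= 0.0) return undef;
    const std::vector<double> p = haplotype_number_probs(n, pi);
    double s = 0.0;  // probability of at least K haplotypes
    for (unsigned int k = K; k <= n; ++k) s += p[k];
    if (s <= 0.0 || s >= 1.0) return undef;
    return std::log(s / (1.0 - s));
}
```

For a given n and pi, a larger K makes S' smaller and therefore Fs more negative.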

Linkage analysis

PairwiseLD

class

Analyzes linkage disequilibrium for a pair of polymorphic sites.

This class considers a single pair of polymorphic sites at a time. The first method, process(), detects alleles at both sites under consideration and determines whether the pairwise comparison is fit for analysis (based on the presence of polymorphism and on allele frequencies). Statistics are computed by compute() for a given pair of alleles. This lets the user filter out sites that have more than two alleles and, if necessary, process multiple pairs of alleles.

One should first process a pair of sites with process(). If the return value is false, one should not process data further. Otherwise, one can access data with num_alleles1(), num_alleles2(), index1(), index2(), freq1(), freq2(), freq(), and nsam(), and can also compute LD with compute() (for a given pair of alleles) and then access the LD estimates.

Public Functions

egglib::PairwiseLD::PairwiseLD()

Default constructor.

egglib::PairwiseLD::~PairwiseLD()

Destructor.

void egglib::PairwiseLD::compute(unsigned int allele1, unsigned int allele2)

Compute D, D’, r and r^2 statistics for a given pair of alleles.

The method process() must have been executed and must have returned true.

Statistics are computed only for a given pair of alleles. If there are only two alleles, all allele pairs result in consistent results. Otherwise, some multi-allele summarizing methodology has to be applied.

allele1 and allele2 are the allele indexes at the first and second site, respectively.
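
The four statistics can be sketched for one pair of alleles from simple counts, using their textbook definitions. LDStats and ld_stats are hypothetical names and the code is independent of egglib. The sign convention for D' (same sign as D) is an assumption.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

// Hypothetical container for the four statistics.
struct LDStats {
    double D;
    double Dp;   // D'
    double r;
    double rsq;  // r^2
};

// n_ab: samples carrying both focal alleles; n_a and n_b: samples carrying
// the focal allele at the first and second site; n: number of samples.
LDStats ld_stats(unsigned int n_ab, unsigned int n_a, unsigned int n_b,
                 unsigned int n) {
    // Both sites must be polymorphic with respect to the focal alleles.
    assert(n_a > 0 && n_a < n && n_b > 0 && n_b < n);
    assert(n_ab <= n_a && n_ab <= n_b);
    const double pa = static_cast<double>(n_a) / n;
    const double pb = static_cast<double>(n_b) / n;
    const double pab = static_cast<double>(n_ab) / n;
    LDStats s;
    s.D = pab - pa * pb;
    // Largest magnitude D can reach given the allele frequencies.
    const double dmax = (s.D >= 0.0)
        ? std::min(pa * (1.0 - pb), (1.0 - pa) * pb)
        : std::min(pa * pb, (1.0 - pa) * (1.0 - pb));
    s.Dp = (dmax > 0.0) ? s.D / dmax : 0.0;
    s.r = s.D / std::sqrt(pa * (1.0 - pa) * pb * (1.0 - pb));
    s.rsq = s.r * s.r;
    return s;
}
```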

double egglib::PairwiseLD::D()
const

Get the D statistic.

This value is reset to 0 upon call to process().

Requires compute().

double egglib::PairwiseLD::Dp()
const

Get the D’ statistic.

This value is reset to 0 upon call to process().

Requires compute().

unsigned int egglib::PairwiseLD::freq(unsigned int allele1, unsigned int allele2)
const

Get the frequency of a genotype.

The method process() must have been executed and must have returned true.

The indexes must be smaller than the value returned by num_alleles1() and num_alleles2() respectively.

unsigned int egglib::PairwiseLD::freq1(unsigned int allele)
const

Get the frequency of an allele for the first site.

The method process() must have been executed and must have returned true.

The index must be smaller than the value returned by num_alleles1().

unsigned int egglib::PairwiseLD::freq2(unsigned int allele)
const

Get the frequency of an allele for the second site.

The method process() must have been executed and must have returned true.

The index must be smaller than the value returned by num_alleles2().

unsigned int egglib::PairwiseLD::index1(unsigned int allele)
const

Get the index of an allele for the first site.

The method process() must have been executed and must have returned true.

For a given allele, get its index within the original SiteHolder instance. The indexes can be shifted by process() due to missing data.

unsigned int egglib::PairwiseLD::index2(unsigned int allele)
const

Get the index of an allele for the second site.

The method process() must have been executed and must have returned true.

For a given allele, get its index within the original SiteHolder instance. The indexes can be shifted by process() due to missing data.

unsigned int egglib::PairwiseLD::nsam()
const

Get the number of analyzed samples.

The method process() must have been executed and must have returned true.

The returned value might be smaller than the initial number of samples due to missing data.

unsigned int egglib::PairwiseLD::num_alleles1()
const

Get the actual number of alleles at the first site.

The method process() must have been executed and must have returned true.

Gives the number of different alleles at the first site, considering only samples for which both sites have exploitable data.

unsigned int egglib::PairwiseLD::num_alleles2()
const

Get the actual number of alleles at the second site.

The method process() must have been executed and must have returned true.

Gives the number of different alleles at the second site, considering only samples for which both sites have exploitable data.

bool egglib::PairwiseLD::process(const SiteHolder &site1, const SiteHolder &site2, unsigned min_n, double max_maj, const StructureHolder *stru)

Analyze a pair of sites.

The method takes two sites as arguments. The two sites must be taken from the same data set. In particular, the sample sizes must be identical. Structure and outgroup are ignored. The indexes of samples must match between the two sites. Samples that are missing at either site are skipped. If fewer than min_n samples remain, the whole computation is dropped. Genotypes are ignored (only alleles are considered).

If this method returns true, statistics might be computed for a given pair of alleles using the compute() method. The number of alleles available for analysis is available at either site using num_alleles1() and num_alleles2(). When returning false, this method stops as early as possible, and the state of the object might be inconsistent. In this case, no accessor must be used and compute() must not be called.

Return
true if computations have been performed, false if the sites fall in one of the following cases: not enough samples (based on the min_n argument); either site is fixed; the allele frequencies are too unbalanced, with at least one allele at a frequency larger than max_maj.
Note
Due to missing data, a site that is initially polymorphic might appear to be fixed when considering only samples that are not missing for the other site, causing this method to drop the pairwise comparison. Conversely, a site that has more than two alleles might have only two when considering only samples that are not missing for the other site. For this reason, it is not trivial to filter out sites before calling this method, and sites might not be consistently included or rejected.
Parameters
  • site1 -

    first site.

  • site2 -

    second site.

  • min_n -

    minimum number of samples used (this value must always be larger than 1).

  • max_maj -

    maximum relative frequency of the majority allele (if any allele at either site has a frequency larger than this value, the pairwise comparison is dropped).

  • stru -

    a Structure object (used to process a subset of samples). By default, use all samples.
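
The filtering rules described for process() can be sketched as follows. This is a standalone illustration with hypothetical names (pair_is_usable, majority_frequency, MISSING_ALLELE) and plain integer alleles. It shows why a site that is polymorphic overall can be rejected once samples missing at the other site are removed.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <vector>

const int MISSING_ALLELE = -1;  // hypothetical missing-data code

// Largest relative allele frequency; also reports the number of alleles.
double majority_frequency(const std::vector<int>& alleles,
                          std::size_t& num_alleles) {
    std::map<int, unsigned int> counts;
    for (std::size_t i = 0; i < alleles.size(); ++i) ++counts[alleles[i]];
    num_alleles = counts.size();
    unsigned int best = 0;
    for (std::map<int, unsigned int>::const_iterator it = counts.begin();
         it != counts.end(); ++it) {
        if (it->second > best) best = it->second;
    }
    return alleles.empty() ? 0.0 : static_cast<double>(best) / alleles.size();
}

// True if the pair of sites passes the three filters: enough samples,
// both sites polymorphic, and no allele more frequent than max_maj.
bool pair_is_usable(const std::vector<int>& site1,
                    const std::vector<int>& site2,
                    unsigned int min_n, double max_maj) {
    assert(site1.size() == site2.size());
    std::vector<int> kept1, kept2;
    for (std::size_t i = 0; i < site1.size(); ++i) {
        // Skip samples that are missing at either site.
        if (site1[i] == MISSING_ALLELE || site2[i] == MISSING_ALLELE) continue;
        kept1.push_back(site1[i]);
        kept2.push_back(site2[i]);
    }
    if (kept1.size() < min_n) return false;
    std::size_t na1 = 0, na2 = 0;
    const double maj1 = majority_frequency(kept1, na1);
    const double maj2 = majority_frequency(kept2, na2);
    if (na1 < 2 || na2 < 2) return false;  // either site is fixed
    if (maj1 > max_maj || maj2 > max_maj) return false;
    return true;
}
```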

double egglib::PairwiseLD::r()
const

Get the r statistic.

This value is reset to 0 upon call to process().

Requires compute().

void egglib::PairwiseLD::reset()

Reset all values to default.

Call to this method is usually not necessary since process() automatically resets the instance.

double egglib::PairwiseLD::rsq()
const

Get the r^2 statistic.

Same as r()*r(). This value is reset to 0 upon call to process().

Requires compute().

MatrixLD

class

Analyzes linkage disequilibrium between pairs of sites.

This class processes a set of SiteHolder instances and computes linkage disequilibrium for all pairs of sites. A PairwiseLD instance is provided for each comparison, skipping all pairs for which LD cannot be computed (there are several criteria). The approach consists in first calling load() by providing a set of SiteHolder instances. The method computeLD() computes the LD for each pair and computeStats() computes the statistics of Kelly (1997) and Rozas et al. (2001). These statistics are based on the average of pairwise linkage disequilibrium statistics. In addition, computeRmin() computes Rm of Hudson and Kaplan (1985). It neither generates nor uses PairwiseLD instances, and it can be used independently.

Public Types

enum egglib::MatrixLD::MultiAllelic

Flags for processing multiallelic sites.

This enum is used to specify what should be done with pairs of sites for which at least one site has more than two alleles.

Values:

  • ignore -

    Only process pairs of sites with exactly two alleles.

  • use_main -

    Use the allele with highest frequency.

  • use_all -

    Use all alleles.

Public Functions

egglib::MatrixLD::MatrixLD()

Constructor.

egglib::MatrixLD::~MatrixLD()

Destructor.

void egglib::MatrixLD::computeLD(unsigned min_n, double max_maj)

Compute LD between all pairs of sites.

Use sites loaded using load() and process all possible pairs. Each pairwise comparison is retained only if all filters are passed (see arguments of this method). After this method has been called, the total number of pairs can be accessed using num_tot() (it is equal to n(n-1)/2 where n is the number of loaded sites); the number of analyzed pairs can be accessed using num_pairs(); the total number of analyzed allele pairs can be accessed using num_alleles(). Then the method computeStats() can be called to compute Kelly’s statistics.

Note
Due to missing data, it is not trivial to predict whether a pairwise comparison will be dropped. See the documentation of PairwiseLD::process().
Parameters
  • min_n -

    minimum number of samples used (this value must always be larger than 1).

  • max_maj -

    maximum relative frequency of the majority allele (if any allele at either site has a frequency larger than this value, the pairwise comparison is dropped).

void egglib::MatrixLD::computeRmin(bool oriented)

Computes Hudson and Kaplan’s Rmin.

To be used, this method requires that sites have been loaded in increasing position order. Only sites with exactly two alleles and no missing data at all are used. In addition, if the oriented argument is set to true, only orientable sites are considered. Sites with more than two alleles and sites with any missing data, and sites not orientable if oriented is set to true, are ignored.

When the method has finished, a few methods provide access to the results. Rmin_num_sites() gives the number of sites considered for the analysis. If the value is less than two, the statistic itself is not meaningful. Rmin() gives the minimum number of recombination events. Finally, all of the irreducible intervals containing a recombination event can be accessed using the two methods Rmin_left(unsigned int) and Rmin_right(unsigned int). The number of intervals is always Rmin().

Parameters
  • oriented -

    if true, consider only orientable sites and apply the three- (instead of four-) gametes rule. If false, ignore all outgroup data and include orientable and non-orientable sites.
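
One common way to obtain this bound in the unoriented case is to list all pairs of sites showing the four gametes, then keep the largest possible number of non-overlapping intervals. The sketch below does this with a greedy scan over intervals sorted by right end. It is an illustration with hypothetical names, assuming biallelic sites coded 0/1 with no missing data and listed in increasing position order. Whether egglib proceeds this way is not stated here.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

typedef std::pair<std::size_t, std::size_t> Interval;

// True if all four gametes (00, 01, 10, 11) are observed.
bool four_gametes(const std::vector<int>& a, const std::vector<int>& b) {
    assert(a.size() == b.size());
    bool seen[4] = {false, false, false, false};
    for (std::size_t i = 0; i < a.size(); ++i) {
        assert((a[i] == 0 || a[i] == 1) && (b[i] == 0 || b[i] == 1));
        seen[2 * a[i] + b[i]] = true;
    }
    return seen[0] && seen[1] && seen[2] && seen[3];
}

bool by_right_end(const Interval& x, const Interval& y) {
    return x.second < y.second;
}

// sites[s][i] is the allele (0 or 1) of sample i at site s.
unsigned int rmin(const std::vector<std::vector<int> >& sites) {
    std::vector<Interval> intervals;
    for (std::size_t i = 0; i < sites.size(); ++i)
        for (std::size_t j = i + 1; j < sites.size(); ++j)
            if (four_gametes(sites[i], sites[j]))
                intervals.push_back(Interval(i, j));
    std::sort(intervals.begin(), intervals.end(), by_right_end);
    unsigned int count = 0;
    std::size_t last_right = 0;
    for (std::size_t k = 0; k < intervals.size(); ++k) {
        // Keep an interval only if it starts at or after the end of the
        // previously kept one (sharing an end point is allowed).
        if (intervals[k].first >= last_right) {
            ++count;
            last_right = intervals[k].second;
        }
    }
    return count;
}
```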

void egglib::MatrixLD::computeStats(MultiAllelic multiallelic, unsigned int min_freq)

Computes Kelly’s and Rozas et al.’s statistics.

Computes ZnS, Z*nS and Z*nS* (Kelly 1997), and Za and ZZ (Rozas et al. 2001) on the basis of analyzed site pairs (requires computeLD()). The number of alleles pairs used for computing ZnS, Z*nS and Z*nS* is available as num_allele_pairs(), and the number of allele pairs used for computing Za and ZZ is available as num_allele_pairs_adj(). The statistics must not be used if the corresponding number of allele pairs is 0.

If multiallelic equals MatrixLD::ignore, only pairs of sites for which both sites have exactly two alleles are processed. In this case, the first allele of each site is considered. If multiallelic is MatrixLD::use_main, the alleles with highest frequency are considered (even if one or both sites have only two alleles). In case of equality, the first allele is considered. If multiallelic is MatrixLD::use_all, then all alleles of all sites are used, and the final statistics are averaged over num_alleles() (rather than num_pairs()).

Parameters
  • multiallelic -

    modifies the behaviour of the method (see above).

  • min_freq -

    this parameter has an effect only if used in conjunction with MatrixLD::use_all (it is ignored otherwise); if larger than 0, rather than using all alleles, use only those that have a frequency equal to or larger than the given value.
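As an illustration of what the simplest of these statistics summarises, Kelly's ZnS is usually defined as the mean of the squared correlation r^2 over all pairs of polymorphic sites. The sketch below assumes diallelic sites coded 0/1 without missing data; r_squared() and zns() are hypothetical names, and the handling of multiallelic sites and of min_freq by the library is not reproduced.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: squared correlation r^2 between two diallelic sites
// coded 0/1 (no missing data, both sites polymorphic).
double r_squared(const std::vector<int>& a, const std::vector<int>& b) {
    assert(a.size() == b.size() && !a.empty());
    const double n = static_cast<double>(a.size());
    double pa = 0.0, pb = 0.0, pab = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        pa += a[i];
        pb += b[i];
        pab += a[i] * b[i];
    }
    pa /= n;
    pb /= n;
    pab /= n;
    const double denom = pa * (1.0 - pa) * pb * (1.0 - pb);
    assert(denom > 0.0);  // both sites must be polymorphic
    const double d = pab - pa * pb;  // linkage disequilibrium coefficient D
    return d * d / denom;
}

// Hypothetical helper: ZnS as the mean of r^2 over all pairs of sites.
// Returns 0.0 when there is no pair; the value is then not meaningful.
double zns(const std::vector<std::vector<int> >& sites) {
    double sum = 0.0;
    unsigned int num_pairs = 0;
    for (std::size_t i = 0; i < sites.size(); ++i) {
        for (std::size_t j = i + 1; j < sites.size(); ++j) {
            sum += r_squared(sites[i], sites[j]);
            ++num_pairs;
        }
    }
    return num_pairs > 0 ? sum / num_pairs : 0.0;
}
```

Returning 0 when no pair is available mirrors the convention described for num_allele_pairs(): the value is reset to 0 but should be treated as not computable.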

unsigned int egglib::MatrixLD::distance(unsigned int index)
const

Get the distance between sites for a given pair.

Requires that enough pairs are available (sites are loaded using load()) and that the requested index is smaller than num_pairs(). The distance is returned as an absolute value. It is not possible to determine to which sites the pair index corresponds. If you need this information, you might want to use PairwiseLD directly.

unsigned int egglib::MatrixLD::index1(unsigned int allele)
const

Index of first site for a given pair.

See pairLD().

unsigned int egglib::MatrixLD::index2(unsigned int allele)
const

Index of second site for a given pair.

See pairLD(). Note that index2 is always > index1.

void egglib::MatrixLD::load(const SiteHolder &site, double position)

Load a site.

Parameters
  • site -

    One of the sites.

  • position -

    All sites must have a valid position. Positions are required to be increasing. For computing Rmin, positions are ignored (they are only reported back if interval limits are requested).

unsigned int egglib::MatrixLD::num_allele_pairs()
const

Number of allele pairs used to compute Kelly’s statistics.

Return the number of allele pairs used by computeStats() to compute Kelly’s ZnS, Z*nS and Z*nS* statistics. If multiallelic was ignore, this value equals the number of pairs of sites with exactly two alleles each (at most num_pairs()); if multiallelic was use_main, this value equals num_pairs(); if multiallelic was use_all, this value equals num_alleles(). If the returned value is 0 (no loaded pairs of sites or, if multiallelic was set to MatrixLD::ignore, no pairs of diallelic sites), Kelly’s statistics are reset to 0 and should be considered not computable. In that case, none of Kelly’s and Rozas et al.’s statistics can be computed.

unsigned int egglib::MatrixLD::num_allele_pairs_adj()
const

Number of allele pairs used to compute Rozas et al.’s statistics.

Return the number of allele pairs used by computeStats() to compute Rozas et al.’s Za and ZZ statistics. See the documentation of num_allele_pairs() for reference. The meaning of this value is similar, except that it applies only to adjacent polymorphic sites (the value can hence only be smaller, or equal in limit cases). If this value is 0, Rozas et al.’s statistics are reset to 0 and should be considered not computable.

unsigned int egglib::MatrixLD::num_alleles()
const

Get the total number of allele pairs.

Requires computeLD(). Returns the sum of allele pairs over all analyzed sites (see num_pairs()). The minimum value is twice num_pairs() (since, by definition, there must be at least two alleles at each retained site).

unsigned int egglib::MatrixLD::num_pairs()
const

Get the number of analyzed pairs of sites.

Requires computeLD(). The returned value excludes all pairwise comparisons with no polymorphism or failing any other criterion (see the computeLD() method).

unsigned int egglib::MatrixLD::num_tot()
const

Get the total number of pairs of sites.

Requires that site pairs have been processed using computeLD().

const PairwiseLD &egglib::MatrixLD::pairLD(unsigned int index)
const

Get linkage disequilibrium for a given pair of sites.

Requires that enough pairs are available (sites are loaded using load()) and that the requested index is smaller than num_pairs(). Use methods index1() and index2() to obtain the corresponding site indexes.

unsigned int egglib::MatrixLD::process(unsigned min_n, double max_maj, MultiAllelic multiallelic, unsigned int min_freq, bool oriented)

Call computeLD() and computeStats() (together) and/or computeRmin(), depending on the toggled flags.

Arguments are the same as for the three methods. Any value may be passed for the arguments of methods that are not used.

Return value is a flag with the following bits:

  • 0: ZnS, Z*nS, and Z*nS* were computed.
  • 1: Za and ZZ were computed.
  • 2: Rmin was computed.
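The returned flag can be decoded with bitwise tests. The sketch below is not part of the library: ProcessResult and decode_process_flag() are hypothetical names, and it assumes that "bit i" in the list above designates the bit of weight 2^i (bit 0 being the least significant).

```cpp
#include <cassert>

// Hypothetical structure summarising which results are available.
struct ProcessResult {
    bool kelly;  // ZnS, Z*nS and Z*nS* were computed (bit 0)
    bool rozas;  // Za and ZZ were computed (bit 1)
    bool rmin;   // Rmin was computed (bit 2)
};

// Decode the value returned by process(), assuming bit i has weight 2^i.
ProcessResult decode_process_flag(unsigned int flag) {
    assert(flag < 8u);  // only bits 0 to 2 are documented
    ProcessResult res;
    res.kelly = (flag & (1u << 0)) != 0;
    res.rozas = (flag & (1u << 1)) != 0;
    res.rmin = (flag & (1u << 2)) != 0;
    return res;
}
```

Under this assumption, a return value of 5 would mean that Kelly's statistics and Rmin are available but Za and ZZ are not.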

void egglib::MatrixLD::reset()

Reset to defaults.

unsigned int egglib::MatrixLD::Rmin()
const

Minimal number of recombination events.

Requires computeRmin().

unsigned int egglib::MatrixLD::Rmin_left(unsigned int i)
const

Left bound of a recombination interval.

Requires computeRmin() and that Rmin_num_sites() is at least 2.

Return
Position of the site at the 5’ end of the given interval (as provided to load()).
Parameters
  • i -

    interval index (there are Rmin() intervals).

unsigned int egglib::MatrixLD::Rmin_num_sites()
const

Number of sites used for computing Rmin.

Requires computeRmin().

Fixed to 0 if there are fewer than two sites in total (no computation is performed by computeRmin() in that case). If Rmin_num_sites() is less than 2, Rmin() is not defined and fixed to 0.

unsigned int egglib::MatrixLD::Rmin_right(unsigned int i)
const

Right bound of a recombination interval.

Requires computeRmin() and that Rmin_num_sites() is at least 2.

Return
Position of the site at the 3’ end of the given interval (as provided to load()).
Parameters
  • i -

    interval index (there are Rmin() intervals).

void egglib::MatrixLD::toggle_off()

Toggle all off.

void egglib::MatrixLD::toggle_Rmin()

Toggle Rmin.

void egglib::MatrixLD::toggle_stats()

Toggle summary statistics.

double egglib::MatrixLD::Za()
const

Get Rozas et al.’s Za statistic.

Requires computeStats(). See the documentation of that method to know when this value is defined.

double egglib::MatrixLD::ZnS()
const

Get Kelly’s ZnS statistic.

Requires computeStats(). See the documentation of that method to know when this value is defined.

double egglib::MatrixLD::ZnS_star1()
const

Get Kelly’s Z*nS statistic.

Requires computeStats(). See the documentation of that method to know when this value is defined.

double egglib::MatrixLD::ZnS_star2()
const

Get Kelly’s Z*nS* statistic.

Requires computeStats(). See the documentation of that method to know when this value is defined.

double egglib::MatrixLD::ZZ()
const

Get Rozas et al.’s ZZ statistic.

Requires computeStats(). See the documentation of that method to know when this value is defined.

See also the Rd class.

Extended haplotype homozygosity

class

Compute Extended Haplotype Homozygosity statistics.

Compute statistics described in Sabeti et al. (Nature 2002), Voight et al. (PLoS Biology 2006), Ramirez-Soriano et al. (Genetics 2008) and Tang et al. (PLoS Biology 2007).

The user must first load the core haplotype or site using the set_core() method, which also allows option values to be specified, and then load all needed distant sites using load_distant(). Distant sites must be loaded for one side only, in order of increasing distance from the core. To load sites of the other side, the user needs to call set_core() again with the same core site in order to reset statistics. Statistics are automatically computed and updated at each loaded distant site. It is required to load at least one valid core site before using accessors.

Header: <egglib-cpp/EHH.hpp>
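To make the incremental update concrete, the sketch below follows the usual definition of EHH (Sabeti et al. 2002): among the n samples carrying a core haplotype, EHH is the proportion of sample pairs that are still identical over all distant sites loaded so far. EhhSketch is a hypothetical class written for illustration, not the EggLib class; it handles a single core haplotype and assumes no missing data.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <utility>
#include <vector>

// Hypothetical tracker for one core haplotype (not the EggLib class).
// Each sample carries a label identifying its extended haplotype; loading a
// distant site splits the existing labels according to the new allele.
class EhhSketch {
public:
    explicit EhhSketch(std::size_t num_samples) : labels_(num_samples, 0) {}

    // Load the next distant site (one allele per sample, no missing data)
    // and return the updated EHH value.
    double load_distant(const std::vector<int>& alleles) {
        assert(alleles.size() == labels_.size());
        // Maps (previous label, allele) to the new label.
        std::map<std::pair<int, int>, int> classes;
        for (std::size_t i = 0; i < labels_.size(); ++i) {
            const std::pair<int, int> key(labels_[i], alleles[i]);
            std::map<std::pair<int, int>, int>::iterator it = classes.find(key);
            if (it == classes.end()) {
                const int next = static_cast<int>(classes.size());
                it = classes.insert(std::make_pair(key, next)).first;
            }
            labels_[i] = it->second;
        }
        // Count the samples in each extended haplotype class.
        std::vector<double> counts(classes.size(), 0.0);
        for (std::size_t i = 0; i < labels_.size(); ++i) counts[labels_[i]] += 1.0;
        const double n = static_cast<double>(labels_.size());
        if (n < 2.0) return 0.0;
        // Proportion of sample pairs that are still identical.
        double same = 0.0;
        for (std::size_t c = 0; c < counts.size(); ++c) same += counts[c] * (counts[c] - 1.0) / 2.0;
        return same / (n * (n - 1.0) / 2.0);
    }

private:
    std::vector<int> labels_;
};
```

Because classes can only be split, never merged, the value returned by successive calls cannot increase, which is why sites have to be loaded in order of increasing distance on one side of the core.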

Public Functions

egglib::EHH::EHH()

Constructor.

virtual egglib::EHH::~EHH()

Destructor.

double egglib::EHH::dEHH(unsigned int haplotype)
const

Get an EHH decay value.

double egglib::EHH::dEHH_max()
const

Get the maximum EHH decay value (computed on the fly).

double egglib::EHH::dEHH_mean()
const

Get the average EHH decay value (computed on the fly).

double egglib::EHH::dEHHc(unsigned int haplotype)
const

Get an EHHc decay value.

double egglib::EHH::dEHHG()
const

Get an EHHS (genotypes) decay value.

double egglib::EHH::dEHHS()
const

Get an EHHS decay value.

double egglib::EHH::EHHc(unsigned int haplotype)
const

Get an EHHc value.

double egglib::EHH::EHHG()
const

Get an EHHG value.

double egglib::EHH::EHHi(unsigned int haplotype)
const

Get an EHH value.

double egglib::EHH::EHHS()
const

Get an EHHS value.

bool egglib::EHH::flag_EHHG_done()
const

Tell if decay has been reached for EHHG.

bool egglib::EHH::flag_EHHS_done()
const

Tell if decay has been reached for EHHS.

double egglib::EHH::iEG()
const

Get an iEG value.

double egglib::EHH::iES()
const

Get an iES value.

double egglib::EHH::IHH(unsigned int haplotype)
const

Get an IHH value.

double egglib::EHH::IHHc(unsigned int haplotype)
const

Get an IHHc value.

double egglib::EHH::iHS(unsigned int haplotype)
const

Get an iHS value.

unsigned int egglib::EHH::K_core()
const

Number of used haplotypes of the core.

unsigned int egglib::EHH::K_cur()
const

Current number of haplotypes.

void egglib::EHH::load_distant(const SiteHolder *site, double distance)

Load a distant site.

For each core haplotype, compute or update all statistics.

Parameters
  • site -

    the distant site to be loaded. The method will only throw an exception if the number of samples differs.

  • distance -

    distance between the core and the distant site. The nature of the distance metric is up to the user but must be consistent across sites. Distances must be >= 0 and must correspond to the positions of distant sites relative to the core region or site. Sites must be loaded such that distances are always increasing.
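The requirement that distances increase is what allows integrated statistics to be accumulated site after site. As a sketch only, the class below integrates an EHH curve over distance with the trapezoidal rule and stops once EHH falls below a threshold. EhhIntegrator is a hypothetical name; the assumption that EHH equals 1 at the core and the choice of including the last trapezoid are illustrative and may differ from what the library does.

```cpp
#include <cassert>

// Hypothetical accumulator integrating an EHH curve over distance with the
// trapezoidal rule. EHH is assumed to be 1 at the core (distance 0).
class EhhIntegrator {
public:
    explicit EhhIntegrator(double threshold)
        : threshold_(threshold), last_distance_(0.0), last_ehh_(1.0),
          area_(0.0), done_(false) {
        assert(threshold >= 0.0 && threshold <= 1.0);
    }

    // Add the EHH value observed at the next site; distances must increase.
    void add(double distance, double ehh) {
        assert(distance >= last_distance_);
        if (done_) return;  // decay already reached: ignore further sites
        area_ += 0.5 * (last_ehh_ + ehh) * (distance - last_distance_);
        last_distance_ = distance;
        last_ehh_ = ehh;
        if (ehh < threshold_) done_ = true;
    }

    double area() const { return area_; }
    bool done() const { return done_; }

private:
    double threshold_;
    double last_distance_;
    double last_ehh_;
    double area_;
    bool done_;
};
```

Any distance unit works as long as it is used consistently, since the integral is expressed in that same unit.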

unsigned int egglib::EHH::num_avail_core(unsigned int)
const

Current number of non-missing samples for a core haplotype.

unsigned int egglib::EHH::num_avail_cur(unsigned int)
const

Current number of non-missing samples for a current haplotype.

unsigned int egglib::EHH::num_avail_tot()
const

Current number of non-missing samples.

unsigned int egglib::EHH::num_EHH_done()
const

Number of haplotypes for which computation of dEHH and IHH has been completed.

unsigned int egglib::EHH::num_EHHc_done()
const

Number of haplotypes for which computation of dEHHc and IHHc has been completed.

double egglib::EHH::rEHH(unsigned int haplotype)
const

Get an rEHH value.

void egglib::EHH::set_core(const SiteHolder *site, bool genotypes, double EHH_thr, double EHHc_thr, double EHHS_thr, double EHHG_thr, unsigned int min_freq, unsigned int min_sam, bool crop)

Load the core site or region.

This method automatically resets the instance (clears all previously computed data and reallocates arrays to proper sizes). The Site instance passed as core is only used by this method. All counters will be incremented until the next call to set_core() or the destruction of the object. All thresholds are understood as either EHH or EHHS values and therefore must lie between 0.0 and 1.0.

Parameters
  • site -

    core site or region. If a region, haplotypes within the core region must have been identified previously and should be loaded as a Site instance. The site may contain missing data. The samples containing missing data at the core site will be ignored for all subsequently loaded distant sites.

  • genotypes -

    if true, consider that data are entered as unphased genotypes (the Site instance must have consistent data).

  • EHH_thr -

    threshold EHH value.

  • EHHc_thr -

    threshold EHHc value.

  • EHHS_thr -

    threshold EHHS value.

  • EHHG_thr -

    threshold EHHS (genotypes) value.

  • min_freq -

    minimal absolute frequency for haplotypes (haplotypes with lower frequencies are ignored). Required to be strictly larger than zero.

  • min_sam -

    minimal number of samples to continue computing (applied both within core haplotypes and for the total).

  • crop -

    if true, set values of EHHS that are below the threshold to 0 to emulate the behaviour of the R package rehh (also affects iES).
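The constraints on these options can be summarised in a few lines. The sketch below is illustrative only: CoreOptions, options_valid() and apply_crop() are hypothetical names covering a subset of the arguments, and whether a value exactly equal to the threshold is cropped is an assumption (here it is kept).

```cpp
#include <cassert>

// Hypothetical option bundle mirroring a subset of the set_core() arguments.
struct CoreOptions {
    double EHHS_thr;        // threshold, expected within [0.0, 1.0]
    unsigned int min_freq;  // minimal absolute frequency, strictly positive
    bool crop;              // set EHHS values below the threshold to 0
};

// Check the constraints stated in the parameter descriptions.
bool options_valid(const CoreOptions& opt) {
    return opt.EHHS_thr >= 0.0 && opt.EHHS_thr <= 1.0 && opt.min_freq > 0;
}

// Return the EHHS value after optional cropping.
double apply_crop(double ehhs, const CoreOptions& opt) {
    assert(options_valid(opt));
    return (opt.crop && ehhs < opt.EHHS_thr) ? 0.0 : ehhs;
}
```

Since iES integrates EHHS over distance, cropping small values to 0 also lowers the integral, which is consistent with the note that iES is affected.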

Helpers

Constants

const unsigned int MAX_N_STIRLING

Maximal n value for pre-computed Stirling numbers.

Header: <egglib-cpp/stirling.hpp>

const unsigned int NUM_STIRLING

Size of the Stirling numbers table.

Header: <egglib-cpp/stirling.hpp>

const double STIRLING_TABLE[500499]

Array of log(|S(n,k)|) (Stirling numbers of the 1st kind).

The values must be accessed using the stirling_table() function.

Header: <egglib-cpp/stirling.hpp>

Functions

double egglib::stirling_table(unsigned int n, unsigned int k)

Get a pre-computed Stirling number of the first kind.

The n parameter must be <= MAX_N_STIRLING and k must be > 0 and <= n.

Header: <egglib-cpp/stirling.hpp>
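For values of n beyond the pre-computed range, or to check a table entry, log(|S(n,k)|) can be obtained from the standard recurrence |S(n,k)| = (n-1)|S(n-1,k)| + |S(n-1,k-1)| evaluated in log space. The sketch below is a standalone illustration; log_add() and log_stirling1() are hypothetical helpers that are not part of the library, and the use of the natural logarithm is an assumption about the table's convention.

```cpp
#include <cassert>
#include <cmath>
#include <limits>
#include <vector>

// Hypothetical helper: log(exp(a) + exp(b)), where either term may be
// minus infinity (representing log(0)).
double log_add(double a, double b) {
    const double neg_inf = -std::numeric_limits<double>::infinity();
    if (a == neg_inf) return b;
    if (b == neg_inf) return a;
    const double hi = a > b ? a : b;
    const double lo = a > b ? b : a;
    return hi + std::log1p(std::exp(lo - hi));
}

// Hypothetical helper: natural log of the unsigned Stirling number of the
// first kind |S(n,k)|, with the same argument constraints as above
// (k > 0 and k <= n), but without any upper limit on n.
double log_stirling1(unsigned int n, unsigned int k) {
    assert(k > 0 && k <= n);
    const double neg_inf = -std::numeric_limits<double>::infinity();
    // row[j] holds log|S(m,j)| for the current m; start with |S(1,1)| = 1.
    std::vector<double> row(n + 1, neg_inf);
    row[1] = 0.0;
    for (unsigned int m = 2; m <= n; ++m) {
        // Iterate j downwards so that row[j-1] still refers to row m-1.
        for (unsigned int j = m; j >= 1; --j) {
            const double a = row[j] + std::log(static_cast<double>(m - 1));
            const double b = row[j - 1];
            row[j] = log_add(a, b);
        }
    }
    return row[k];
}
```

Working in log space avoids overflow, since |S(n,k)| grows roughly like a factorial and exceeds the range of double well before n reaches a few hundred.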