Core components

Constants

These constants give special meaning to input parameters or return values of functions. Usually they mean “data not available” or “irrelevant value”. Different names are given to the same value so that each name is relevant to its context. Read the documentation carefully.

const int MISSINGDATA

Missing data (large value)

const unsigned int MAX

Unknown/undefined (large value)

const unsigned int UNKNOWN

Unknown value (large value)

const unsigned int MISSING

Missing data (large value)

const unsigned int OUTGROUP

Outgroup (large value)

const unsigned long int BEFORE

Value before the first (large value)

const char MAXCHAR

Unknown value (large value)

const double UNDEF

Unknown value (small / very negative value)

Basic classes

DataHolder

class

Integer data set.

Holds a data set with associated sample names and group information. The data consist of given numbers of ingroup and outgroup samples, which can all have a different number of sites unless the object is configured to be a matrix. In that case, it is assumed that all samples have the same number of sites as the first loaded sample.

There can be any number of group levels (but this number must be the same for all samples), meaning that samples can be described by several group labels in addition to their name. Group labels are not group indices (they do not need to be consecutive). There is a separate data set for samples belonging to the outgroup. There can be any number of outgroup samples. The outgroup always has one level of group labels, but the labels are not initialized.

All data are represented by signed integers. Note that none of the accessors performs out-of-bound checking: the user is responsible for providing valid indices. This class follows a memory caching system: allocated memory is never freed, with the aim of efficiently reusing the same object.

Header: <egglib-cpp/DataHolder.hpp>

Public Functions

egglib::DataHolder::DataHolder(bool is_matrix)

Default constructor.

Create an empty instance. The object is usable only after the resizing methods have been called.

Parameters
  • is_matrix -

    determines if the object is configured as a matrix (that is, where all ingroup and outgroup samples have the same number of sites). If so, the user is responsible for ensuring that all loaded samples are consistent. This value can be changed using the method set_is_matrix(bool).

egglib::DataHolder::DataHolder(const DataHolder &src)

Copy constructor.

The reserved memory of the source is not copied.

virtual egglib::DataHolder::~DataHolder()

Destructor.

void egglib::DataHolder::clear(bool is_matrix)

Clear object.

Releases all memory held by the object, including the cache. All stored data are lost.

void egglib::DataHolder::del_sample_i(unsigned int sam)

Delete a sample.

Delete the specified sample and decrease index of all subsequent samples by one. If there is only one sequence in the instance, set the number of sites (the maximal number of sites for non-matrix objects) to 0.

void egglib::DataHolder::del_sample_o(unsigned int sam)

Delete an outgroup sample.

Delete the specified sample and decrease index of all subsequent samples by one. If there is only one sequence in the instance, set the number of sites (the maximal number of sites for non-matrix objects) to 0.

void egglib::DataHolder::del_sites(unsigned int start, unsigned int stop)

Delete a given range of sites.

The sites are removed for all ingroup and outgroup samples. The index must be valid. This method may be used for both matrix and non-matrix objects.

If the stop argument is larger than the number of sites (the number of sites of the sample considered, in the case of a non-matrix object), then sites are removed until the end of the sequence. If the start argument is larger than or equal to the number of sites (the number of sites of the sample considered, in the case of a non-matrix object), then nothing is done.

Parameters
  • start -

    start position of the range to remove.

  • stop -

    stop position of the range to remove (this site IS NOT removed).
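
The range semantics described above (half-open range, stop clamped to the end of the sequence, nothing done if start is past the end) can be illustrated on a single sequence. This is a minimal standalone sketch using std::vector, not EggLib code; the helper name erase_sites is hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not part of EggLib): remove the sites in [start, stop)
// from one sequence, following the rules stated for del_sites().
void erase_sites(std::vector<int>& seq, std::size_t start, std::size_t stop) {
    if (start >= seq.size()) return;        // start past the end: nothing is done
    stop = std::min(stop, seq.size());      // stop past the end: remove until the end
    if (stop <= start) return;              // empty range
    seq.erase(seq.begin() + static_cast<std::ptrdiff_t>(start),
              seq.begin() + static_cast<std::ptrdiff_t>(stop));
}
```

For a sequence holding 10, 11, 12, 13, 14, calling erase_sites(seq, 1, 3) removes the sites at indices 1 and 2 and keeps the site at index 3.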

void egglib::DataHolder::del_sites_i(unsigned int sam, unsigned int start, unsigned int stop)

Delete a given range of sites for an ingroup sample.

As del_sites(unsigned int, unsigned int) but for a single sample. This method may not be called on a matrix object.

void egglib::DataHolder::del_sites_o(unsigned int sam, unsigned int start, unsigned int stop)

Delete a given range of sites for an outgroup sample.

As del_sites(unsigned int, unsigned int) but for a single sample. This method may not be called on a matrix object.

unsigned int egglib::DataHolder::find(unsigned int sam, bool of_outgroup, VectorInt &motif, unsigned int start, unsigned int stop)
const

Find the start position of the first match of a motif.

Return
The index of the start position of the first exact match for the passed set of values, or egglib::MAX if no match was found (within the specified region).
Parameters
  • sam -

    sample index.

  • of_outgroup -

    specifies whether the sample is in the outgroup.

  • motif -

    the list of integers representing the motif to be found.

  • start -

    position at which to start the search. No returned value will be smaller than this value.

  • stop -

    position at which to stop the search (the motif cannot overlap this position). No returned value will be larger than stop - n, where n is the length of the motif.
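
The search bounds can be made concrete with a small standalone function. This is a sketch under stated assumptions, not EggLib's implementation: the sequence is a plain std::vector<int>, NOT_FOUND is a stand-in for egglib::MAX (whose actual value is not given here), and clamping stop to the sequence length is an assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <limits>
#include <vector>

// Stand-in for the "no match" sentinel (assumption: a very large value).
const std::size_t NOT_FOUND = std::numeric_limits<std::size_t>::max();

// Return the first position p in [start, stop - n] such that
// seq[p .. p+n-1] equals the motif (n is the motif length), or NOT_FOUND.
std::size_t find_motif(const std::vector<int>& seq, const std::vector<int>& motif,
                       std::size_t start, std::size_t stop) {
    const std::size_t n = motif.size();
    if (stop > seq.size()) stop = seq.size();          // assumption: clamp to sequence end
    if (n == 0 || stop < n || start > stop - n) return NOT_FOUND;
    for (std::size_t pos = start; pos <= stop - n; ++pos) {
        bool match = true;
        for (std::size_t k = 0; k < n; ++k) {
            if (seq[pos + k] != motif[k]) { match = false; break; }
        }
        if (match) return pos;                          // never smaller than start
    }
    return NOT_FOUND;                                   // motif may not overlap stop
}
```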

unsigned int egglib::DataHolder::get_group_i(unsigned int sam, unsigned int lvl)
const

Get a group label.

The indices must be valid, otherwise a segmentation fault or aberrant behaviour will occur.

Parameters
  • sam -

    sample index.

  • lvl -

    group level index.

unsigned int egglib::DataHolder::get_group_o(unsigned int sam)
const

Get the group label for an outgroup sample.

The index must be valid, otherwise a segmentation fault or aberrant behaviour will occur. There is necessarily one group level for outgroups, and the default value is 0.

Parameters
  • sam -

    sample index.

int egglib::DataHolder::get_i(unsigned int sam, unsigned int sit)
const

Get an ingroup data entry.

The indices must be valid, otherwise a segmentation fault or aberrant behaviour will occur.

Parameters
  • sam -

    sample index.

  • sit -

    site index.

bool egglib::DataHolder::get_is_matrix()
const

Check if the object is configured to be a matrix.

const char *egglib::DataHolder::get_name_i(unsigned int sam)
const

Get an ingroup name.

const char *egglib::DataHolder::get_name_o(unsigned int sam)
const

Get an outgroup name.

unsigned int egglib::DataHolder::get_ngroups()
const

Get the number of group levels.

unsigned int egglib::DataHolder::get_nsam_i()
const

Get the number of ingroup samples.

unsigned int egglib::DataHolder::get_nsam_o()
const

Get the number of outgroup samples.

unsigned int egglib::DataHolder::get_nsit()
const

Get the number of sites.

This method may not be called on a non-matrix object.

unsigned int egglib::DataHolder::get_nsit_i(unsigned int sam)
const

Get the number of sites for an ingroup sample.

This method may not be called on a matrix object.

unsigned int egglib::DataHolder::get_nsit_o(unsigned int sam)
const

Get the number of sites for an outgroup sample.

This method may not be called on a matrix object.

int egglib::DataHolder::get_o(unsigned int sam, unsigned int sit)
const

Get an outgroup data entry.

The indices must be valid, otherwise a segmentation fault or aberrant behaviour will occur.

Parameters
  • sam -

    sample index.

  • sit -

    site index.

void egglib::DataHolder::insert_sites(unsigned int pos, unsigned int num)

Insert sites at a given position.

Increase the number of sites for all samples. This method may be used for matrix or non-matrix objects. Note that the inserted sites are not initialized.

Parameters
  • pos -

    the position at which to insert sites. The new sites are inserted before the specified index. Use 0 to add sites at the beginning of the sequence, and the current number of sites to add sites at the end. If the value is larger than the current number of sites, sites are added at the end of the sequence. Therefore it is possible to use egglib::MAX as the position to specify that new sites must be inserted at the end.

  • num -

    number of sites to add.
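
The rule for the insertion position (insert before the given index, and append if the index is past the end) can be sketched on a single sequence. This is a standalone illustration with std::vector, not EggLib code. Unlike the documented method, which leaves new sites uninitialized, the hypothetical helper below fills them with a caller-provided value so that the result is well defined.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not part of EggLib): insert num sites before index pos.
// A position larger than the current length means "append at the end".
void insert_sites_sketch(std::vector<int>& seq, std::size_t pos, std::size_t num, int fill) {
    if (pos > seq.size()) pos = seq.size();
    seq.insert(seq.begin() + static_cast<std::ptrdiff_t>(pos), num, fill);
}
```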

void egglib::DataHolder::insert_sites_i(unsigned int sam, unsigned int pos, unsigned int num)

Insert sites at a given position for an ingroup sample.

As insert_sites(unsigned int, unsigned int) but for only one sample of the ingroup. Available only for non-matrix objects.

void egglib::DataHolder::insert_sites_o(unsigned int sam, unsigned int pos, unsigned int num)

Insert sites at a given position for an outgroup sample.

As insert_sites(unsigned int, unsigned int) but for only one sample of the outgroup. Available only for non-matrix objects.

bool egglib::DataHolder::is_equal()
const

Test if all sequences have the same length.

Returns true if there are no sequences at all. Only valid for containers (non-matrix objects).

void egglib::DataHolder::name_append_i(unsigned int sam, const char *ch)

Add characters at the end of the specified ingroup name.

void egglib::DataHolder::name_append_o(unsigned int sam, const char *ch)

Add characters at the end of the specified outgroup name.

void egglib::DataHolder::name_appendch_i(unsigned int sam, char ch)

Add a character to the specified ingroup name.

void egglib::DataHolder::name_appendch_o(unsigned int sam, char ch)

Add a character to the specified outgroup name.

DataHolder &egglib::DataHolder::operator=(const DataHolder &src)

Assignment operator.

The reserved memory of the source is not copied. The reserved memory of the current object is retained.

void egglib::DataHolder::reserve(unsigned int nsi, unsigned int nso, unsigned int ln, unsigned int ng, unsigned int ls)

Reserve memory to speed up data loading.

This method does not change the size of the data set contained in the instance, but reserves memory in order to speed up incremental loading of data. The passed values are not required to be accurate. In case the instance has allocated more memory than what is requested, nothing is done (this applies to both dimensions independently). It is always valid to use 0 for any values (in that case, nothing is done). Note that one character is always pre-allocated for all names.

Parameters
  • nsi -

    expected number of ingroup samples.

  • nso -

    expected number of outgroup samples.

  • ln -

    expected length of names.

  • ng -

    expected number of groups.

  • ls -

    expected number of sites (the same for all ingroup and outgroup samples, whether the object is a matrix or not).

void egglib::DataHolder::reset(bool is_matrix)

Restore object to initial state.

This method is designed to allow reusing the object and reusing previously allocated memory. All data contained in the instance is considered to be lost, but allocated memory is actually retained to speed up later resizing operations.
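
The difference between reserving, resetting and clearing can be sketched with a miniature buffer built on std::vector. This is only an analogy under stated assumptions: ReusableBuffer is a hypothetical type, and nothing here reflects how DataHolder manages its memory internally.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical miniature of a reusable buffer (not EggLib's DataHolder).
struct ReusableBuffer {
    std::vector<int> data;

    // Reserve memory without changing the size; never shrinks.
    void reserve(std::size_t n) { if (n > data.capacity()) data.reserve(n); }

    // Forget the content but keep the allocated memory for later reuse.
    void reset() { data.clear(); }

    // Forget the content and actually release the memory.
    void clear() { std::vector<int>().swap(data); }
};
```

After reset(), the size is 0 but the capacity is unchanged, so reloading data of a similar size does not require new allocations.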

void egglib::DataHolder::set_group_i(unsigned int sam, unsigned int lvl, unsigned int label)

Set a group label.

The indices must be valid, otherwise a segmentation fault or aberrant behaviour will occur.

Parameters
  • sam -

    sample index.

  • lvl -

    group level index.

  • label -

    group label.

void egglib::DataHolder::set_group_o(unsigned int sam, unsigned int label)

Set the group label for an outgroup sample.

The index must be valid, otherwise a segmentation fault or aberrant behaviour will occur. There is necessarily one group level for outgroups.

Parameters
  • sam -

    sample index.

  • label -

    group label.

void egglib::DataHolder::set_i(unsigned int sam, unsigned int sit, int value)

Set an ingroup data entry.

The indices must be valid, otherwise a segmentation fault or aberrant behaviour will occur.

Parameters
  • sam -

    sample index.

  • sit -

    site index.

  • value -

    allele value.

void egglib::DataHolder::set_is_matrix(bool flag)

Configure the object (not) to be a matrix.

If a non-matrix object is converted to a matrix, the user is responsible for ensuring that all samples (including the outgroup) have the same number of sites. The method will assume that all samples have the same number of sites as the first sample among the ingroup and outgroup. There is no requirement for converting a matrix to a non-matrix.

void egglib::DataHolder::set_name_i(unsigned int sam, const char *name)

Set an ingroup name.

void egglib::DataHolder::set_name_o(unsigned int sam, const char *name)

Set an outgroup name.

void egglib::DataHolder::set_ngroups(unsigned int ngrp)

Set the number of group levels.

Performs memory allocation as needed but does not initialize new values.

void egglib::DataHolder::set_nsam_i(unsigned int nsam)

Set the number of ingroup samples.

Performs memory allocation as needed but does not initialize new values (except names). If the object is a matrix, new samples are set to have the current number of sites. Otherwise, new samples have no sites. Setting the number of samples to a smaller value removes the last samples.

void egglib::DataHolder::set_nsam_o(unsigned int nsam)

Set the number of outgroup samples.

As set_nsam_i(unsigned int) but for the outgroup data table.

void egglib::DataHolder::set_nsit(unsigned int val)

Set the number of sites.

Performs memory allocation as needed but does not initialize new values. It is possible to use this method for both matrix and non-matrix objects. In both cases, the effective result is that all ingroup and outgroup samples are resized to the specified number of sites.

void egglib::DataHolder::set_nsit_i(unsigned int sam, unsigned int val)

Set the number of sites for an ingroup sample.

Similar to set_nsit(unsigned int) but for only one sample of the ingroup. Available only for non-matrix objects.

void egglib::DataHolder::set_nsit_o(unsigned int sam, unsigned int val)

Set the number of sites for an outgroup sample.

Similar to set_nsit(unsigned int) but for only one sample of the outgroup. Available only for non-matrix objects.

void egglib::DataHolder::set_o(unsigned int sam, unsigned int sit, int value)

Set an outgroup data entry.

The indices must be valid, otherwise a segmentation fault or aberrant behaviour will occur.

Parameters
  • sam -

    sample index.

  • sit -

    site index.

  • value -

    allele value.

void egglib::DataHolder::to_outgroup(unsigned int sam, unsigned int label)

Move a sample to the outgroup.

The specified sample is moved to the outgroup and its group labels are discarded. Obviously, this decreases the ingroup size by 1, and increases the outgroup size accordingly.

Parameters
  • sam -

    ingroup sample index.

  • label -

    group label to assign to the sample after it is moved to the outgroup (use 0 if not relevant).

bool egglib::DataHolder::valid_phyml_aa()
const

True if all data are amino acids expected by PhyML (for alignment)

bool egglib::DataHolder::valid_phyml_names()
const

True if all names are acceptable for PhyML.

bool egglib::DataHolder::valid_phyml_nt()
const

True if all data are nucleotides expected by PhyML (for alignment)

GeneticCode

class

Hold genetic code tables.

Handle genetic codes. All genetic codes defined by the National Center for Biotechnology Information are supported.

Header: <egglib-cpp/GeneticCode.hpp>

Public Functions

egglib::GeneticCode::GeneticCode(unsigned int index)

Constructor.

Build an instance of the genetic code.

egglib::GeneticCode::GeneticCode()

Default constructor.

Like GeneticCode(unsigned int), except that the code used is 1 (standard).

char egglib::GeneticCode::aminoacid(unsigned int codon)
const

Returns the translation of a codon.

The codon should be passed as an integer code (see codon()). This method returns the one-letter amino acid code for valid codons (represented by integers in the range 0-63), and ‘X’ for any other integer.
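
A table lookup of this kind can be sketched as follows. This is a standalone illustration, not EggLib code. It uses the standard genetic code (NCBI translation table 1) written in the NCBI codon order TTT, TTC, TTA, TTG, TCT, and so on; that EggLib numbers codons in this same order is an assumption, since the codon table is not reproduced in this section. The function name aminoacid_sketch is hypothetical.

```cpp
#include <cassert>
#include <string>

// Standard genetic code, one letter per codon, codons ordered with
// bases T, C, A, G at each position (assumed numbering; '*' marks stops).
const std::string STANDARD_CODE =
    "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG";

// Translate a codon given as an integer code; 'X' for anything outside 0-63.
char aminoacid_sketch(unsigned int codon) {
    return codon < 64 ? STANDARD_CODE[codon] : 'X';
}
```

Under the assumed numbering, code 0 is TTT (phenylalanine) and code 35 is ATG (methionine).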

unsigned int egglib::GeneticCode::get_code()
const

Get the current genetic code.

const char *egglib::GeneticCode::name()
const

Get the name of the genetic code.

double egglib::GeneticCode::NSsites(unsigned int codon, bool ignorestop)
const

Give the number of non-synonymous sites of a codon.

The number is in the range 0-3 (3 means that all changes at all three positions would lead to a non-synonymous change).

Parameters
  • codon -

    codon integer code (see codon()).

  • ignorestop -

    if true, potential changes to stop codons are excluded and all stop codons return 0; if false, changes to stop codons are considered to be non-synonymous.
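
One common way to obtain such a count is to consider, for each codon position, the fraction of the three possible single-base changes that alter the amino acid, and to sum these fractions over the three positions. The sketch below follows that scheme for the standard genetic code with changes to stop codons counted as non-synonymous (the ignorestop = false behaviour). It is an illustration under stated assumptions, not EggLib's implementation: the counting scheme, the codon numbering (bases ordered T, C, A, G, first base most significant) and the function names are all assumptions.

```cpp
#include <cassert>
#include <cmath>
#include <string>

// Standard genetic code in the assumed codon order (TTT, TTC, TTA, TTG, ...).
const std::string CODE =
    "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG";

// Number of non-synonymous sites of a codon (0-3): for each position, count
// the single-base changes that alter the translation, divided by 3.
double ns_sites_sketch(unsigned int codon) {
    assert(codon < 64);
    const unsigned int weight[3] = {16, 4, 1};   // base-4 weight of each position
    double total = 0.0;
    for (unsigned int p = 0; p < 3; ++p) {
        const unsigned int base = (codon / weight[p]) % 4;
        unsigned int nonsyn = 0;
        for (unsigned int alt = 0; alt < 4; ++alt) {
            if (alt == base) continue;
            const unsigned int mutant = codon - base * weight[p] + alt * weight[p];
            if (CODE[mutant] != CODE[codon]) ++nonsyn;   // includes changes to stop
        }
        total += nonsyn / 3.0;
    }
    return total;
}

// Complementary count of synonymous sites under the same scheme.
double s_sites_sketch(unsigned int codon) { return 3.0 - ns_sites_sketch(codon); }
```

For TTT (code 0 under the assumed numbering), only the change to TTC is synonymous, which gives 3 - 1/3 non-synonymous sites.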

double egglib::GeneticCode::NSsites(const SiteHolder &site1, const SiteHolder &site2, const SiteHolder &site3, unsigned int &num_samples, bool ignorestop, int A, int C, int G, int T)
const

Give the number of non-synonymous sites of a codon site.

The number is in the range 0-3 (3 means that all changes at all three positions would lead to a non-synonymous change). This is the same as NSsites(unsigned int, bool), but averaged over all samples based on the provided SiteHolder instances.

All three codon positions must have the same number of samples, such that the ith nucleotides at the three sites give the codon for the ith sample. (Same ploidy as well.)

Parameters
  • site1 -

    first position of the codon site.

  • site2 -

    second position of the codon site.

  • site3 -

    third position of the codon site.

  • num_samples -

    variable used to provide the number of samples analyzed by the method (that is, number of samples minus number of samples containing at least one missing data). The original value of the variable is ignored and is modified by the instance. If 0, the return value should be ignored.

  • ignorestop -

    if true, potential changes to stop codons are excluded and all stop codons are treated as missing data; if false, changes to stop codons are considered to be non-synonymous.

  • A -

    integer value representing bases A.

  • C -

    integer value representing bases C.

  • G -

    integer value representing bases G.

  • T -

    integer value representing bases T.

Any nucleotide represented by an integer allele value that matches none of the four values passed as the A, C, G and T arguments is considered to be missing data.

Warning
This method will not accept coding sequences mixing upper and lower case characters. It is however possible to configure how the four nucleotides are represented.

double egglib::GeneticCode::NSsites(const SiteHolder &codons, unsigned int &num_samples, bool ignorestop, int A, int C, int G, int T)
const

Give the number of non-synonymous sites of a codon site.

See NSsites(const SiteHolder&, const SiteHolder&, const SiteHolder&, unsigned int&, bool, int, int, int, int). This version does the same, except that it takes a single SiteHolder reference instead of three, and that this SiteHolder contains integer alleles representing codons, that is, values in the range 0-63. Other values are considered to be missing data.

void egglib::GeneticCode::set_code(unsigned int index)

Set the genetic code.

double egglib::GeneticCode::Ssites(unsigned int codon, bool ignorestop)
const

Give the number of synonymous sites of a codon.

The number is in the range 0-3 (3 means that all changes at all three positions would lead to a synonymous change).

Parameters
  • codon -

    codon integer code (see codon()).

  • ignorestop -

    if true, potential changes to stop codons are excluded and all stop codons return 0; if false, changes to stop codons are considered to be non-synonymous.

double egglib::GeneticCode::Ssites(const SiteHolder &site1, const SiteHolder &site2, const SiteHolder &site3, unsigned int &num_samples, bool ignorestop, int A, int C, int G, int T)
const

Like NSsites, but for synonymous sites.

double egglib::GeneticCode::Ssites(const SiteHolder &codons, unsigned int &num_samples, bool ignorestop, int A, int C, int G, int T)
const

Like NSsites, but for synonymous sites.

bool egglib::GeneticCode::start(unsigned int codon)
const

Tells if a codon is an initiation codon.

The codon should be passed as an integer code (see codon()). This method returns true if the codon is an initiation codon (including any of the alternative initiation codons known for the genetic code set for the current object), and false otherwise (including for invalid codon codes).

int egglib::GeneticCode::translate(int first, int second, int third, bool smart)

Translate a codon directly.

Codon positions should be ASCII-coded. Returns ‘X’ in case of missing data or invalid nucleotides. For fourfold degenerate positions. Codons including non-ambiguity characters always return ‘X’ (even ‘?’ at a fourfold degenerate position), except if the codon is ‘’ (in that case, ‘-’ is returned).

Return
ASCII-coded amino acid.
Parameters
  • first -

    first codon position.

  • second -

    second codon position.

  • third -

    third codon position.

  • smart -

    smart translation.

Public Static Functions

static char egglib::GeneticCode::base(unsigned int codon, unsigned int index)

Returns one of the bases of a codon.

This method can be called on the class directly (as in GeneticCode::base(0, 0)) and it is not dependent on the specification of a genetic code.

Return
Returns the character at the specified position of the codon (as an upper-case character).
Warning
The method returns ‘?’ if it cannot perform base extraction, but it is not guaranteed that all invalid arguments will be detected properly.
Parameters
  • codon -

    the codon should be passed as an integer code (see codon()). Only values in the range 0-63 are supported.

  • index -

    index of the base to extract (only 0, 1 and 2 are accepted; other values will result in aberrant outcome).

unsigned int egglib::GeneticCode::codon(char first, char second, char third)

Return the integer code for a codon.

The first, second and third bases of the codon must be passed as character arguments. The case of characters is ignored. Returns an integer in the range [0, 63] for the 64 codons, and egglib::UNKNOWN for all other triplets.

Warning
The base ‘U’, although biologically relevant, is treated as an invalid base.
Note
As a static method, this method can be called as GeneticCode::codon(base1, base2, base3) directly (it does not require instantiation of an object) and it is not dependent on any genetic code specification.

static bool egglib::GeneticCode::diff1(unsigned int codon1, unsigned int codon2)

Check if the first position of two codons is identical.

static bool egglib::GeneticCode::diff2(unsigned int codon1, unsigned int codon2)

Check if the second position of two codons is identical.

static bool egglib::GeneticCode::diff3(unsigned int codon1, unsigned int codon2)

Check if the third position of two codons is identical.

static unsigned int egglib::GeneticCode::int2codon(unsigned int base1, unsigned int base2, unsigned int base3)

Return the integer code for a codon.

The first, second and third bases of the codon must be passed as integer codes, according to the following mapping: 0 for T, 1 for C, 2 for A and 3 for G. This mapping must be strictly followed and no other value may be passed. Returns an integer in the range [0, 63] for the 64 codons (see the documentation of the method codon(char, char, char)).

Note
As a static method, this method can be called as GeneticCode::int2codon(base1, base2, base3) directly (it does not require instantiation of an object) and it is not dependent on any genetic code specification.
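
Packing three base codes in the range 0-3 into a single integer in the range 0-63 amounts to reading them as a three-digit number in base 4. The sketch below assumes that the first codon position is the most significant digit; this is an assumption, not something stated in this section, and the function names are hypothetical.

```cpp
#include <cassert>

// Combine three base codes (each 0-3) into a codon code (0-63),
// assuming the first position is the most significant base-4 digit.
unsigned int int2codon_sketch(unsigned int b1, unsigned int b2, unsigned int b3) {
    assert(b1 < 4 && b2 < 4 && b3 < 4);
    return 16 * b1 + 4 * b2 + b3;
}

// Recover the base code at position index (0, 1 or 2) of a codon code.
unsigned int base_code_sketch(unsigned int codon, unsigned int index) {
    assert(codon < 64 && index < 3);
    const unsigned int weight[3] = {16, 4, 1};
    return (codon / weight[index]) % 4;
}
```

The two functions are inverses of each other: extracting the three base codes of any value in 0-63 and recombining them gives back the same value.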

static unsigned int egglib::GeneticCode::ndiff(unsigned int codon1, unsigned int codon2)

Returns the number of nucleotide differences between two codons.

This method can be called on the class directly (as in GeneticCode::ndiff(0, 0)) and it is not dependent on the specification of a genetic code.

This method is only valid if both arguments are less than 64. Returns only 0, 1, 2 or 3.
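
Counting differences between two codon codes can be sketched by comparing their base-4 digits one by one. The sketch assumes that each codon code in the range 0-63 packs three base codes as base-4 digits; the result does not depend on which position is most significant. The function name is hypothetical and this is not EggLib's implementation.

```cpp
#include <cassert>

// Number of positions (0 to 3) at which two codon codes differ,
// assuming each code packs three base-4 digits.
unsigned int ndiff_sketch(unsigned int codon1, unsigned int codon2) {
    assert(codon1 < 64 && codon2 < 64);
    unsigned int n = 0;
    for (int p = 0; p < 3; ++p) {
        if (codon1 % 4 != codon2 % 4) ++n;   // compare the current digit
        codon1 /= 4;                          // move to the next digit
        codon2 /= 4;
    }
    return n;
}
```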

unsigned int egglib::GeneticCode::num_codes()

Get the number of available codes.

Random

class

Pseudo-random number generator.

This class implements the Mersenne Twister algorithm for pseudo-random number generation. It is based on work by Makoto Matsumoto and Takuji Nishimura (see http://www.math.sci.hiroshima-u.ac.jp/~m-MAT/MT/emt.html) and Jasper Bedaux (see http://www.bedaux.net/mtrand/) for the core generator, and the Random class of Egglib up to 2.2 for conversion to other laws than uniform.

Note that different instances of the class have independent chains of pseudo-random numbers. If several instances have the same seed, they will generate the exact same chain of pseudo-random numbers. Note that this also applies when the default constructor is used and instances are created within the same second.
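
The effect of reusing a seed can be demonstrated with the 32-bit Mersenne Twister of the C++ standard library, std::mt19937, used here as a stand-in for egglib::Random (the helper same_chain is hypothetical).

```cpp
#include <cassert>
#include <random>

// Return true if two generators seeded with seed_a and seed_b produce
// the same first n values.
bool same_chain(unsigned long seed_a, unsigned long seed_b, int n) {
    std::mt19937 a(static_cast<std::mt19937::result_type>(seed_a));
    std::mt19937 b(static_cast<std::mt19937::result_type>(seed_b));
    for (int i = 0; i < n; ++i) {
        if (a() != b()) return false;
    }
    return true;
}
```

Two generators seeded with the same value stay in lockstep, which is why concurrent instances or processes should each receive a distinct seed.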

All generators for non-uniform distributions are based either on the rand_int32() or the standard (half-open, 32 bit) uniform() methods.

Header: <egglib-cpp/Random.hpp>

Public Functions

egglib::Random::Random()

Constructor with default seed.

Uses the current system clock second as seed.

egglib::Random::Random(unsigned long s)

Constructor with custom seed.

Favor large, high-complexity seeds. When using different instances of Random in a program, or different processes using Random, ensure they are all seeded using different seeds.

virtual egglib::Random::~Random()

Destructor.

unsigned long egglib::Random::binomrand(long n, double p)

Draws a number from a binomial law.

Parameters
  • n -

    number of tests (must be >=0).

  • p -

    test probability.

bool egglib::Random::brand()

Draws a random boolean.

Return true with probability 0.5.

double egglib::Random::erand(double expectation)

Draws a number from an exponential distribution.

Beware, the argument is the distribution’s mean (and is also 1/lambda where lambda is the rate parameter).
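
The parameterization by the mean can be illustrated with inverse-transform sampling, a standard way to turn a uniform number into an exponential variate. Whether EggLib uses this exact method is an assumption; the function name is hypothetical.

```cpp
#include <cassert>
#include <cmath>

// Map a uniform number u in [0, 1) to an exponential variate whose
// expectation is `mean` (that is, whose rate is 1 / mean).
double exponential_from_uniform(double mean, double u) {
    assert(mean > 0.0 && u >= 0.0 && u < 1.0);
    return -mean * std::log(1.0 - u);
}
```

With u = 0.5 the function returns the median of the distribution, mean times ln 2, so a mean of 10 gives about 6.93 where a rate of 10 would give about 0.069.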

unsigned long egglib::Random::get_seed()
const

Get seed value.

Return the value of the seed that was used to initiate the instance. If the generator was re-seeded, return the seed value passed at that point.

unsigned int egglib::Random::grand(double param)

Draws a number from a geometric law.

The argument is the geometric law parameter.

unsigned int egglib::Random::irand(unsigned int ncards)

Draws a uniform integer.

The argument is the number of values that can be generated. Returns an integer in the range [0, ncards-1]. Therefore, ncards is not included in the range.

double egglib::Random::nrand()

Draws a number from a normal distribution.

Returns a normal variate with expectation 0 and standard deviation 1. The algorithm used is the polar form of the Box-Muller algorithm. A draw is performed every two calls unless the instance is re-seeded.
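
The polar form of the Box-Muller algorithm produces two independent normal values per draw, which explains why a new draw is needed only every second call. The class below is a standalone sketch built on std::mt19937, not EggLib's code; the class and member names are hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <random>

// Standard normal generator using the polar Box-Muller method.
// Each draw yields two values; the second one is cached for the next call.
class NormalSketch {
public:
    explicit NormalSketch(unsigned long seed)
        : gen_(static_cast<std::mt19937::result_type>(seed)),
          has_spare_(false), spare_(0.0) {}

    double next() {
        if (has_spare_) { has_spare_ = false; return spare_; }
        double u, v, s;
        do {                                   // pick a point inside the unit disc
            u = 2.0 * uniform() - 1.0;
            v = 2.0 * uniform() - 1.0;
            s = u * u + v * v;
        } while (s >= 1.0 || s == 0.0);
        const double factor = std::sqrt(-2.0 * std::log(s) / s);
        spare_ = v * factor;                   // keep the second value
        has_spare_ = true;
        return u * factor;
    }

private:
    double uniform() { return gen_() / 4294967296.0; }   // uniform in [0, 1)

    std::mt19937 gen_;
    bool has_spare_;
    double spare_;
};
```

A re-seeding method would also need to discard the cached value, otherwise the first value returned after re-seeding would still come from the previous chain.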

unsigned int egglib::Random::prand(double p)

Draws an integer from a Poisson distribution.

The argument is the Poisson distribution parameter.

unsigned long egglib::Random::rand_int32()

Generate a 32-bit random integer.

Returns an integer in the range [0, 4294967295] (that is, in the range [0, 2^32-1]).

void egglib::Random::set_seed(unsigned long s)

Re-seed an instance.

Favor large, high-complexity seeds. When using different instances of Random in a program, or different processes using Random, ensure they are all seeded using different seeds.

double egglib::Random::uniform()

Generate a real in the half-open interval [0,1)

0 is included but not 1.

double egglib::Random::uniform53()

Generate a 53-bit real.

The value has increased precision: uniform pseudo-random numbers based on a 32-bit integer can take only a finite number of values (2^32 of them). This method increases the resolution of return values, at the cost of increased computing time.

double egglib::Random::uniformcl()

Generate a real in the closed interval [0,1].

Both 0 and 1 are included.

double egglib::Random::uniformop()

Generate a real in the open interval (0,1)

Neither 0 nor 1 is included.
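
The four kinds of uniform reals differ only in how 32-bit integers are mapped to the unit interval. The conversions below are the ones commonly used with the Mersenne Twister; that EggLib uses exactly these formulas is an assumption, and the function names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Closed interval [0, 1]: divide by the largest 32-bit value.
double to_closed(std::uint32_t x) { return x / 4294967295.0; }

// Half-open interval [0, 1): divide by 2^32.
double to_half_open(std::uint32_t x) { return x / 4294967296.0; }

// Open interval (0, 1): shift by one half before dividing by 2^32.
double to_open(std::uint32_t x) { return (x + 0.5) / 4294967296.0; }

// 53-bit resolution in [0, 1): combine 27 bits of x and 26 bits of y.
double to_res53(std::uint32_t x, std::uint32_t y) {
    const std::uint32_t a = x >> 5;   // upper 27 bits
    const std::uint32_t b = y >> 6;   // upper 26 bits
    return (a * 67108864.0 + b) / 9007199254740992.0;   // (a * 2^26 + b) / 2^53
}
```

The 53-bit version consumes two 32-bit integers per value, which accounts for its higher computing cost.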

Model fitting utilities

ABC

class

Model estimation by Approximate Bayesian Computation.

The number of statistics and at least one input file name must be set before performing the analysis. The analysis itself consists of several steps. (1) Computation of the threshold, which requires reading through all files and importing statistics. In the process, the standard deviation of all statistics is calculated and made available. (2) Computation of Euclidean distances and weights, and generation of a second sample file with weights and (non-standardized) statistics, where only samples with non-null weights are exported. This step requires that the observed statistics have been set (between steps 1 and 2). (3) Local-linear regression using the regression() method. While in the previous steps several models can be mixed, in this step only a single model can be processed at a time. The output is a simple file with adjusted parameters only.
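
The quantities involved in the first two steps can be sketched as three small functions: a Euclidean distance between standardized statistics, a threshold chosen as a quantile of the distances, and a weight that decreases with distance. This is a generic illustration under stated assumptions, not EggLib's implementation: the Epanechnikov-type kernel, the rounding rule for the number of accepted points and all names are assumptions.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Euclidean distance between simulated and observed statistics,
// each statistic being scaled by its standard deviation.
double scaled_distance(const std::vector<double>& stats,
                       const std::vector<double>& obs,
                       const std::vector<double>& sd) {
    assert(stats.size() == obs.size() && obs.size() == sd.size());
    double sum = 0.0;
    for (std::size_t i = 0; i < stats.size(); ++i) {
        const double d = (stats[i] - obs[i]) / sd[i];
        sum += d * d;
    }
    return std::sqrt(sum);
}

// Distance below which a proportion `tolerance` of the points falls.
double distance_threshold(std::vector<double> distances, double tolerance) {
    assert(!distances.empty() && tolerance > 0.0 && tolerance <= 1.0);
    std::sort(distances.begin(), distances.end());
    std::size_t keep = static_cast<std::size_t>(tolerance * distances.size());
    if (keep == 0) keep = 1;                 // always keep at least one point
    return distances[keep - 1];
}

// Weight of a point: 1 at distance 0, decreasing to 0 at the threshold
// (assumed kernel); points beyond the threshold get a null weight.
double kernel_weight(double distance, double threshold) {
    if (threshold <= 0.0 || distance >= threshold) return 0.0;
    const double r = distance / threshold;
    return 1.0 - r * r;
}
```

Points with a null weight are the ones that a rejection step would leave out of the intermediate file.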

Public Types

enum egglib::ABC::TransformMode

Modes for parameter transformation.

Values:

Public Functions

egglib::ABC::ABC()

Constructor.

egglib::ABC::~ABC()

Destructor.

void egglib::ABC::add_fname(const char *fname, unsigned int number_of_params)

Adds a file name.

void egglib::ABC::get_threshold(double tolerance)

Gets the regression threshold.

If several files are loaded, the data will be aggregated (note that they must all contain the same number of statistics). At least one file must have been set, and the number of statistics must have been set as well.

Parameters
  • tolerance -

    rejection threshold (proportion of points in the local region).

unsigned int egglib::ABC::number_of_samples()
const

Gets number of imported data samples.

unsigned int egglib::ABC::number_of_samples_part(unsigned int i)
const

Gets number of imported data samples for a given file.

void egglib::ABC::number_of_statistics(unsigned int ns)

Sets number of statistics.

If data was already present in the instance, it will all be cleared.

void egglib::ABC::obs(unsigned int index, double value)

Sets an observed summary statistic.

The number of statistics must have been set, and the index must not be out of bound.

unsigned int egglib::ABC::regression(const char *infname, const char *outfname, TransformMode mode, const char *header)

Performs regression step.

Return
Number of data points processed.
Parameters
  • infname -

    input file name (generated using rejection)

  • outfname -

    output file name (final posterior)

  • mode -

    transformation mode

  • header -

    output file header (name of parameters; if empty string, no header is printed)

unsigned int egglib::ABC::rejection(const char *outfname, bool exportlabels, bool strip)

Performs rejection step.

The observed value must have been entered (otherwise the results will be meaningless), and the threshold must have been computed.

Return
the number of points in the local region.
Parameters
  • outfname -

    the name of the intermediary file.

  • exportlabels -

    if true: exports a tag at the beginning of each line to identify the file of origin of each accepted sample (starting from 1).

  • strip -

    if true: remove statistics and weights (only export parameters of accepted points; then the file cannot be used for regression).

double egglib::ABC::sd(unsigned int index)
const

Gets a standard deviation.

The get_threshold() method must have been called, and the index must not be out of bound.

double egglib::ABC::threshold()
const

Gets Euclidean threshold.

Neural networks

class

Training and prediction with neural networks.

This class implements the backpropagation algorithm for training neural networks.

The constructor of Network generates a bare, unusable instance. It is required to set up the network with the setup() method and to initialize weights (normally using the init_weights() method, or by manually setting all weights) before calling train(). More information is available in the documentation of the setup() method.

Header: <egglib-cpp/Neural.hpp>

Public Functions

egglib::nnet::Network::Network()

Constructor.

virtual egglib::nnet::Network::~Network()

Destructor.

void egglib::nnet::Network::init_weights(double range_input, double range_output)

Initialize weights.

It is required to call this method before training, otherwise the weights have undefined values. All weights are initialized to random values from a uniform distribution in the range [-X, X], where X is the range_input argument for all weights connecting input variables to neurons of the hidden layer, and the range_output argument for all weights connecting neurons of the hidden layer to the output neurons.

Parameters
  • range_input -

    bound for first-level weights.

  • range_output -

    bound for second-level weights.
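As an illustration of this kind of initialization, here is a minimal sketch that draws weights uniformly around zero. The helper name `init_uniform` and the use of `std::mt19937` are assumptions; EggLib uses its own Random class, and its exact sampling scheme is not described here.

```cpp
#include <cassert>
#include <cstddef>
#include <random>
#include <vector>

// Hypothetical helper (not part of EggLib): fills `weights` with draws from
// a uniform distribution centred on zero. Note that the standard
// distribution samples the half-open interval [-range, +range).
void init_uniform(std::vector<double>& weights, double range, std::mt19937& rng) {
    assert(range > 0.0);
    std::uniform_real_distribution<double> dist(-range, range);
    for (std::size_t i = 0; i < weights.size(); ++i) {
        weights[i] = dist(rng);
    }
}
```

Under this sketch, one call with range_input would fill the input-to-hidden weights and a second call with range_output would fill the hidden-to-output weights.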

unsigned int egglib::nnet::Network::num_iter() const

Get the number of training iterations.

The internal counter is reset by setup().

void egglib::nnet::Network::predict(const Data &data, unsigned int pattern, bool compute_error)

Use the neural network to predict output.

The network must have been trained using a Data instance with the same number of input and output variables. This method generates an array of predicted values based on the current values of the weights (after training) that can be accessed using the prediction() method. If the compute_error argument is true, the error can be accessed using error().

Parameters
  • data -

    data set with the correct number of input (always) and output variables (unless compute_error is false).

  • pattern -

    pattern to process.

  • compute_error -

    if false, don’t compute errors, and output variables are not considered.

double egglib::nnet::Network::prediction(unsigned int output_var) const

Get a predicted output.

The predict() method must have been called. Get the predicted value for one of the output variables. Warning: training (using the method train()) modifies the predicted values and invalidates the results of the method predict().

Parameters
  • output_var -

    index of the output variable (must be smaller than the number of output variables defined by both the Data instances used for training and for prediction).

void egglib::nnet::Network::setup(const Data &training_data, unsigned int training_start, unsigned int training_stop, const Data &testing_data, unsigned int testing_start, unsigned int testing_stop, unsigned int num_hidden, ActivationFunction fun1, ActivationFunction fun2, double rate1, double rate2, double bound, double momentum, Random *random)

Setup the network.

Upon call to this method, the network is set up based on the number of input and output variables of the provided data set, and the specified number of neurons in the hidden layer. Note that all neurons (from both the hidden and output layers) are automatically connected to an additional neuron generating a constant input of 1.0 (the bias).

Loading a training data set (using the training_data argument) is required. The first and last-plus-one indexes must be passed to indicate which patterns must be processed for training. Note that the pattern corresponding to the training_stop argument is not included. It is possible to use a training_stop value larger than the number of patterns in training_data. To use all patterns, use training_start=0 and training_stop=egglib::MAX.

Loading a test data set is not mandatory but advisable. The test data set is not used for training but allows evaluating the predictive ability of the network. To skip this option, pass any data set as testing_data (for example, the same as training_data) and set testing_start >= testing_stop.

Parameters
  • training_data -

    a training data set with at least one input variable, at least one output variable, at least one pattern and all data loaded. The object passed must not be modified until training is finished.

  • training_start -

    index of the first pattern to process for training.

  • training_stop -

    index of the pattern immediately after the last pattern to process for training.

  • testing_data -

    a data set to use for testing the predictive ability of the network (not used for fitting). The data set must have the same number of input and output variables as training_data (preferably it will be the same object, with non-overlapping ranges of patterns to process).

  • testing_start -

    index of the first pattern to process for testing.

  • testing_stop -

    index of the pattern immediately after the last pattern to process for testing.

  • num_hidden -

    number of neurons in the hidden layer of the network.

  • fun1 -

    activation function to use for neurons of the hidden layer.

  • fun2 -

    activation function to use for output neurons.

  • rate1 -

    rate for weights from input to hidden layer.

  • rate2 -

    rate for weights from hidden layer to output.

  • bound -

    extreme value (for both signs) for weight values.

  • momentum -

    proportion of the previous weight change to apply to all weight changes (use 0.0 to skip momentum).

  • random -

    Random object to use for generating random numbers (only used for initial weights).
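The start/stop arguments describe half-open ranges of patterns. The sketch below shows one way such a range could be resolved, assuming (as the description suggests, but without access to the actual implementation) that a stop value beyond the number of patterns is clamped and that start >= stop selects nothing. The names `Range` and `clamp_range` are hypothetical.

```cpp
#include <algorithm>
#include <cassert>

// Half-open range of pattern indexes: [first, last).
struct Range {
    unsigned int first;
    unsigned int last;
};

// Hypothetical helper (not part of EggLib): resolves a (start, stop) pair
// against the number of patterns actually available.
Range clamp_range(unsigned int start, unsigned int stop, unsigned int num_patterns) {
    Range r;
    r.last = std::min(stop, num_patterns);  // stop may exceed the data size
    r.first = std::min(start, r.last);      // start >= stop gives an empty range
    assert(r.first <= r.last);
    return r;
}
```

With 10 patterns, `clamp_range(0, 4294967295u, 10)` selects all of them, while `clamp_range(5, 5, 10)` selects none, which corresponds to skipping the test data set.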

double egglib::nnet::Network::testing_error() const

Get total error for testing data.

As training_error() but for the testing data set. Only valid if testing data has been passed to setup().

void egglib::nnet::Network::train(unsigned int num_iter)

Train the neural network.

Train the network for a fixed number of iterations. This method must be called iteratively until a given stop criterion is fulfilled. After each call to train(), it is possible to access the current values of weights, the error for training data and, if testing data have been loaded, the error for testing data.
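A possible driver loop is sketched below. It is written as a template so that it compiles without EggLib: `Net` is any type offering train(unsigned int) and testing_error(), the two methods documented here. The function name `train_until`, the stop criterion (insufficient gain in testing error between two chunks) and the `ToyNet` stand-in are all assumptions made for illustration.

```cpp
#include <cassert>

// Hypothetical driver (not part of EggLib): trains by chunks of `chunk`
// iterations until the testing error improves by less than `min_gain`
// or at least `max_iter` iterations have been run.
// Returns the number of iterations performed.
template <typename Net>
unsigned int train_until(Net& net, unsigned int chunk,
                         unsigned int max_iter, double min_gain) {
    assert(chunk > 0);
    net.train(chunk);
    unsigned int done = chunk;
    double best = net.testing_error();
    while (done < max_iter) {
        net.train(chunk);
        done += chunk;
        double err = net.testing_error();
        if (best - err < min_gain) break;  // no meaningful improvement
        best = err;
    }
    return done;
}

// Stand-in used only to exercise the sketch: its error is halved by
// every training iteration.
struct ToyNet {
    double err;
    unsigned int iters;
    ToyNet() : err(1.0), iters(0) {}
    void train(unsigned int n) {
        for (unsigned int i = 0; i < n; ++i) err *= 0.5;
        iters += n;
    }
    double testing_error() const { return err; }
};
```

Monitoring the testing error rather than the training error is a common way to limit overfitting; other criteria (a fixed budget, a target error) fit the same loop.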

double egglib::nnet::Network::training_error() const

Get total error for training data.

The error is computed as sqrt(sum((pred[i]-obs[i])^2)), where sqrt is the square root function, sum is the sum over all output nodes, pred[i] is the predicted value provided by output neuron i and obs[i] is the observed value for output variable i. Computed using the training data only. Requires train(), but is also modified if predict() is called with compute_error=true.
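The formula can be written out as follows for a single pattern. This is only a transcription of the expression above; the function name `pattern_error` is hypothetical, and how the library aggregates this quantity over patterns to obtain the total error is not specified here.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper (not part of EggLib):
// sqrt(sum((pred[i] - obs[i])^2)) over all output variables.
double pattern_error(const std::vector<double>& pred,
                     const std::vector<double>& obs) {
    assert(pred.size() == obs.size());
    double sum = 0.0;
    for (std::size_t i = 0; i < pred.size(); ++i) {
        double d = pred[i] - obs[i];
        sum += d * d;
    }
    return std::sqrt(sum);
}
```

For predictions (3, 0) and observations (0, 4), the differences are 3 and -4, so the error is sqrt(9 + 16) = 5.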

double egglib::nnet::Network::weight_hidden(unsigned int i, unsigned int j) const

Value of a weight to a hidden neuron.

Access to the current value of the weight of the connexion of a hidden neuron to an input variable or to the bias neuron. The bias weight’s index is equal to the number of input variables.

Parameters
  • i -

    index of the hidden neuron.

  • j -

    index of the input variable.
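To make the index convention concrete, the sketch below computes the net input of one hidden neuron from a row of weights in which the bias weight sits at the index equal to the number of input variables. The function name `net_input` and the vector-based storage are assumptions; only the position of the bias weight and the constant bias input of 1.0 come from the description.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not part of EggLib): weighted sum for one hidden
// neuron. `weights` holds one weight per input variable followed by the
// bias weight, so weights.size() == inputs.size() + 1.
double net_input(const std::vector<double>& weights,
                 const std::vector<double>& inputs) {
    assert(weights.size() == inputs.size() + 1);
    double sum = 0.0;
    for (std::size_t j = 0; j < inputs.size(); ++j) {
        sum += weights[j] * inputs[j];
    }
    sum += weights[inputs.size()] * 1.0;  // bias neuron: constant input of 1.0
    return sum;
}
```

The same layout applies one level up for weight_output(), with the bias weight at the index equal to the number of hidden neurons.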

void egglib::nnet::Network::weight_hidden(unsigned int i, unsigned int j, double value)

Set the value of a weight to a hidden neuron.

Set the value of the weight of the connexion of a hidden neuron to an input variable or to the bias neuron. The bias weight’s index is equal to the number of input variables.

Parameters
  • i -

    index of the hidden neuron.

  • j -

    index of the input variable.

  • value -

    weight value.

double egglib::nnet::Network::weight_output(unsigned int i, unsigned int j) const

Value of a weight to an output neuron.

Access to the current value of the weight of the connexion of an output neuron to a hidden neuron or to the bias neuron. The bias weight’s index is equal to the number of hidden neurons.

Parameters
  • i -

    index of the output neuron.

  • j -

    index of the hidden neuron.

void egglib::nnet::Network::weight_output(unsigned int i, unsigned int j, double value)

Set the value of a weight to an output neuron.

Set the value of the weight of the connexion of an output neuron to a hidden neuron or to the bias neuron. The bias weight’s index is equal to the number of hidden neurons.

Parameters
  • i -

    index of the output neuron.

  • j -

    index of the hidden neuron.

  • value -

    weight value.

class

Neuron, that is a node of a neural network

Header: <egglib-cpp/Neural.hpp>

Public Functions

egglib::nnet::Neuron::Neuron()

Create a neuron.

The neuron is created naked, empty and unusable. The user must use config() before doing anything with it. To reuse a neuron after it has been used (e.g. for training the same network again), it is required to call config() again.

virtual egglib::nnet::Neuron::~Neuron()

Delete a neuron.

void egglib::nnet::Neuron::activate()

Activate the neuron.

Process all incoming connexions, apply weights and the activation function to generate the output.

void egglib::nnet::Neuron::config(ActivationFunction fun, unsigned int num_input)

Set up the neuron.

The object is reset, but previous data may not have been reinitialized.

Parameters
  • fun -

    function used for activation of this neuron.

  • num_input -

    number of incoming connexions (either neurons of the previous layer or input variables) of this neuron.

double egglib::nnet::Neuron::delta(unsigned int index) const

Get the delta value for a given weight.

The propagate() method must have been called, which itself requires that the neuron had been previously loaded with all needed data and activated. This method returns the change of the specified weight value based on the propagated error. The delta value is computed as f’(I) * E[index] where f’ is the derivative of the activation function, I is the input value for this neuron and E[index] is the error associated with the weight in question.

double egglib::nnet::Neuron::get_output() const

Collect the output.

Get the output value. Calling this method does not update the output if any input data has changed (make sure to call the activate() method for this).

double egglib::nnet::Neuron::get_weight(unsigned int index) const

Get a weight.

Get the current value of the weight applied to a given incoming connexion.

void egglib::nnet::Neuron::propagate(unsigned int index, double val)

Propagate error for a given weight.

This method can only be used with neurons that have loaded input and have been activated. After the error has been propagated, the delta value can be obtained using delta(), typically for propagating errors to the previous layer. The delta value is the value passed as val to this method multiplied by the value of the derivative of the activation function at the current input value of this neuron. Note that the neuron’s weights are not modified by this method, but only when update() is called.

Parameters
  • index -

    index of one of the weights of this neuron.

  • val -

    propagated error for this weight (for an output neuron: this neuron’s error times the output value of the corresponding neuron of the hidden layer; for a hidden neuron: the sum of delta values for all output neurons, weighted by the corresponding weights).
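The delta computation described above (the propagated value multiplied by the derivative of the activation function at the neuron's input) can be sketched as follows. The logistic function is used purely as an example of an activation function; which functions ActivationFunction actually offers is not stated here, and all names in the block are hypothetical.

```cpp
#include <cassert>
#include <cmath>

// Example activation function (an assumption): logistic sigmoid.
double logistic(double x) {
    return 1.0 / (1.0 + std::exp(-x));
}

// Derivative of the logistic function: f'(x) = f(x) * (1 - f(x)).
double logistic_derivative(double x) {
    double y = logistic(x);
    return y * (1.0 - y);
}

// Hypothetical helper (not part of EggLib): delta = f'(I) * E, where I is
// the neuron's input value and E the error propagated for one weight.
double delta(double input, double propagated_error) {
    assert(std::isfinite(input));
    return logistic_derivative(input) * propagated_error;
}
```

At an input of 0 the logistic derivative is 0.25, so a propagated error of 2.0 gives a delta of 0.5. For large absolute inputs the derivative approaches zero and the delta shrinks accordingly.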

void egglib::nnet::Neuron::set_input(unsigned int index, double value)

Set an input value.

Load a value for a given incoming connexion. Until changed, the value will be remembered. Changing the value does not update the output value of this neuron.

void egglib::nnet::Neuron::set_weight(unsigned int index, double value)

Set a weight.

Load the value of the weight to be applied to a given incoming connexion. Until changed, the value will be remembered. Changing the value does not update the output value of this neuron.

void egglib::nnet::Neuron::update(double rate, double bound, double momentum)

Update all weights.

The propagate() method must have been called for all weights. This method applies all delta values.

Parameters
  • rate -

    learning rate.

  • bound -

    absolute limit value: weights are bound to the range [-bound, +bound].

  • momentum -

    proportion of the previous weight change to apply to the new change.
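One common way to combine a learning rate, a momentum term and a bound on weights is sketched below. It is an assumption about the general technique, not a transcription of update(): in particular, the sign convention (the delta is taken to point in the direction of the desired change) and the names `Weight` and `update_weight` are hypothetical.

```cpp
#include <algorithm>
#include <cassert>

// One weight together with the change applied at the previous update.
struct Weight {
    double value;
    double last_change;
};

// Hypothetical helper (not part of EggLib): applies a delta to a weight.
void update_weight(Weight& w, double delta, double rate,
                   double bound, double momentum) {
    assert(bound >= 0.0);
    // New change: learning step plus a fraction of the previous change.
    double change = rate * delta + momentum * w.last_change;
    // Keep the weight within [-bound, +bound].
    w.value = std::max(-bound, std::min(bound, w.value + change));
    w.last_change = change;
}
```

With a momentum of 0.0 the previous change is ignored and the update reduces to a plain gradient step clamped to the allowed range.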

class

Holds input and output data for training neural networks.

When creating a data set, the user must first set the number of patterns with set_num_patterns(), of input variables with set_num_input() and of output variables with set_num_output(). Only then is it possible to load data using set_input() and set_output(). Make sure to load every declared slot, as data are not initialized.

Header: <egglib-cpp/Neural.hpp>

Note
If the same object must be reused with smaller values of num_input and/or num_output and a larger num_patterns, it is more efficient to call the methods set_num_input() and set_num_output() before set_num_patterns().

Public Functions

egglib::nnet::Data::Data()

Constructor.

egglib::nnet::Data::Data(const Data &src)

Copy constructor.

virtual egglib::nnet::Data::~Data()

Destructor.

double egglib::nnet::Data::get_input(unsigned int pattern, unsigned int variable) const

Get an input data item.

unsigned int egglib::nnet::Data::get_num_input() const

Get number of input variables.

unsigned int egglib::nnet::Data::get_num_output() const

Get number of output variables.

unsigned int egglib::nnet::Data::get_num_patterns() const

Get number of patterns.

double egglib::nnet::Data::get_output(unsigned int pattern, unsigned int variable) const

Get an output data item.

double egglib::nnet::Data::mean_input(unsigned int index) const

Get the mean for an input variable.

If data have been normalized using the method normalize(), this method returns the mean of a given input variable in the original data.

double egglib::nnet::Data::mean_output(unsigned int index) const

Get the mean for an output variable.

If data have been normalized using the method normalize(), this method returns the mean of a given output variable in the original data.

void egglib::nnet::Data::normalize()

Normalize data for all input and output variables.

All data are modified permanently and the average and standard deviation for each input and output variable are saved and remain accessible using the methods mean_input(), mean_output(), std_input() and std_output(), until normalize() is called again or the number of input or output variables or the number of patterns is modified.
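The kind of normalization described here is sketched below for a single variable: values are centred on their mean and divided by their standard deviation, and both statistics are returned so that they remain available. Whether EggLib uses the population or the sample standard deviation is not stated; the population form (division by n) is an assumption, as are the names `Stats` and `normalize_variable`.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Mean and standard deviation of the original (non-normalized) data.
struct Stats {
    double mean;
    double sd;
};

// Hypothetical helper (not part of EggLib): normalizes `x` in place and
// returns the statistics of the original values.
Stats normalize_variable(std::vector<double>& x) {
    assert(!x.empty());
    const double n = static_cast<double>(x.size());
    double mean = 0.0;
    for (std::size_t i = 0; i < x.size(); ++i) mean += x[i];
    mean /= n;
    double ss = 0.0;
    for (std::size_t i = 0; i < x.size(); ++i) {
        ss += (x[i] - mean) * (x[i] - mean);
    }
    const double sd = std::sqrt(ss / n);  // population standard deviation
    assert(sd > 0.0);                     // a constant variable cannot be scaled
    for (std::size_t i = 0; i < x.size(); ++i) x[i] = (x[i] - mean) / sd;
    Stats s;
    s.mean = mean;
    s.sd = sd;
    return s;
}
```

For the values 2, 4, 4, 4, 5, 5, 7, 9 the mean is 5 and the population standard deviation is 2, so the first value becomes -1.5 and the last becomes 2.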

void egglib::nnet::Data::normalize_input(unsigned int index, double mean, double std)

Normalize data for an input variable.

All data are modified permanently. The passed mean and standard deviation are not saved.

void egglib::nnet::Data::normalize_output(unsigned int index, double mean, double std)

Normalize data for an output variable.

All data are modified permanently. The passed mean and standard deviation are not saved.

Data &egglib::nnet::Data::operator=(const Data &src)

Copy assignment operator.

void egglib::nnet::Data::set_input(unsigned int pattern, unsigned int variable, double value)

Load an input data item.

void egglib::nnet::Data::set_num_input(unsigned int num)

Set number of input variables.

Invalidate all means and standard deviations if the data was previously normalized.

void egglib::nnet::Data::set_num_output(unsigned int num)

Set number of output variables.

Invalidate all means and standard deviations if the data was previously normalized.

void egglib::nnet::Data::set_num_patterns(unsigned int num)

Set number of patterns.

Invalidate all means and standard deviations if the data was previously normalized.

void egglib::nnet::Data::set_output(unsigned int pattern, unsigned int variable, double value)

Load an output data item.

double egglib::nnet::Data::std_input(unsigned int index) const

Get the standard deviation for an input variable.

If data have been normalized using the method normalize(), this method returns the standard deviation of a given input variable in the original data.

double egglib::nnet::Data::std_output(unsigned int index) const

Get the standard deviation for an output variable.

If data have been normalized using the method normalize(), this method returns the standard deviation of a given output variable in the original data.

void egglib::nnet::Data::unnormalize_output(unsigned int index, double mean, double std)

Unnormalize data for an output variable.

All data are modified permanently. The passed mean and standard deviation are not saved.
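Normalizing with an explicit mean and standard deviation, and undoing it, are inverse operations. The sketch below shows the arithmetic, assuming the usual z-score convention; the function names are hypothetical. A typical use would be to map predictions of a network trained on normalized outputs back to the original scale.

```cpp
#include <cassert>

// Hypothetical helper (not part of EggLib): z = (x - mean) / sd.
double normalize_value(double x, double mean, double sd) {
    assert(sd != 0.0);
    return (x - mean) / sd;
}

// Hypothetical helper (not part of EggLib): x = z * sd + mean.
double unnormalize_value(double z, double mean, double sd) {
    return z * sd + mean;
}
```

With a mean of 5 and a standard deviation of 2, the value 7 normalizes to 1, and 1 maps back to 7.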

Utilities

IntersperseAlign

class

Insert non-varying sites within alignments.

This class adds non-varying sites within an alignment at given positions. The procedure below must be strictly followed:

  • Create an instance. The constructor takes no arguments.
  • Load a DataHolder instance using load(). It is required that it is an alignment (a matrix). The instance will create an array of positions internally.
  • Specify the desired length of the final alignment using set_length().
  • Specify the position of all sites of the original alignment. This can be achieved in three ways:
    • Specify manually all positions as real numbers using set_position().
    • Specify manually all positions as extant indexes (as integer values) using set_round_position() for all positions. If this approach is used, it is necessary to set the round_positions argument of intersperse() to false.
    • Pass the reference to the Coalesce instance that has generated the alignment (assuming it is a simulation) and let the instance find the site positions itself, with the method get_positions().
  • Specify the list of allele values used for non-varying positions using set_num_alleles() and then set_allele() as many times as needed. If there is more than one allele, non-varying alleles will be picked randomly. This step can be skipped (by default, the value corresponding to A will be used).
  • Provide the address of a random number generator using set_random() (it is always needed).
  • Call intersperse(). This will change the DataHolder originally loaded.
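The effect of the procedure on a single sequence can be illustrated with the standalone sketch below. It does not use EggLib and makes several simplifying assumptions: relative positions are converted to indexes by rounding position * (length - 1), sites that round to the same index are not handled (the sketch requires strictly increasing indexes), and a single filler allele is used instead of a random pick. The function names are hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper: converts relative positions in [0, 1] to site
// indexes in a final alignment of `length` sites.
std::vector<unsigned int> to_indexes(const std::vector<double>& positions,
                                     unsigned int length) {
    assert(length > 0);
    std::vector<unsigned int> idx(positions.size());
    for (std::size_t i = 0; i < positions.size(); ++i) {
        assert(positions[i] >= 0.0 && positions[i] <= 1.0);
        idx[i] = static_cast<unsigned int>(
            std::lround(positions[i] * (length - 1)));
    }
    return idx;
}

// Hypothetical helper: builds one sequence of the final alignment. Original
// alleles are placed at their indexes; all other sites get `filler`.
std::vector<int> intersperse_one(const std::vector<int>& alleles,
                                 const std::vector<unsigned int>& idx,
                                 unsigned int length, int filler) {
    assert(alleles.size() == idx.size());
    std::vector<int> out(length, filler);
    for (std::size_t i = 0; i < idx.size(); ++i) {
        assert(idx[i] < length);
        assert(i == 0 || idx[i] > idx[i - 1]);  // simplification: no ties
        out[idx[i]] = alleles[i];
    }
    return out;
}
```

For three sites at positions 0, 0.5 and 1 and a final length of 5, the indexes are 0, 2 and 4, and the two remaining sites receive the filler allele.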

Header: <egglib-cpp/DataHolder.hpp>

Public Functions

egglib::IntersperseAlign::IntersperseAlign()

Constructor.

egglib::IntersperseAlign::~IntersperseAlign()

Destructor.

void egglib::IntersperseAlign::get_positions(const Coalesce &coalesce)

Automatically gets the positions of sites of the original alignment.

A DataHolder reference must have been loaded using load(), and this DataHolder object must be the last one simulated using the Coalesce object whose reference is passed. This method will load the positions of all sites as provided by Coalesce.

void egglib::IntersperseAlign::intersperse(bool round_positions)

Insert non-varying sites.

This method modifies the DataHolder instance that has been loaded. It is required to have loaded one, and to have specified the position of each of its sites. It is also logical (but not formally required) to have specified the desired length of the alignment. It is possible to specify more than one allele for inserted positions. It is required to have passed a random number generator.

After a call to this method, the loaded DataHolder instance will have a length equal to the value specified using set_length(), unless the original DataHolder was longer (in that case, it is not changed at all).

Parameters
  • round_positions -

    a boolean indicating if site positions must be rounded. Set it to false if already rounded positions have been provided.

void egglib::IntersperseAlign::load(DataHolder &data)

Loads a data set.

The loaded data set may contain any number of sites (even zero).

This method does not reset the random number generator, the final alignment length, the position of sites (unless the new alignment has a different number of sites compared with the previous one) or the number and value of non-varying alleles.

void egglib::IntersperseAlign::set_allele(unsigned int index, int allele)

Sets an allele for inserted positions.

The number of alleles must have been fixed using set_num_alleles(). The default value for the first allele is A.

void egglib::IntersperseAlign::set_length(unsigned int length)

Specifies the desired length of the final alignment.

If this method is skipped, the default value is 0 (the final alignment is identical to the original one), or the previously specified value (if set_length() has been called previously).

void egglib::IntersperseAlign::set_num_alleles(unsigned int num)

Specifies the number of possible alleles at inserted positions.

This method may be called at any time. The value must be at least one. If more than one, alleles at inserted positions will be picked randomly (they will be fixed among samples). All alleles must be specified using set_allele(). The default value is one and the default first allele is A. It is possible not to specify the first allele even if the number of alleles is increased (the A value will be retained).

void egglib::IntersperseAlign::set_position(unsigned int index, double position)

Sets the position of one of the sites of the original alignment.

A DataHolder reference must have been loaded using load(). This method specifies the position of one of the sites of the passed DataHolder instance. Note that the position of all sites must be specified, that positions must always be increasing (consecutive positions might be equal), and all positions must be at least 0 and at most 1.

void egglib::IntersperseAlign::set_random(Random *random)

Provides a random number generator.

It is always required to provide a random number generator, even if the number of alleles set with set_num_alleles() is one.

void egglib::IntersperseAlign::set_round_position(unsigned int index, unsigned int position)

Sets the position of one of the sites of the original alignment.

A DataHolder reference must have been loaded using load(). This method specifies the position of one of the sites of the passed DataHolder instance. Note that the position of all sites must be specified, that positions must always be increasing (consecutive positions might be equal), and all positions must be at least 0 and at most ls-1 where ls is the length of the final alignment.

If you use this method, you must use it for all sites and then set the argument of intersperse() to false.

VectorInt

class

Minimal reimplementation of a vector<int>

Header: <egglib-cpp/DataHolder.hpp>

Public Functions

egglib::VectorInt::VectorInt()

Constructor (default: 0 values)

egglib::VectorInt::VectorInt(const VectorInt &src)

Copy constructor.

virtual egglib::VectorInt::~VectorInt()

Destructor.

void egglib::VectorInt::clear()

Release memory.

int egglib::VectorInt::get_item(unsigned int i) const

Get a value.

unsigned int egglib::VectorInt::get_num_values() const

Get the number of values.

VectorInt &egglib::VectorInt::operator=(const VectorInt &src)

Copy assignment operator.

void egglib::VectorInt::set_item(unsigned int i, int value)

Set a value.

void egglib::VectorInt::set_num_values(unsigned int n)

Set the number of values (values are not initialized)

Exceptions

EggException

class

Base exception type for errors occurring in this library.

Header: <egglib-cpp/egglib.hpp>

Public Functions

egglib::EggException::EggException()

Constructor with empty error message.

egglib::EggException::EggException(const char *message)

Creates the exception.

egglib::EggException::~EggException()

Destructor.

virtual const char *egglib::EggException::what() const

Gets error message.

EggArgumentValueError

class

Exception type for argument value errors.

Header: <egglib-cpp/egglib.hpp>

Public Functions

egglib::EggArgumentValueError::EggArgumentValueError(const char *m)

Creates the exception.

egglib::EggArgumentValueError::~EggArgumentValueError()

Destructor.

EggFormatError

class

Exception type for file/string parsing errors.

Header: <egglib-cpp/egglib.hpp>

Public Functions

egglib::EggFormatError::EggFormatError(const char *fileName, unsigned int line, const char *expectedFormat, const char *m, char c, const char *paste_end)

Creates the exception.

egglib::EggFormatError::~EggFormatError()

Destructor.

char egglib::EggFormatError::character()

Get character.

const char *egglib::EggFormatError::info()

Get additional information field.

unsigned int egglib::EggFormatError::line()

Get line number.

const char *egglib::EggFormatError::m()

Get bare error message (before formatting)

EggInvalidAlleleError

class

Exception type for invalid allele.

Header: <egglib-cpp/egglib.hpp>

Public Functions

egglib::EggInvalidAlleleError::EggInvalidAlleleError(int c, unsigned int seqIndex, unsigned int posIndex)

Creates the exception.

egglib::EggInvalidAlleleError::~EggInvalidAlleleError()

Destructor.

EggMemoryError

class

Exception type for memory errors.

There is a macro EGGMEM which stands for EggMemoryError(__LINE__, __FILE__).

Header: <egglib-cpp/egglib.hpp>

Public Functions

egglib::EggMemoryError::EggMemoryError(unsigned int line, const char *file)

Creates the exception.

egglib::EggMemoryError::~EggMemoryError()

Destructor.
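The pattern of a macro that captures the location of a failure can be sketched without EggLib as follows. The class `MemoryErrorSketch` and the macro `SKETCH_MEM` are stand-ins invented for this illustration; they only mimic the (line, file) constructor shown above and are not the library's definitions.

```cpp
#include <cassert>
#include <cstring>

// Stand-in exception type storing where the failure was detected.
class MemoryErrorSketch {
  public:
    MemoryErrorSketch(unsigned int line, const char* file)
        : line_(line), file_(file) {}
    unsigned int line() const { return line_; }
    const char* file() const { return file_; }
  private:
    unsigned int line_;
    const char* file_;
};

// The preprocessor replaces __LINE__ and __FILE__ at the point of use, so
// the exception records the location of the throw statement itself.
#define SKETCH_MEM MemoryErrorSketch(__LINE__, __FILE__)

// Throws and catches the stand-in exception; returns the recorded line.
unsigned int failing_line() {
    try {
        throw SKETCH_MEM;
    } catch (const MemoryErrorSketch& e) {
        assert(std::strlen(e.file()) > 0);
        return e.line();
    }
    return 0;  // not reached
}
```

Using a macro rather than a function is what allows the line and file of the caller to be captured without passing them explicitly.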

EggOpenFileError

class

Exception type for errors while opening a file.

Header: <egglib-cpp/egglib.hpp>

Public Functions

egglib::EggOpenFileError::EggOpenFileError(const char *fileName)

Creates the exception.

egglib::EggOpenFileError::~EggOpenFileError()

Destructor.

EggPloidyError

class

Exception type for inconsistent ploidy over individuals.

Header: <egglib-cpp/egglib.hpp>

Public Functions

egglib::EggPloidyError::EggPloidyError()

Creates the exception.

egglib::EggPloidyError::~EggPloidyError()

Destructor.

EggRuntimeError

class

Exception type for runtime errors.

The definition of a runtime error is rather broad: it includes bugs as well as logical errors.

Header: <egglib-cpp/egglib.hpp>

Public Functions

egglib::EggRuntimeError::EggRuntimeError(const char *m)

Creates the exception.

egglib::EggRuntimeError::~EggRuntimeError()

Destructor.

EggUnalignedError

class

Exception type for unaligned sequences.

Header: <egglib-cpp/egglib.hpp>

Public Functions

egglib::EggUnalignedError::EggUnalignedError()

Creates the exception.

egglib::EggUnalignedError::~EggUnalignedError()

Destructor.