Data file formats

Exporting classes

Base class

class egglib::BaseFormatter

Abstract Base class for exporters.

Public Functions

egglib::BaseFormatter::BaseFormatter()

Constructor.

Sets to console output.

virtual egglib::BaseFormatter::~BaseFormatter()

Destructor.

void egglib::BaseFormatter::close()

Close the output file.

If there is no open file, nothing is done. Otherwise the current output file is closed and any subsequent output is directed to the standard output. It is not necessary to call this method between two consecutive calls to open_file() (in order to change the output file). The file is also closed properly when this object is destroyed.

void egglib::BaseFormatter::flush()

Flush the current output.

std::string egglib::BaseFormatter::get_str()

Access the internal string buffer.

Returns the current string stored in the output string buffer. This does not reset the output buffer (one must call to_str() again to initialize a new output string).

bool egglib::BaseFormatter::open_file(const char *fname)

Open a file for writing.

All subsequent formatting operations will be done using this file as output, until close() is called or open_file() is called again (to create a new file). By default (if open_file() is never called), output goes to the standard output.

If an output file was already open, it is closed prior to opening the new one (it is not necessary to call close()).

Return
A boolean indicating whether the file has been opened successfully.

void egglib::BaseFormatter::to_cout()

Restore the default (export to standard output)

This closes the file if one is open.

void egglib::BaseFormatter::to_str()

Write to a string buffer.

All subsequent formatting operations will be done into an internally stored string buffer, which can be accessed using get_str(). This closes the file if one is open, and clears the string buffer if one was already in use.

void egglib::BaseFormatter::write(const char *bit, bool eol)

Write a line as is.

A newline is automatically appended if the argument eol is true. It is legal to pass an empty string.
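
The switching rules between standard output, file and string buffer described above can be modelled with a small stand-alone class. The sketch below does not use EggLib: SketchFormatter is a hypothetical stand-in that mimics only the documented behaviour of open_file(), close(), to_cout(), to_str(), get_str() and write(), under the assumption that the string buffer is emptied by each call to to_str().

```cpp
#include <fstream>
#include <iostream>
#include <sstream>
#include <string>

// Hypothetical stand-in mimicking the documented output switching.
// This is NOT the EggLib implementation.
class SketchFormatter {
public:
    // Open a file for writing; a file that is already open is closed first.
    bool open_file(const char* fname) {
        close();
        file_.clear();
        file_.open(fname);
        if (!file_.is_open()) return false;
        target_ = &file_;
        return true;
    }
    // Close the file if one is open, then fall back to standard output.
    // Nothing is done if no file is open.
    void close() {
        if (file_.is_open()) {
            file_.close();
            target_ = &std::cout;
        }
    }
    // Restore the default target (standard output).
    void to_cout() {
        close();
        target_ = &std::cout;
    }
    // Redirect output to a fresh string buffer, closing any open file.
    void to_str() {
        close();
        buffer_.str("");
        buffer_.clear();
        target_ = &buffer_;
    }
    // Return the buffer contents without resetting them.
    std::string get_str() const { return buffer_.str(); }
    // Write the text as is, followed by a newline if eol is true.
    void write(const char* bit, bool eol) {
        *target_ << bit;
        if (eol) *target_ << '\n';
    }
private:
    std::ofstream file_;
    std::ostringstream buffer_;
    std::ostream* target_ = &std::cout;
};
```

With this model, calling to_str(), then write("abc", true) and write("de", false), leaves "abc\nde" in the buffer. get_str() can be called repeatedly without changing it, and a second call to to_str() starts from an empty buffer.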

ms & newick

class egglib::Export

Holder class for ms-type and newick tree formatting methods.

This class cannot be built. Only its methods can be called.

Header: <egglib-cpp/Export.hpp>

Public Functions

egglib::Export::Export()

Constructor.

egglib::Export::~Export()

Destructor.

void egglib::Export::ms(const DataHolder &data, bool spacer, bool outgroup)

Export data as ms format

The number of sites must have been set previously (to the correct value) using either ms_num_positions() (with all positions then specified with ms_position()) or ms_auto_positions(). The format is as follows: one line with two slashes; one line with the number of sites; one line with the positions, or an empty line if the number of sites is zero; then the matrix of genotypes (one line per sample), only if the number of sites is larger than zero.

Parameters
  • data -

    the data set to export.

  • spacer -

    if true, insert a space between each genotype value.

  • outgroup -

    if true, outgroup data is exported after the main group.
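
The layout described for ms() can be illustrated with a stand-alone function. This is only a sketch under stated assumptions: format_ms_block() is a hypothetical helper, not part of EggLib; the "segsites:" and "positions:" line prefixes follow the usual output of the ms program and are not confirmed by the text above; and the numeric precision of positions is left to the stream defaults.

```cpp
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Format one ms-style block (hypothetical helper, not the EggLib code).
// `positions` holds one value per site; `genotypes` holds one row per
// sample, each row containing positions.size() values.
std::string format_ms_block(const std::vector<double>& positions,
                            const std::vector<std::vector<int>>& genotypes,
                            bool spacer) {
    std::ostringstream out;
    out << "//\n";                                   // two slashes
    out << "segsites: " << positions.size() << '\n'; // number of sites
    if (positions.empty()) {
        out << '\n';       // empty line in place of the positions
        return out.str();  // no genotype matrix when there are no sites
    }
    out << "positions:";
    for (double p : positions) out << ' ' << p;
    out << '\n';
    for (const auto& row : genotypes) {              // one line per sample
        for (std::size_t i = 0; i < row.size(); ++i) {
            if (spacer && i > 0) out << ' ';
            out << row[i];
        }
        out << '\n';
    }
    return out.str();
}
```

For two sites at 0.25 and 0.75 and two samples, the function produces the header lines followed by "01" and "11" (or "0 1" and "1 1" when spacer is true).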

void egglib::Export::ms_auto_positions(unsigned int n)

Assign default positions to all sites.

This method should be called before calling ms() in order to specify the number of sites, if the positions are not defined or are irrelevant. The argument must be the number of sites and must match the number of sites in the DataHolder that will be passed to ms(). The positions will be evenly spread between 0 and 1.
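
One plausible reading of "evenly spread between 0 and 1" is sketched below. The helper even_positions() is hypothetical, and both the i/(n-1) spacing and the choice of 0.5 for a single site are assumptions rather than documented behaviour.

```cpp
#include <vector>

// Evenly spaced positions on [0, 1] (hypothetical helper).
// Assumptions: the first site is at 0, the last at 1, and a single
// site is placed at 0.5.
std::vector<double> even_positions(unsigned int n) {
    std::vector<double> pos(n);
    if (n == 1) {
        pos[0] = 0.5;
        return pos;
    }
    for (unsigned int i = 0; i < n; ++i) {
        pos[i] = static_cast<double>(i) / (n - 1);
    }
    return pos;
}
```

Under these assumptions, three sites are placed at 0, 0.5 and 1.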

void egglib::Export::ms_num_positions(unsigned int n)

Specifies the number of positions for ms exporting.

This method should be called before calling ms() in order to specify the number of sites. This value must match the number of sites in the DataHolder that will be passed to ms(), and all positions must be specified with ms_position() after the call to this method.

void egglib::Export::ms_position(unsigned int site, double position)

Specifies the position of a site for ms exporting.

This method should be called before calling ms() in order to specify the position of each site. The number of sites must have been specified previously using ms_num_positions(). The position must be >=0 and <=1.

void egglib::Export::newick(const Tree &tree, bool blen, bool eol)

Write a newick-formatted tree.

Write the tree as a single line, followed by a newline character if eol is true. It is required that leaves are properly and hierarchically connected up to the root (non-network structure).

Parameters
  • tree -

    a completed genealogical tree with labelled leaves.

  • blen -

    whether to export the value of branch lengths.

  • eol -

    whether to print a newline character after the tree.
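
The effect of the blen and eol arguments can be shown with a minimal recursive writer. The sketch is independent of EggLib: SketchNode is a hypothetical tree type (not egglib::Tree), and branch lengths are printed with the default stream formatting, which may differ from what the library emits.

```cpp
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical tree node used only for this sketch (not egglib::Tree).
struct SketchNode {
    std::string label;                 // used for leaves only
    double branch_length;              // length of the branch above the node
    std::vector<SketchNode> children;  // empty for a leaf
};

// Write one node and, recursively, its descendants.
static void write_node(const SketchNode& node, bool blen, bool is_root,
                       std::ostringstream& out) {
    if (node.children.empty()) {
        out << node.label;
    } else {
        out << '(';
        for (std::size_t i = 0; i < node.children.size(); ++i) {
            if (i > 0) out << ',';
            write_node(node.children[i], blen, false, out);
        }
        out << ')';
    }
    // The root has no branch above it, so no length is written for it.
    if (blen && !is_root) out << ':' << node.branch_length;
}

// Format a whole tree on a single line, with an optional trailing newline.
std::string to_newick(const SketchNode& root, bool blen, bool eol) {
    std::ostringstream out;
    write_node(root, blen, true, out);
    out << ';';
    if (eol) out << '\n';
    return out.str();
}
```

A tree with leaves A and B grouped under an internal node, plus a third leaf C, gives "((A:1,B:2):0.5,C:3);" with branch lengths and "((A,B),C);" without.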

Fasta

class egglib::FastaFormatter

Fasta formatter.

Write genetic data to a file, a string or standard output using the fasta format (formally described in FastaParser). It is required that all exported allele values are exportable as characters (in particular, negative values are never allowed). See the methods documentation for more details, in particular set_mapping() to understand how to map alleles to user-specified characters (such as mapping 0, 1, 2, 3 to A, C, G, T, for example).

Header: <egglib-cpp/Fasta.hpp>

Public Functions

egglib::FastaFormatter::FastaFormatter()

Constructor.

Parametrization of the instance can be performed using the setter methods (their names start with set_). The default values of the arguments of these methods represent the default values of the options. By default, output is sent to the standard output. A (new) output file can be created at any time using open_file().

Concerning the arguments first and last, please note that no sequences will be exported if last < first, or if first is larger than the last index. If last is larger than the last index, all remaining sequences are exported and no error is caused. The default values of first and last ensure that all sequences are exported.

virtual egglib::FastaFormatter::~FastaFormatter()

Destructor.

Destroys the object. If an output file is currently open, it is closed at this point.

void egglib::FastaFormatter::defaults()

Sets all parameters to defaults.

void egglib::FastaFormatter::set_first(unsigned int first)

Sets index of the first sample to export.

void egglib::FastaFormatter::set_labels(bool labels)

Specifies whether the group labels should be exported.

void egglib::FastaFormatter::set_last(unsigned int last)

Sets index of the last sample to export.

void egglib::FastaFormatter::set_linelength(unsigned int linelength)

Sets the length of sequence line.

If zero, the whole sequence is written on a single line.
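
The line length rule can be sketched as a small wrapping function. wrap_sequence() is a hypothetical helper written for illustration only; it assumes that every output line, including the last one, ends with a newline.

```cpp
#include <cstddef>
#include <string>

// Break a sequence into lines of at most `linelength` characters
// (hypothetical helper). A value of 0 keeps the sequence on one line.
std::string wrap_sequence(const std::string& seq, unsigned int linelength) {
    if (linelength == 0 || seq.empty()) return seq + "\n";
    std::string out;
    for (std::size_t i = 0; i < seq.size(); i += linelength) {
        out += seq.substr(i, linelength);  // substr clips at the end
        out += '\n';
    }
    return out;
}
```

For example, a 10-character sequence with a line length of 4 is written as two lines of 4 characters and one of 2.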

void egglib::FastaFormatter::set_mapping(const char *mapping)

Specifies whether a character mapping should be used.

Use the specified list of characters to map integer allelic values. If a non-empty string is provided, the length of the string must be larger than the largest possible allele value. In that case, the allele values are used as indexes to determine which character from this string is written. If this method is called with an empty string, the mapping is not used and the allele values are cast directly to characters.
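
The mapping rule amounts to an index lookup, as in the following sketch. allele_to_char() is a hypothetical helper; throwing std::out_of_range for invalid alleles is an assumption made here for safety and does not describe how the library reports such errors.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Convert an integer allele to the character to write (hypothetical helper).
// With a non-empty mapping, the allele is used as an index into it;
// with an empty mapping, the allele is cast directly to a character.
char allele_to_char(int allele, const std::string& mapping) {
    if (allele < 0) {
        throw std::out_of_range("negative alleles cannot be exported");
    }
    if (mapping.empty()) {
        return static_cast<char>(allele);
    }
    if (static_cast<std::size_t>(allele) >= mapping.size()) {
        throw std::out_of_range("allele value exceeds mapping length");
    }
    return mapping[static_cast<std::size_t>(allele)];
}
```

With the mapping "ACGT", allele 2 is written as G. With an empty mapping, allele 65 is written as the character with code 65 (A in ASCII).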

void egglib::FastaFormatter::set_outgroup(bool outgroup)

Specifies whether the outgroup should be exported.

void egglib::FastaFormatter::set_shift_groups(bool shift_groups)

Specifies whether group labels should be shifted.

If true, all group labels are shifted by one unit (positively) when exporting.

void egglib::FastaFormatter::write(const DataHolder &src)

Write fasta-formatted data.

The parameters specified with the setter methods (or the defaults) apply. If an output file has been opened with open_file(), data are written into this file. Otherwise, data are written to the standard output.

std::string egglib::FastaFormatter::write_string(const DataHolder &src)

Write fasta-formatted data to string.

As write(), but generates and returns a string. If there is an open file, it is not touched.

Fasta importing utilities

Fasta parsing class

class egglib::FastaParser

Sequence-by-sequence Fasta parser.

Read fasta-formatted sequence data from a file specified by name or from an open stream. See the description of the format below.

  • Each sequence is preceded by a header limited to a single line and starting with a “>” character.
  • The header length is not limited and all characters are allowed, but white spaces and special characters are discouraged. The header is terminated by a newline character.
  • Group labels are specified by a special markup system placed at the end of the header line. The labels are specified by an at sign (“@”) followed by any integer value (“@0”, “@1”, “@2” and so on). It is allowed to define several group labels for any sequence. In that case, integer values must be entered consecutively after the at sign, separated by commas, as in “@1,3,2” for a sequence belonging to groups 1, 3 and 2 in three different grouping levels. Multiple grouping levels can be used to specify a hierarchical structure, but not only (independent grouping structures can be freely specified).
  • The markup “@#” (at sign and hash sign) specifies an outgroup sequence. The hash sign may be followed by a single integer to specify a unique group label. Multiple grouping levels are not allowed for the outgroup. The group labels of the ingroup and the outgroup are independent, so the same labels may be used.
  • The at sign can be preceded by a single space. In that case, the parser automatically discards one space before the at sign (both “>name@1” and “>name @1” are read as “name”), but if there is more than one space, additional spaces are considered to be part of the name.
  • By default, no grouping structure is assumed and all sequences are assumed to be part of the ingroup.
  • Group labels are ignored unless explicitly requested in a parser’s options.
  • The sequence itself continues on the following lines until the next “>” character or the end of the file.
  • White spaces, tabs and carriage returns are allowed at any position. They are ignored except for terminating the header line. There is no limitation in length and different sequences can have different lengths.
  • Character case is preserved on import.
  • Note that, when groups is true and sequences are placed in a DataHolder instance, their position in the original fasta file is lost. Exporting to fasta will automatically place them at the end of the file.
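
The header rules listed above can be illustrated with a stand-alone parser for a single header line. This is a sketch, not the FastaParser implementation: SketchHeader and parse_header() are hypothetical names, and the handling of malformed tags (the whole text is kept as the name) is an assumption.

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Result of parsing one header line (hypothetical type).
struct SketchHeader {
    std::string name;
    std::vector<unsigned int> groups;  // ingroup labels, one per level
    bool outgroup = false;
    unsigned int outgroup_label = 0;   // stays 0 when the tag is just "@#"
};

// Parse a non-empty run of digits; return false for anything else.
static bool parse_uint(const std::string& s, unsigned int& value) {
    if (s.empty()) return false;
    value = 0;
    for (char c : s) {
        if (!std::isdigit(static_cast<unsigned char>(c))) return false;
        value = value * 10 + static_cast<unsigned int>(c - '0');
    }
    return true;
}

SketchHeader parse_header(const std::string& line) {
    SketchHeader h;
    const std::string text =
        (!line.empty() && line[0] == '>') ? line.substr(1) : line;
    h.name = text;  // assumption: kept as is if no valid tag is found
    const std::size_t at = text.rfind('@');
    if (at == std::string::npos) return h;
    const std::string tag = text.substr(at + 1);
    std::vector<unsigned int> labels;
    bool out = false;
    unsigned int out_label = 0;
    if (!tag.empty() && tag[0] == '#') {
        // Outgroup: "@#" optionally followed by a single integer.
        out = true;
        if (tag.size() > 1 && !parse_uint(tag.substr(1), out_label)) return h;
    } else {
        // Ingroup: integers separated by commas, one per grouping level.
        std::size_t start = 0;
        while (true) {
            const std::size_t comma = tag.find(',', start);
            const std::size_t len =
                (comma == std::string::npos) ? std::string::npos : comma - start;
            unsigned int v = 0;
            if (!parse_uint(tag.substr(start, len), v)) return h;
            labels.push_back(v);
            if (comma == std::string::npos) break;
            start = comma + 1;
        }
    }
    std::size_t end = at;
    if (end > 0 && text[end - 1] == ' ') --end;  // discard a single space
    h.name = text.substr(0, end);
    h.groups = labels;
    h.outgroup = out;
    h.outgroup_label = out_label;
    return h;
}
```

Under these assumptions, “>name@1” and “>name @1” both give the name “name” with one group label, “>name @1,3,2” gives three labels, and “>out@#4” is flagged as an outgroup sequence with label 4.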

Header: <egglib-cpp/Fasta.hpp>

Public Functions

egglib::FastaParser::FastaParser()

Constructor.

The constructor does not generate an object ready for use. A call to one of the open or set methods is needed before starting to parse data.

egglib::FastaParser::~FastaParser()

Destructor.

char egglib::FastaParser::ch(unsigned int index)
const

Get a character of the last read sequence.

Get the value of a specified index of the last sequence read by the read() method. The index must be valid.

void egglib::FastaParser::clear()

Clears the memory of the instance.

Frees the memory held by the instance. This is useful after a large sequence has been read, in order to actually release the memory.

void egglib::FastaParser::close()

Close the opened file.

This method closes the file that was opened using the open_file() method. If the file was opened using the open_file() method of the same instance, it is actually closed. If the file was passed as a stream using set_stream(), it is forgotten but not closed. If no stream is present, this method does nothing.

bool egglib::FastaParser::good()
const

Check if the instance is good for reading.

Return true if an open stream is available and if the last reading operation (or by default opening) found that the next character is a ‘>’.

unsigned int egglib::FastaParser::group(unsigned int index)
const

Get a group label of the last read sequence.

Get one of the group labels specified for the last sequence read by the read() method. The index must be valid.

unsigned int egglib::FastaParser::group_o()
const

Get the outgroup’s label.

Undefined if outgroup() returns false. The default value (in case the label was “@#”) is 0.

unsigned int egglib::FastaParser::ls()
const

Get the length of the last read sequence.

Return the length of the last sequence read by the read() method. By default, the value is 0.

const char *egglib::FastaParser::name()
const

Get the last read name.

Return a c-string containing the name of the last sequence read by the read() method. By default, an empty string is returned.

unsigned int egglib::FastaParser::ngroups()
const

Get the number of group labels specified for the last read sequence.

Return the number of group labels specified for the last sequence read by the read() method. By default, the value is 0.

void egglib::FastaParser::open_file(const char *fname)

Open a file for reading.

This method attempts to open the specified file and to read a single character. If the file cannot be opened, an EggOpenFileError exception is thrown; if the read character is not ‘>’, an EggFormatError exception is thrown; if the file is empty, no exception is thrown.

In case the instance was already processing a stream, it will be dismissed. The stream created by this method will be closed if another stream is created or set (call to open_file() or set_stream() methods), if the close() method is called or upon object destruction.

Parameters
  • fname -

    name of the fasta-formatted file to open.

bool egglib::FastaParser::outgroup()
const

Check if the last read sequence is part of the outgroup.

Return true if the last sequence read by the read() method is labelled as outgroup. If this method returns true, it is necessary that ngroups() returns 0.

void egglib::FastaParser::read_all(bool groups, DataHolder &dest)

Read multiple sequences into a DataHolder.

This method calls read() repeatedly, passing the DataHolder reference, which is filled incrementally until the end of the fasta stream is reached. If the DataHolder instance already contains sequences, new sequences are appended at the end. Warning: dest must absolutely be a non-matrix.

Parameters
  • groups -

    if false, any group labels found in sequence headers will be ignored.

  • dest -

    destination where to place read data.

void egglib::FastaParser::read_sequence(bool groups, DataHolder *dest)

Read a single sequence.

If the argument dest is NULL (default):

Read a sequence from the stream and load it in the object memory. Read data can be accessed using the name(), ch() and group() methods (plus outgroup() and group_o() for an outgroup sequence). Note that memory allocated for storing data is retained after subsequent calls to read() (but not the data themselves). This means that subsequent sequences will be read faster. It also means that, after reading a long sequence, memory will be used until destruction of the object or a call to the clear() method. Note that read data will be lost as soon as the current stream is dismissed (using the close() method), or a new stream is opened or set, or clear() is called, or read() is called again with a NULL dest argument, but not if read() is called with a non-NULL dest argument.

If the argument dest is not NULL:

Read a sequence from the stream and load it into the passed DataHolder instance. This will result in the addition of one sequence to the DataHolder. If the argument groups is true, the destination DataHolder might be modified. New group levels will be added as needed to accommodate the label(s) found in the sequence. In case the destination instance already contained samples, they will be assumed to belong to group 0 for all levels where they were not specified. Warning: dest must absolutely be a non-matrix.

In either case:

If no data can be read (no open stream, stream closed or reached end of file), an EggRuntimeError exception will be thrown.

Return
The number of read characters in sequence.
Parameters
  • groups -

    if false, any group labels found in sequence headers will be ignored.

  • dest -

    if not NULL, destination where to place read data (otherwise, data are stored within the current instance).

void egglib::FastaParser::reserve(unsigned int ln, unsigned int ls, unsigned int ng, unsigned int lf)

Reserve memory to speed up data loading.

This method does not change the size of the data set contained in the instance, but reserves memory in order to speed up subsequent loading of data. The passed values are not required to be accurate. In case the instance has allocated more memory than what is requested, nothing is done (this applies to all parameters independently). It is always valid to use 0 for any value (in that case, no memory is preallocated for the corresponding array, and memory will be allocated when needed). Note that one character is always preallocated for all names.

Parameters
  • ln -

    expected length of name.

  • ls -

    expected length of sequence.

  • ng -

    expected number of groups.

  • lf -

    expected length of file name.

void egglib::FastaParser::set_stream(std::istream &stream)

Pass an open stream for reading.

This method sets the passed stream (which is supposed to have been opened for reading) and attempts to read a single character. If the stream is not open or if data cannot be read from it, an EggArgumentValueError (and not EggOpenFileError) exception is thrown; if the read character is not ‘>’, an EggFormatError exception is thrown; if no data is found, no exception is thrown.

In case the instance was already processing a stream, it will be dismissed. The stream passed to this method will not be closed by the class, even when calling close().

Parameters
  • stream -

    open stream to read fasta-formatted sequences from.

void egglib::FastaParser::set_string(const char *str)

Pass a string for reading.

This method opens a reading stream initialized on the passed string and attempts to read a single character. If data cannot be read, an EggArgumentValueError (and not EggOpenFileError) exception is thrown; if the read character is not ‘>’, an EggFormatError exception is thrown; if no data is found, no exception is thrown.

In case the instance was already processing a stream, it will be dismissed.

Parameters
  • str -

    a string to be read.

Helper functions

void egglib::read_fasta_file(const char *fname, bool groups, DataHolder &dest)

Multi-sequence fasta parser (from file)

Read fasta-formatted sequence data from a file specified by name. For format specification, see the documentation of the class FastaParser, which is used behind the scenes.

Note that, for optimal performance, the read_multi() method of FastaParser requires only one FastaParser instance (the best is to re-use a single DataHolder instance to take advantage of memory caching).

Header: <egglib-cpp/Fasta.hpp>

Parameters
  • fname -

    name of the fasta-formatted file to read.

  • groups -

    boolean specifying whether group labels should be imported or ignored. If true, group labels are stripped from names and missing labels are replaced by 0.

  • dest -

    reference to the instance where to place sequences. If the object already contains sequences, new sequences will be appended to it. In any case, the destination object must always be a non-matrix.

void egglib::read_fasta_string(const std::string str, bool groups, DataHolder &dest)

Multi-sequence fasta parser (from string)

Read fasta-formatted sequence data from a raw string. For format specification, see the documentation of the class FastaParser, which is used behind the scenes.

Header: <egglib-cpp/Fasta.hpp>

Parameters
  • str -

    string containing fasta-formatted sequences.

  • groups -

    boolean specifying whether group labels should be imported or ignored. If true, group labels are stripped from names and missing labels are replaced by 0.

  • dest -

    reference to the instance where to place sequences. If the object already contains sequences, new sequences will be appended to it. In any case, the destination object must always be a non-matrix.

VCF importing class

VCF class

class egglib::VcfParser

Line-by-line variant call format (VCF) parser.

Read VCF-formatted variation data from a file specified by name or from an open stream.

This parser supports the 4.1 specification of the variant call format as described at this address:

http://www.1000genomes.org/wiki/Analysis/Variant%20Call%20Format/vcf-variant-call-format-version-41

Upon opening a file (or passing an open stream), VcfParser objects automatically read all meta-information up to and including the header, and stop before reading the first data item. To open a file use the open_file() method, and to pass an open stream to VCF data use set_stream() (take care in the latter case that the file must be open and that the first line to be read must be the “fileformat” specification line). Both methods import information from the header. It is possible to pass a string containing the header only with read_header(), but then it will be necessary to pass all lines separately with read_line().

After reading the header, several items of information are made available through methods: the file format (a string identifying the format used to encode data), which is required, and optional meta-information fields, which can be multiple and are identified by their ID: INFO (specifying information fields relative to each variable position), FORMAT (specifying information fields relative to each sample for a given variable position), FILTER (identifying criteria used to filter variable positions) and ALT (identifying pre-defined alternate alleles). The fileformat string is available through the file_format() method. For INFO specifications, num_info() gives the number of INFO specifications and info() gives access to a given index (equivalent methods exist for FORMAT, FILTER and ALT). The accessors return dedicated classes. It is also possible to use find_info() and equivalent methods, which look up a specification by its ID.

Before parsing the header, the object loads a number of pre-defined INFO, FORMAT and ALT specifications as defined in the VCF 4.1 format definition. If the file contains a specification matching an existing one (either from the pre-defined specifications or from the file itself), it overwrites it. In addition, it is possible to use add_info() and similar methods to add user-specified definitions. Beware, then, that specifications loaded from the file might overwrite them. All specifications are reset upon opening or setting a new file.

Meta-information lines that do not use the “fileformat”, “INFO”, “FORMAT”, “FILTER” or “ALT” keys fall into the default “meta” category and can be accessed using the num_meta(), meta() and find_meta() methods, and set or modified using the add_meta() method. Finally, the header line must define the number of samples (if any). The number of samples and their names are accessible using the methods num_samples() and sample().
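
To make the structure of meta-information lines concrete, the sketch below splits one “##” line into its key and either a plain value or a set of attributes (ID, Number, Type, Description). It is a simplified, stand-alone illustration: SketchMeta and parse_meta_line() are hypothetical names, escaped quotes inside descriptions are not handled, and no validation against the VCF 4.1 specification is attempted.

```cpp
#include <cstddef>
#include <map>
#include <string>

// One parsed meta-information line (hypothetical type).
struct SketchMeta {
    std::string key;                            // e.g. "INFO", "fileformat"
    std::string value;                          // plain value, if any
    std::map<std::string, std::string> attrs;   // attributes inside <...>
};

// Parse a "##key=value" or "##key=<a=b,c=d,...>" line.
// Returns false if the line is not a meta-information line.
bool parse_meta_line(const std::string& line, SketchMeta& meta) {
    if (line.compare(0, 2, "##") != 0) return false;
    const std::size_t eq = line.find('=');
    if (eq == std::string::npos) return false;
    meta = SketchMeta();
    meta.key = line.substr(2, eq - 2);
    const std::string rest = line.substr(eq + 1);
    if (rest.size() < 2 || rest.front() != '<' || rest.back() != '>') {
        meta.value = rest;  // unstructured line such as ##fileformat=...
        return true;
    }
    const std::string body = rest.substr(1, rest.size() - 2);
    std::string name, val;
    bool in_value = false;
    bool quoted = false;
    for (char c : body) {
        if (c == '"') { quoted = !quoted; continue; }  // quotes are dropped
        if (!quoted && c == ',') {                     // end of an attribute
            meta.attrs[name] = val;
            name.clear();
            val.clear();
            in_value = false;
            continue;
        }
        if (!quoted && !in_value && c == '=') { in_value = true; continue; }
        (in_value ? val : name).push_back(c);
    }
    if (!name.empty()) meta.attrs[name] = val;
    return true;
}
```

An INFO line is returned with the key “INFO” and four attributes, a “##fileformat=VCFv4.1” line with a plain value, and the “#CHROM” header line is rejected because it starts with a single hash sign.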

The method allow_X() switches support (by default, no support) for using X or x instead of a base in alternate alleles. This is not allowed in the VCF 4.1 specification, but some software does actually use it. If X is allowed and one is found, the alternate type will be set to vcf::X (not vcf::Default) and the corresponding allele string will be X (regardless of the original case).

The method allow_gap() switches support (by default, no support) for the gap symbol (-) as a valid base (allowing both reference and alternate alleles to either contain or be a gap symbol). This is not allowed in the VCF 4.1 specification and its use is discouraged.

Each call to read() or read_line() processes a single variant position. Each further call invalidates data stored from the last read operation. At any moment, the method bool() tells whether the underlying file stream is good for reading, but does not guarantee that the next read operation will succeed. If no data is left to read, the method read() returns false. Many formatting errors will be intercepted and result in an EggFormatError exception specifying the line number and as much information as possible. After reading a line, a number of information items are available. Note that allocated memory is never freed unless explicitly requested, which speeds up the processing of large files (this also applies when several files are processed in a row).

List of information available after reading a VCF line:

  • Chromosome (or other molecule) name: chromosome().
  • Position of the variant on the chromosome: position(). Note that the first position is 0, and that the first telomere is represented by the constant egglib::BEFORE.
  • The list of IDs defined for this variant can be accessed using num_ID() and ID(). The number of IDs can be 0, 1 or more.
  • Reference allele: reference(). The reference allele is represented by one or more bases A, C, G, T or N. The number of bases is directly accessible through len_reference().
  • Alternate alleles: num_alternate(), alternate_type() and alternate(). There must be at least one alternate allele. The alternate alleles can be represented by different types (see the documentation).
  • Variant quality score: quality().
  • The list of failed tests can be analyzed using num_failed_tests() and failed_test(). If all tests passed, the number of failed tests is 0. If no tests were performed, the number of failed tests is set to egglib::UNKNOWN (which is a very large positive value).
  • An arbitrary number (including none) of INFO fields can be available. These INFO are separated between Flag, Integer, Float, Character and String. The number of INFO items falling into each category and each item can be accessed using num_FlagInfo() and FlagInfo() methods, respectively (and equivalent for other types). The types used are FlagInfo, TypeInfo<int> (for Integer), TypeInfo<double> (for Float), TypeInfo<char> (for Character) and StringInfo for String. If a non-specified INFO field is used, its type is set to String.
  • If the INFO fields AN, AC, AF and AA are defined and their definitions conform to the standard definitions, their values are directly accessible using dedicated members. The booleans has_AN(), has_AC(), has_AF() and has_AA() allow testing whether data are available. The counters num_AC() and num_AF() return the number of entries (which must be equal to the number of alternate alleles). The values are accessible through AN(), AC(), AF() and AA().
  • If, and only if, more than 0 samples are defined, sample-specific description fields are available. The fields are described by IDs that normally are defined in the header or in pre-defined types (as FORMAT specifications). Undefined types are not allowed. The methods num_field() and field() allow exploring the IDs of the FORMAT fields used. As for INFO fields, they are sorted by type (except that there is no Flag type for FORMAT specifications). To get the index of a FORMAT field among fields of its type, use the overloaded methods field_rank(). The method sample_info() provides an object of the class SampleInfo that contains all FORMAT fields for a given sample (identified by its index). The type, and the type-specific index returned by field_rank(), are required to get the value corresponding to a given FORMAT specification.
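
Several items of the list above (0-based position, list of IDs, alternate alleles, failed tests) can be illustrated by splitting a raw VCF data line. The sketch below is stand-alone and simplified: SketchVariant, parse_variant(), SKETCH_UNKNOWN and SKETCH_BEFORE are hypothetical stand-ins (the last two for egglib::UNKNOWN and egglib::BEFORE, whose actual values are not given here), and the INFO and sample columns are not decoded.

```cpp
#include <limits>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical stand-ins for egglib::UNKNOWN and egglib::BEFORE.
const unsigned int SKETCH_UNKNOWN = std::numeric_limits<unsigned int>::max();
const unsigned long SKETCH_BEFORE = std::numeric_limits<unsigned long>::max();

// Fixed fields of one variant (hypothetical type).
struct SketchVariant {
    std::string chromosome;
    unsigned long position;                 // 0-based, or SKETCH_BEFORE
    std::vector<std::string> ids;           // 0, 1 or more IDs
    std::string reference;
    std::vector<std::string> alternates;
    std::string quality;                    // kept as text in this sketch
    std::vector<std::string> failed_tests;
    unsigned int num_failed_tests;          // SKETCH_UNKNOWN if not tested
};

static std::vector<std::string> split(const std::string& s, char sep) {
    std::vector<std::string> parts;
    std::string item;
    std::istringstream in(s);
    while (std::getline(in, item, sep)) parts.push_back(item);
    return parts;
}

// Parse the eight fixed, tab-separated columns of a data line.
bool parse_variant(const std::string& line, SketchVariant& v) {
    const std::vector<std::string> f = split(line, '\t');
    if (f.size() < 8) return false;
    v = SketchVariant();
    v.chromosome = f[0];
    const unsigned long pos = std::stoul(f[1]);  // 1-based in the file
    v.position = (pos == 0) ? SKETCH_BEFORE : pos - 1;
    if (f[2] != ".") v.ids = split(f[2], ';');
    v.reference = f[3];
    v.alternates = split(f[4], ',');
    v.quality = f[5];
    if (f[6] == ".") {
        v.num_failed_tests = SKETCH_UNKNOWN;     // no tests were performed
    } else {
        if (f[6] != "PASS") v.failed_tests = split(f[6], ';');
        v.num_failed_tests = static_cast<unsigned int>(v.failed_tests.size());
    }
    return true;
}
```

With this sketch, a line whose POS column is 14370 yields position 14369, a FILTER column of “PASS” yields zero failed tests, and a FILTER column of “.” yields SKETCH_UNKNOWN, which must be checked before looping over the failed tests.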

Header: <egglib-cpp/VCF.hpp>

Public Functions

egglib::VcfParser::VcfParser()

Constructor.

The constructor does not generate an object ready for use. A call to one of the open or set methods is needed before starting to parse data. The constructor automatically imports a set of pre-defined INFO and FORMAT specifications as specified by the format standard.

virtual egglib::VcfParser::~VcfParser()

Destructor.

unsigned int egglib::VcfParser::AA_index()
const

Get the index of the allele given as AA if defined.

If AA is not available, returns an undefined value. If AA is available but the ancestral allele is not determined (provided as a missing value), returns UNKNOWN. If AA is available but not one of the alleles for this site, returns the next valid index (num_alternate + 1).
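
The return values described above suggest the following lookup logic. This is a sketch only: aa_index() is a hypothetical free function, SKETCH_UNKNOWN stands in for egglib::UNKNOWN, and the numbering (0 for the reference allele, 1 to num_alternate for the alternate alleles) is an assumption inferred from the "num_alternate + 1" value. Comparison is exact and case-sensitive here, which may not match the library.

```cpp
#include <cstddef>
#include <limits>
#include <string>
#include <vector>

// Hypothetical stand-in for egglib::UNKNOWN.
const unsigned int SKETCH_UNKNOWN = std::numeric_limits<unsigned int>::max();

// Index of the ancestral allele among the alleles of a site
// (hypothetical helper; assumes 0 = reference, 1..n = alternates).
unsigned int aa_index(const std::string& aa,
                      const std::string& reference,
                      const std::vector<std::string>& alternates) {
    if (aa == ".") return SKETCH_UNKNOWN;  // missing value
    if (aa == reference) return 0;
    for (std::size_t i = 0; i < alternates.size(); ++i) {
        if (aa == alternates[i]) return static_cast<unsigned int>(i + 1);
    }
    // Not one of the alleles of this site: next valid index.
    return static_cast<unsigned int>(alternates.size() + 1);
}
```

For a site with reference G and alternates A and T, an ancestral allele of T maps to index 2, and an ancestral allele of C (absent from the site) maps to index 3.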

const char *egglib::VcfParser::AA_string()
const

Get the INFO field AA if available.

If AA is not available, returns an undefined value. If AA is available but the ancestral allele is not determined (provided as a missing value), returns “?”.

unsigned int egglib::VcfParser::AC(unsigned int i)
const

Get an AC value if defined.

If AC is not available, or if the index is over the value returned by num_AC(), this method might cause a crash.

void egglib::VcfParser::add_alt(const char *id, const char *descr)

Add an alternative allele entry.

If an alternative allele with this ID already exists, it will be overwritten.

void egglib::VcfParser::add_filter(const char *id, const char *descr)

Add a filter entry.

If a filter with this ID already exists, it will be overwritten.

void egglib::VcfParser::add_format(const char *id, unsigned int num, vcf::Info::Type type, const char *descr)

Add a format entry.

Flag is not permitted as a type.

If a format with this ID already exists, it will be overwritten.

void egglib::VcfParser::add_info(const char *id, unsigned int num, vcf::Info::Type type, const char *descr)

Add an info entry.

If an info entry with this ID already exists, it will be overwritten.

void egglib::VcfParser::add_meta(const char *key, const char *val)

Add a meta-information entry.

If a meta-information with this key already exists, it will be overwritten.

double egglib::VcfParser::AF(unsigned int i)
const

Get an AF value if defined.

If AF is not available, or if the index is over the value returned by num_AF(), this method might cause a crash.

void egglib::VcfParser::allow_gap(bool flag)

Switch support for - as a valid base in reference/alternate alleles.

This call will affect all subsequent reading operations but the default value will be restored at the next call to set_stream(), open_file(), read_header(), or reset(). The default is false.

void egglib::VcfParser::allow_X(bool flag)

Switch support for X as an alternate allele.

This call will affect all subsequent reading operations but the default value will be restored at the next call to set_stream(), open_file(), read_header(), or reset(). The default is false.

const char *egglib::VcfParser::alternate(unsigned int i)
const

Get one of the alternate alleles for the last read variant.

If alternate_type(i) is Default, the returned string is the allele itself. If alternate_type(i) is Referred, it is its ID, without angle brackets < > (it must be present in the meta-information). If alternate_type(i) is Breakend, then it is a breakend specification, reproduced as is.
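
The distinction between alternate types can be sketched as a simple classification of the raw ALT string. The code is a stand-alone illustration: SketchAltType, classify_alternate() and alternate_string() are hypothetical names, and the detection rules (angle brackets for referred alleles, square brackets for breakends) follow the VCF 4.1 notation in simplified form rather than the library's actual checks.

```cpp
#include <string>

// Hypothetical counterpart of the alternate allele types named above.
enum class SketchAltType { Default, Referred, Breakend, X };

// Classify a raw ALT string (simplified rules).
SketchAltType classify_alternate(const std::string& alt, bool allow_x) {
    if (alt.size() >= 2 && alt.front() == '<' && alt.back() == '>') {
        return SketchAltType::Referred;   // e.g. <DEL>
    }
    if (alt.find('[') != std::string::npos ||
        alt.find(']') != std::string::npos) {
        return SketchAltType::Breakend;   // e.g. G]17:198982]
    }
    if (allow_x && (alt == "X" || alt == "x")) {
        return SketchAltType::X;          // only if support is enabled
    }
    return SketchAltType::Default;        // plain allele
}

// String reported for an allele of the given type.
std::string alternate_string(const std::string& alt, SketchAltType type) {
    switch (type) {
        case SketchAltType::Referred:
            return alt.substr(1, alt.size() - 2);  // ID without < >
        case SketchAltType::X:
            return "X";                            // regardless of case
        default:
            return alt;                            // reproduced as is
    }
}
```

For instance, “<DEL>” is classified as Referred and reported as “DEL”, while a breakend specification is returned unchanged.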

vcf::AltType egglib::VcfParser::alternate_type(unsigned int i)
const

Get the type of one of the alternate alleles for the last read variant.

unsigned int egglib::VcfParser::AN()
const

Get the AN value if defined.

If AN is not available, returns an undefined value.

const vcf::TypeInfo<char> &egglib::VcfParser::CharacterInfo(unsigned int)
const

Get a Character-type INFO entry for the last read variant.

const char *egglib::VcfParser::chromosome()
const

Get the last read record chromosome.

void egglib::VcfParser::clear()

Clears the memory of the instance.

Frees the memory held by the instance. This method must not be used while reading a file.

void egglib::VcfParser::clear_stream()

The current value of the flags is overwritten.

const char *egglib::VcfParser::failed_test(unsigned int i)
const

Get the ID of one of the failed tests for the last read variant.

Warning, do not iterate over the value returned by num_failed_tests() without checking that it is not equal to egglib::UNKNOWN.

const vcf::Format &egglib::VcfParser::field(unsigned int i)
const

Get a sample FORMAT field specification for the last read variant.

Return the corresponding FORMAT specification as a reference to a vcf::Format instance.

unsigned int egglib::VcfParser::field_index(const char *ID)
const

Get the index of a sample FORMAT field specification by its ID for the last read variant.

This is equivalent to looping over the field(unsigned int) method until the get_ID() method of the returned value matches the specified ID. Returns egglib::UNKNOWN if the ID is not found.

unsigned int egglib::VcfParser::field_rank(unsigned int i)
const

Get the rank of a sample FORMAT field for the last read variant.

The index returned by this method gives the rank of the corresponding FORMAT field among fields of the same type. See sample_info(unsigned int) to understand why it is useful. This method is faster than using field_rank(const char *).

unsigned int egglib::VcfParser::field_rank(const char *ID)
const

Get the rank of a sample FORMAT field for the last read variant using its ID.

The index returned by this method gives the rank of the corresponding FORMAT field among fields of the same type. See sample_info(unsigned int) to understand why it is useful. The behaviour is not defined if the ID is not found. This method performs a search operation, and using field_rank(unsigned int) is faster.
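The notion of a rank "among fields of the same type" can be illustrated with a small standalone sketch. It does not use egglib: FieldType is a hypothetical stand-in for the type enum, and the rank is assumed to be 0-based.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical type tags (the library's own enum is egglib::vcf::Info::Type).
enum class FieldType { Integer, Float, Character, String };

// Rank of field `i` among the fields that share its type (assumed 0-based):
// the number of earlier fields declared with the same type.
unsigned int type_rank(const std::vector<FieldType> &types, std::size_t i) {
    unsigned int rank = 0;
    for (std::size_t k = 0; k < i; ++k) {
        if (types[k] == types[i]) ++rank;
    }
    return rank;
}
```

For a FORMAT column declaring, in order, a String, an Integer, a Float and a second Integer field, the second Integer field has overall index 3 but rank 1 among Integer fields.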

long long int egglib::VcfParser::file_end()

Get the stream position of the end of the loaded VCF file.

Return
the stream position of the end of file.

const char *egglib::VcfParser::file_format()
const

File format string of the current file.

By default, if no VCF file has been set, returns an empty but valid string.

vcf::Alt *egglib::VcfParser::find_alt(const char *id)

Find an alternate allele specification.

Return the address of the vcf::Alt with the specified ID. If none is found, return NULL.

vcf::Filter *egglib::VcfParser::find_filter(const char *id)

Find a filter specification.

Return the address of the vcf::Filter with the specified ID. If none is found, return NULL.

vcf::Format *egglib::VcfParser::find_format(const char *id)

Find a format specification.

Return the address of the vcf::Format with the specified ID. If none is found, return NULL.

vcf::Info *egglib::VcfParser::find_info(const char *id)

Find an info specification.

Return the address of the vcf::Info with the specified ID. If none is found, return NULL.

vcf::Meta *egglib::VcfParser::find_meta(const char *key)

Find a meta-information specification.

Return the address of the vcf::Meta with the specified key. If none is found, return NULL.

const vcf::FlagInfo egglib::VcfParser::FlagInfo(unsigned int)
const

Get a Flag-type INFO entry for the last read variant.

const vcf::TypeInfo<double> &egglib::VcfParser::FloatInfo(unsigned int)
const

Get a Float-type INFO entry for the last read variant.

const vcf::Alt *egglib::VcfParser::get_alt(unsigned int i)
const

Get a specific alternative allele entry.

const vcf::Filter *egglib::VcfParser::get_filter(unsigned int i)
const

Get a specific filter entry.

const vcf::Format *egglib::VcfParser::get_format(unsigned int i)
const

Get a specific format entry.

long long int egglib::VcfParser::get_idx_frt_sample()

Get the index of the first variant in the VcfParser.

long long int egglib::VcfParser::get_index()
const

Get the current position in the stream.

Do not call this method if no stream is set.

const vcf::Info *egglib::VcfParser::get_info(unsigned int i)
const

Get a specific info entry.

const vcf::Meta *egglib::VcfParser::get_meta(unsigned int i)
const

Get a specific meta-information entry.

const char *egglib::VcfParser::get_sample(unsigned int i)
const

Get a sample name.

unsigned int egglib::VcfParser::get_threshold_GL()
const

Get the threshold used for extracting GT from GL.

unsigned int egglib::VcfParser::get_threshold_PL()
const

Get the threshold used for extracting GT from PL.

double egglib::VcfParser::GL(unsigned int sample, unsigned int genotype)
const

Get a GL value.

bool egglib::VcfParser::good()
const

Check if object is ready to parse.

Return
true if the object has a valid stream that is ready to be parsed and has not reached the end of file.

unsigned int egglib::VcfParser::GT(unsigned int sample, unsigned int allele)
const

Get a genotype value.

If has_GT() returns false, the behaviour of this method is undefined.

Return
The index of the allele carried by this sample at this ploidy index, or egglib::UNKNOWN if value is missing. It is NOT guaranteed that if one allele is missing, all are. The method returns 0 if the carried allele is the reference. Otherwise, it returns 1 + the index of the allele within the list given by alternate alleles.
Parameters
  • sample -

    sample index (must be < num_samples()).

  • allele -

    allele, or chromosome, index (must be < ploidy()).
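The index convention of the returned value (0 for the reference, 1 + i for alternate allele i) can be sketched as below. This is a standalone illustration, not egglib code: UNKNOWN_SENTINEL is a hypothetical stand-in for egglib::UNKNOWN, and the reference and alternate alleles are passed in as plain strings.

```cpp
#include <climits>
#include <string>
#include <vector>

// Hypothetical stand-in for egglib::UNKNOWN (assumed: a very large unsigned value).
const unsigned int UNKNOWN_SENTINEL = UINT_MAX;

// Translate a genotype index into an allele string.
// 0 designates the reference; k > 0 designates alternate allele k - 1.
std::string allele_for_gt(unsigned int gt,
                          const std::string &reference,
                          const std::vector<std::string> &alternates) {
    if (gt == UNKNOWN_SENTINEL) return ".";  // missing allele
    if (gt == 0) return reference;
    return alternates.at(gt - 1);            // throws std::out_of_range if invalid
}
```

Because missing data is not guaranteed to affect all alleles of a sample at once, the check against the sentinel has to be made for each allele individually.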

bool egglib::VcfParser::GT_phased(unsigned int i)
const

Check if the genotype of a given sample is phased.

If has_GT() returns false, or if ploidy() returns 1, the returned value is undefined.

bool egglib::VcfParser::GT_phased()
const

Check if the genotypes of all samples are phased.

If has_GT() returns false, or if ploidy() returns 1, the returned value is undefined. If the number of samples is 0, returns true.

bool egglib::VcfParser::has_AA()
const

Check if the INFO field AA (ancestral allele) is available.

The method returns false if the AA field is not defined for the last variant or if its definition does not match the standard.

bool egglib::VcfParser::has_AC()
const

Check if the INFO field AC (allele absolute frequencies) is available.

The method returns false if the AC field is not defined for the last variant, or if its definition does not match the expectation.

bool egglib::VcfParser::has_AF()
const

Check if the INFO field AF (allele relative frequencies) is available.

The method returns false if the AF field is not defined for the last variant, or if its definition does not match the expectation.

bool egglib::VcfParser::has_AN()
const

Check if the INFO field AN (number of called alleles) is available.

The method returns false if the AN field is not defined for the last variant, or if its definition does not match the standard.

bool egglib::VcfParser::has_data()
const

True if any data has been read.

bool egglib::VcfParser::has_GL()
const

Check if the Variant read has GL data.

bool egglib::VcfParser::has_GT()
const

Check if the FORMAT field GT (genotype of each sample) is available.

The method returns false if the GT field is not defined for the last variant, or if its definition does not match the expectation.

bool egglib::VcfParser::has_index()

Check whether an index file has been loaded into a VcfIndex object for this VcfParser.

bool egglib::VcfParser::has_PL()
const

Check if the Variant read has PL data.

const char *egglib::VcfParser::ID(unsigned int i)
const

Get an ID from the last read record.

const char *egglib::VcfParser::index_fname()

Get a default name for an index file, derived from the name of the VCF file loaded in the VcfParser.

const vcf::TypeInfo<int> &egglib::VcfParser::IntegerInfo(unsigned int)
const

Get an Integer-type INFO entry for the last read variant.

bool egglib::VcfParser::is_outgroup(unsigned int)
const

Tells if a sample is set to be outgroup.

unsigned int egglib::VcfParser::len_reference()
const

Get the length of the reference allele from the last read variant.

The default is 0.

unsigned int egglib::VcfParser::num_AC()
const

Get the number of AC values, if defined.

The number of AC values is equal to the number of alternate alleles. If AC is not available, returns an undefined value.

unsigned int egglib::VcfParser::num_AF()
const

Get the number of AF values, if defined.

The number of AF values is equal to the number of alternate alleles. If AF is not available, returns an undefined value.

unsigned int egglib::VcfParser::num_alt()
const

Get the number of alternative allele entries of the instance.

unsigned int egglib::VcfParser::num_alternate()
const

Get the number of alternate alleles for the last read variant.

The value is 0 if no variants were provided.

unsigned int egglib::VcfParser::num_CharacterInfo()
const

Get the number of Character-type INFO entries for the last read variant.

unsigned int egglib::VcfParser::num_failed_tests()
const

Get the number of failed tests for the last read variant.

The value is 0 if all tests passed, and egglib::UNKNOWN if no tests were performed (missing value in file).

unsigned int egglib::VcfParser::num_fields()
const

Get the number of sample FORMAT fields for the last read variant.

unsigned int egglib::VcfParser::num_filter()
const

Get the number of filter entries of the instance.

unsigned int egglib::VcfParser::num_FlagInfo()
const

Get the number of Flag-type INFO entries for the last read variant.

unsigned int egglib::VcfParser::num_FloatInfo()
const

Get the number of Float-type INFO entries for the last read variant.

unsigned int egglib::VcfParser::num_format()
const

Get the number of format entries of the instance.

unsigned int egglib::VcfParser::num_genotypes()
const

Number of genotypes.

unsigned int egglib::VcfParser::num_ID()
const

Get the number of IDs of the last read variant.

unsigned int egglib::VcfParser::num_info()
const

Get the number of info entries of the instance.

unsigned int egglib::VcfParser::num_IntegerInfo()
const

Get the number of Integer-type INFO entries for the last read variant.

unsigned int egglib::VcfParser::num_meta()
const

Get the number of meta-information entries of the instance.

This excludes any FILTER, INFO, FORMAT, ALT and the fileformat.

unsigned int egglib::VcfParser::num_samples()
const

Get the number of samples read from the header.

unsigned int egglib::VcfParser::num_StringInfo()
const

Get the number of String-type INFO entries for the last read variant.

void egglib::VcfParser::open_file(const char *fname)

Open a file for reading.

This method attempts to open the specified file and to read the VCF header. If the file cannot be opened, an EggOpenFileError exception is thrown; if the header is invalid, an EggFormatError exception is thrown.

In case the instance was already processing a stream, it will be dismissed. The stream created by this method will be closed if another stream is created or set (call to open_file() or set_stream() methods), if the close() method is called or upon object destruction.

Parameters
  • fname -

    name of the VCF-formatted file to open.

bool egglib::VcfParser::outgroup_AA()
const

Tells if AA data should be picked to generate sites.

unsigned int egglib::VcfParser::PL(unsigned int sample, unsigned int genotype)
const

Get a PL value.

unsigned int egglib::VcfParser::ploidy()
const

Returns the ploidy of the last variant.

The ploidy must be a strictly positive number. If has_GT() returns false, the returned value is 2.

unsigned long egglib::VcfParser::position()
const

Get the last read variant position.

If the position was the one before first, the constant BEFORE is returned.

double egglib::VcfParser::quality()
const

Get the quality of the last read variant.

Returns the phred-scaled quality of the variant (or no variation), or egglib::UNDEF in case of missing data. Note that UNDEF is a large negative value.
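As a reminder of what "phred-scaled" means, a quality Q corresponds to an error probability P = 10^(-Q/10). The sketch below applies this conversion while guarding against the missing-data value. It is standalone: UNDEF_SENTINEL is a hypothetical stand-in for egglib::UNDEF (the actual value is not given here), and returning -1.0 for missing data is a choice made for this example only.

```cpp
#include <cmath>

// Hypothetical stand-in for egglib::UNDEF (assumed: a large negative value).
const double UNDEF_SENTINEL = -1e300;

// Convert a phred-scaled quality into an error probability.
// Returns -1.0 (never a valid probability) if the quality is missing.
double phred_to_error_probability(double qual) {
    if (qual == UNDEF_SENTINEL) return -1.0;  // check before any arithmetic
    return std::pow(10.0, -qual / 10.0);
}
```

The check matters because the sentinel is a large negative number: fed into the formula unguarded, it would overflow to infinity rather than signal missing data.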

void egglib::VcfParser::read()

Read a single variant.

void egglib::VcfParser::read_chromosome()

Read only the chromosome name.

void egglib::VcfParser::read_header(const char *string)

Pass header string for reading.

In case the instance was already processing a stream, it will be dismissed. This function opens no stream, and it will not be able to read any further line using read_line().

Parameters
  • string -

    string containing the header.

void egglib::VcfParser::read_index(const char *fname)

Find, read and load as a VcfIndex an index file linked to the current VcfParser. The index file is searched for using a default name generated from the name of the VCF file loaded in the current VcfParser.

Parameters
  • fname -

    path of an index file

void egglib::VcfParser::read_line(const char *string)

Read a single variant from a provided string.

const char *egglib::VcfParser::reference()
const

Get the reference allele from the last read variant.

The default is an empty string.

void egglib::VcfParser::reset()

Reset this object.

This method dismisses the current stream, if any. If the file was opened using the open_file() method of the same instance, it is actually closed. If it was passed as a stream using set_stream(), it is forgotten but not closed. If no stream is present, this method does nothing.
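The distinction between a stream owned by the parser and a stream merely borrowed from the caller can be sketched as below. StreamHolder is an invented class used only to illustrate the ownership rule. It is not the VcfParser implementation, and its open_file() returns a boolean instead of throwing.

```cpp
#include <fstream>
#include <istream>
#include <sstream>

// Invented class illustrating the "close what you opened, forget what you
// were given" rule described above.
class StreamHolder {
public:
    StreamHolder() : _current(nullptr) {}

    // Open a file owned by this object; any previous stream is dismissed.
    bool open_file(const char *fname) {
        reset();
        _owned.open(fname);
        if (!_owned.is_open()) return false;
        _current = &_owned;
        return true;
    }

    // Borrow a stream opened by the caller; any previous stream is dismissed.
    void set_stream(std::istream &stream) {
        reset();
        _current = &stream;
    }

    // Close the stream only if this object opened it, then forget it.
    void reset() {
        if (_current == &_owned) _owned.close();
        _current = nullptr;
    }

    bool has_stream() const { return _current != nullptr; }

private:
    std::ifstream _owned;    // file opened through open_file()
    std::istream *_current;  // stream in use (owned or borrowed), or null
};
```

With this design, a caller who passed its own stream can keep using it after the reset, since the holder never closes it.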

void egglib::VcfParser::reset_variant()

Forget information from last read variant.

It is not necessary to call this method before calling read(), even if variants have been read previously.

void egglib::VcfParser::rewind()

Move the VcfParser back to the first variant of the loaded VCF file.

const vcf::SampleInfo &egglib::VcfParser::sample_info(unsigned int i)
const

Get all FORMAT fields for a sample for the last read variant.

The returned instance contains all FORMAT fields for the specified sample. The class vcf::SampleInfo provides methods to access data which are based on the type and on the field rank among the fields of the same type.

Assuming you know the ID of the data field you wish to extract and that its type is, say, String, you should first get its type-based rank by passing its ID to field_rank(const char *) and then call vcf::SampleInfo::StringItem(unsigned int, unsigned int) to get the value of an item. Take care that missing values result in fields with 0 items.

void egglib::VcfParser::set_currline(unsigned int line)

Set the _currline variable.

void egglib::VcfParser::set_index(long long int index)

Set the position in the stream.

Do not call this method if no stream is set.

void egglib::VcfParser::set_outgroup(unsigned int)

Set a sample to be outgroup.

void egglib::VcfParser::set_outgroup_AA()

Tells to add AA data as first sample of outgroup when sites are generated.

void egglib::VcfParser::set_previous_next()

Set the previous_next attribute.

Do not call this method if no stream is set.

void egglib::VcfParser::set_stream(std::istream &stream)

Pass an open stream for reading.

This method sets the passed stream (which is supposed to have been opened for reading). If the stream is not good for reading, an EggArgumentValueError (and not EggOpenFileError) exception is thrown.

In case the instance was already processing a stream, it will be dismissed. The stream passed by this method will not be closed by the class, even when calling close().

Parameters
  • stream -

    open stream to read VCF data from.

void egglib::VcfParser::set_threshold_GL(unsigned int)

Set the threshold for extracting GT from GL (UNKNOWN to prevent, otherwise 1 or more, never 0)

void egglib::VcfParser::set_threshold_PL(unsigned int)

Set the threshold for extracting GT from PL (UNKNOWN to prevent, otherwise 1 or more, never 0)

const vcf::StringInfo &egglib::VcfParser::StringInfo(unsigned int)
const

Get a String-type INFO entry for the last read variant.

void egglib::VcfParser::unread()

Unread a single variant.

Do not call this method if no variant has been read.

Helper classes

class

Class to handle VCF alternate allele definitions.

By default, string accessors return null pointers.

Header: <egglib-cpp/VCF.hpp>

Public Functions

egglib::vcf::Alt::Alt()

Default constructor.

egglib::vcf::Alt::Alt(const char *id, const char *descr)

Initialization constructor.

egglib::vcf::Alt::Alt(Alt &src)

Copy constructor.

egglib::vcf::Alt::Alt(Filter &src)

Copy constructor (from a Filter instance).

egglib::vcf::Alt::~Alt()

Destructor.

Alt &egglib::vcf::Alt::operator=(Alt &src)

Copy assignment operator.

Alt &egglib::vcf::Alt::operator=(Filter &src)

Copy assignment operator (from a Filter instance).

class

Class to handle VCF FILTER specifications.

By default, string accessors return null pointers.

Header: <egglib-cpp/VCF.hpp>

Public Functions

egglib::vcf::Filter::Filter()

Default constructor.

egglib::vcf::Filter::Filter(const char *id, const char *descr)

Initialization constructor.

egglib::vcf::Filter::Filter(const Filter &src)

Copy constructor.

virtual egglib::vcf::Filter::~Filter()

Destructor.

void egglib::vcf::Filter::clear()

Actually free memory.

const char *egglib::vcf::Filter::get_description()
const

Get description string.

const char *egglib::vcf::Filter::get_extra_key(unsigned int idx)
const

Get extra field key.

const char *egglib::vcf::Filter::get_extra_value(unsigned int idx)
const

Get extra field value.

const char *egglib::vcf::Filter::get_ID()
const

Get ID string.

unsigned int egglib::vcf::Filter::get_num_extra()
const

Get number of extra fields.

Filter &egglib::vcf::Filter::operator=(const Filter &src)

Copy assignment operator.

void egglib::vcf::Filter::set_description(const char *descr)

Set description string.

void egglib::vcf::Filter::set_extra(const char *key, const char *value)

Set extra field.

void egglib::vcf::Filter::set_ID(const char *id)

Set ID string.

void egglib::vcf::Filter::update(const char *id, const char *descr)

Setter (reset extra fields)

class

Class representing a Flag-type INFO field.

Header: <egglib-cpp/VCF.hpp>

Public Functions

egglib::vcf::FlagInfo::FlagInfo()

Constructor.

egglib::vcf::FlagInfo::FlagInfo(const FlagInfo &src)

Copy constructor.

virtual egglib::vcf::FlagInfo::~FlagInfo()

Destructor.

const char *egglib::vcf::FlagInfo::get_ID()
const

Get ID string.

FlagInfo &egglib::vcf::FlagInfo::operator=(const FlagInfo &src)

Copy assignment operator.

void egglib::vcf::FlagInfo::set_ID(const char *id)

Set ID string.

class

Class to handle VCF FORMAT specifications.

Format is identical in structure to Info except that the type Flag is not allowed (an EggArgumentValueError is thrown when attempting to set Flag as type).

Header: <egglib-cpp/VCF.hpp>

Public Functions

egglib::vcf::Format::Format()

Default constructor.

egglib::vcf::Format::Format(const char *id, unsigned int num, Info::Type t, const char *descr)

Initialization constructor.

egglib::vcf::Format::Format(const Format &src)

Copy constructor.

egglib::vcf::Format::Format(const Info &src)

Copy constructor (from an Info instance).

virtual Info::Type egglib::vcf::Format::get_type()
const

Get type.

Format &egglib::vcf::Format::operator=(const Format &src)

Copy assignment operator.

Format &egglib::vcf::Format::operator=(const Info &src)

Copy assignment operator (from an Info instance).

virtual void egglib::vcf::Format::set_type(Info::Type t)

Set type.

void egglib::vcf::Format::update(const char *id, unsigned int num, Info::Type t, const char *descr)

Setter.

class

Class to handle VCF INFO specifications.

By default, string accessors return null pointers and other accessors return undefined values. The number of values might be egglib::UNKNOWN (unspecified number of values, represented by the character “.” in files), vcf::NUM_ALTERNATE (match the number of ALT variants, represented by the character “A” in files) or vcf::NUM_GENOTYPES (match the number of possible genotypes, represented by the character “G” in files). Note that UNKNOWN, vcf::NUM_ALTERNATE and vcf::NUM_GENOTYPES are all large positive values. There is one more special value: vcf::NUM_POSSIBLE_ALLELES (like vcf::NUM_ALTERNATE but including the reference).
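The mapping between the Number token of a header line and these special values can be sketched as follows. The sketch is standalone and the four constants are hypothetical stand-ins: the actual values of egglib::UNKNOWN and the vcf::NUM_* constants are not given here, only that they are large positive numbers. Mapping the character "R" to the "possible alleles" case follows the VCF specification and is an assumption about how vcf::NUM_POSSIBLE_ALLELES is used.

```cpp
#include <climits>
#include <string>

// Hypothetical stand-ins for egglib::UNKNOWN and the vcf::NUM_* constants.
const unsigned int UNKNOWN_SENTINEL = UINT_MAX;
const unsigned int NUM_ALTERNATE_SENTINEL = UINT_MAX - 1;
const unsigned int NUM_GENOTYPES_SENTINEL = UINT_MAX - 2;
const unsigned int NUM_POSSIBLE_ALLELES_SENTINEL = UINT_MAX - 3;

// Interpret the Number token of an INFO or FORMAT header line.
// Throws std::invalid_argument (from std::stoul) if the token is not numeric.
unsigned int parse_number(const std::string &token) {
    if (token == ".") return UNKNOWN_SENTINEL;
    if (token == "A") return NUM_ALTERNATE_SENTINEL;
    if (token == "G") return NUM_GENOTYPES_SENTINEL;
    if (token == "R") return NUM_POSSIBLE_ALLELES_SENTINEL;
    return static_cast<unsigned int>(std::stoul(token));
}
```

Since the special values are ordinary unsigned integers, code that loops over the number of values must first compare it against each of them.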

Header: <egglib-cpp/VCF.hpp>

Public Types

enum egglib::vcf::Info::Type

Meta-information types.

This enum is used to specify FORMAT and INFO types (Flag is accepted for INFO only, not for FORMAT).

Values:

Public Functions

egglib::vcf::Info::Info()

Default constructor.

egglib::vcf::Info::Info(const char *id, unsigned int num, Info::Type t, const char *descr)

Initialization constructor.

egglib::vcf::Info::Info(const Info &src)

Copy constructor.

unsigned int egglib::vcf::Info::get_number()
const

Get number of values.

virtual Info::Type egglib::vcf::Info::get_type()
const

Get type.

Info &egglib::vcf::Info::operator=(const Info &src)

Copy assignment operator.

void egglib::vcf::Info::set_number(unsigned int num)

Set number of values.

virtual void egglib::vcf::Info::set_type(Info::Type t)

Set type.

void egglib::vcf::Info::update(const char *id, unsigned int num, Info::Type t, const char *descr)

Setter.

class

Class to handle VCF meta-information entries.

By default, string accessors return null pointers.

Header: <egglib-cpp/VCF.hpp>

Public Functions

egglib::vcf::Meta::Meta()

Default constructor.

egglib::vcf::Meta::Meta(const char *k, const char *v)

Initialization constructor.

egglib::vcf::Meta::Meta(Meta &src)

Copy constructor.

egglib::vcf::Meta::~Meta()

Destructor.

void egglib::vcf::Meta::clear()

Actually free memory.

const char *egglib::vcf::Meta::get_key()
const

Get key.

const char *egglib::vcf::Meta::get_value()
const

Get value.

Meta &egglib::vcf::Meta::operator=(Meta &src)

Copy assignment operator.

void egglib::vcf::Meta::set_key(const char *k)

Set key.

void egglib::vcf::Meta::set_value(const char *v)

Set value.

void egglib::vcf::Meta::update(const char *k, const char *v)

Setter.

class

Class storing information fields for a sample.

Header: <egglib-cpp/VCF.hpp>

Public Functions

egglib::vcf::SampleInfo::SampleInfo()

Constructor.

egglib::vcf::SampleInfo::SampleInfo(const SampleInfo &src)

Copy constructor.

egglib::vcf::SampleInfo::~SampleInfo()

Destructor.

char egglib::vcf::SampleInfo::CharacterItem(unsigned int i, unsigned int j)
const

Get an item for a Character-type entry.

void egglib::vcf::SampleInfo::clear()

Actually free all memory allocated by this instance.

double egglib::vcf::SampleInfo::FloatItem(unsigned int i, unsigned int j)
const

Get an item for a Float-type entry.

int egglib::vcf::SampleInfo::IntegerItem(unsigned int i, unsigned int j)
const

Get an item for an Integer-type entry.

unsigned int egglib::vcf::SampleInfo::num_CharacterEntries()
const

Number of Character-type entries.

unsigned int egglib::vcf::SampleInfo::num_CharacterItems(unsigned int i)
const

Number of items for a Character-type entry.

unsigned int egglib::vcf::SampleInfo::num_FloatEntries()
const

Number of Float-type entries.

unsigned int egglib::vcf::SampleInfo::num_FloatItems(unsigned int i)
const

Number of items for a Float-type entry.

unsigned int egglib::vcf::SampleInfo::num_IntegerEntries()
const

Number of Integer-type entries.

unsigned int egglib::vcf::SampleInfo::num_IntegerItems(unsigned int i)
const

Number of items for an Integer-type entry.

unsigned int egglib::vcf::SampleInfo::num_StringEntries()
const

Number of String-type entries.

unsigned int egglib::vcf::SampleInfo::num_StringItems(unsigned int i)
const

Number of items for a String-type entry.

SampleInfo &egglib::vcf::SampleInfo::operator=(const SampleInfo &src)

Copy assignment operator.

void egglib::vcf::SampleInfo::reset()

Restore the object to the initial state.

This method does not free allocated memory, which is reserved for later use.

const char *egglib::vcf::SampleInfo::StringItem(unsigned int i, unsigned int j)
const

Get an item for a String-type entry.

class

Class for String-type INFO fields.

Header: <egglib-cpp/VCF.hpp>

Public Functions

egglib::vcf::StringInfo::StringInfo()

Constructor.

egglib::vcf::StringInfo::StringInfo(const StringInfo &src)

Copy constructor.

virtual egglib::vcf::StringInfo::~StringInfo()

Destructor.

void egglib::vcf::StringInfo::change(unsigned int item, unsigned int position, char value)

Set a character (must fit in length)

StringInfo &egglib::vcf::StringInfo::operator=(const StringInfo &src)

Copy assignment operator.

template <class T>
class

Template for Character (char), Integer (int) and Float (double)-type INFO fields.

Header: <egglib-cpp/VCF.hpp>

Public Functions

egglib::vcf::TypeInfo::TypeInfo()

Constructor.

egglib::vcf::TypeInfo::TypeInfo(const TypeInfo<T> &src)

Copy constructor.

virtual egglib::vcf::TypeInfo::~TypeInfo()

Destructor.

unsigned int egglib::vcf::TypeInfo::get_expected_number()
const

Get expected number of items.

const T &egglib::vcf::TypeInfo::item(unsigned int i)
const

Get an item (missing data are encoded by type-specific special values)

unsigned int egglib::vcf::TypeInfo::num_items()
const

Get number of items available in the instance.

TypeInfo<T> &egglib::vcf::TypeInfo::operator=(const TypeInfo<T> &src)

Copy assignment operator.

void egglib::vcf::TypeInfo::reset()

Reset instance.

void egglib::vcf::TypeInfo::set_expected_number(unsigned int n)

Set expected number of items.

VcfWindow class

class

Public Functions

egglib::VcfWindow::VcfWindow()

Constructor.

egglib::VcfWindow::~VcfWindow()

Destructor.

const char *egglib::VcfWindow::chromosome()
const

Get chromosome.

unsigned int egglib::VcfWindow::first_pos()
const

Window first position (UNKNOWN if unit is not bp and no site at all)

const WSite *egglib::VcfWindow::first_site()
const

Get first site (NULL if no site at all)

unsigned int egglib::VcfWindow::first_site_pos()
const

Position of the first site.

const WSite *egglib::VcfWindow::get_site(unsigned int)
const

Get a site by random access (slower; there must be enough sites)

bool egglib::VcfWindow::good()
const

False if sliding has completed.

unsigned int egglib::VcfWindow::last_pos()
const

Window last position (included) (UNKNOWN if unit is not bp and no site at all)

const WSite *egglib::VcfWindow::last_site()
const

Get last site (NULL if no site at all)

unsigned int egglib::VcfWindow::last_site_pos()
const

Position of the last site.

void egglib::VcfWindow::next_window()

Load the next window.

unsigned int egglib::VcfWindow::num_samples()
const

Number of samples.

unsigned int egglib::VcfWindow::num_sites()
const

Number of actual sites (from vcf)

void egglib::VcfWindow::setup(VcfParser &vcf, unsigned int wsize, unsigned int wstep, bool unit_bp, unsigned int start_pos, unsigned int stop_pos, unsigned int max_missing)

Setup a new sliding window.

WSite class

class

Class for sites within a doubly-linked list, backed by a WSite pool.

Public Functions

egglib::WSite::WSite(WPool *p)

Constructor.

egglib::WSite::~WSite()

Destructor.

unsigned int egglib::WSite::get_pos()
const

Get site position.

void egglib::WSite::init()

Set pointers to NULL.

WSite *egglib::WSite::next()

Get next site.

WSite *egglib::WSite::pop_back()

Disconnect last, return it to pool, return its predecessor.

WSite *egglib::WSite::pop_front()

Disconnect first, return it to pool, return its follower.

WSite *egglib::WSite::prev()

Get previous site.

WSite *egglib::WSite::push_back(WSite *ws)

Add site to the end, return new end.

void egglib::WSite::reset(unsigned int pl)

Reset values.

void egglib::WSite::set_pos(unsigned int)

Set site position.

SiteHolder &egglib::WSite::site()

Get the included site object.
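The list operations described by push_back() and pop_front() can be sketched with a minimal doubly-linked node. This is a standalone illustration, not the WSite implementation: SiteNode is an invented type, the operations are written as free functions, and the pool to which disconnected sites are returned is left out.

```cpp
// Invented node type; a WSite also carries a site object and a pool link.
struct SiteNode {
    unsigned int pos;
    SiteNode *prev;
    SiteNode *next;
    explicit SiteNode(unsigned int p) : pos(p), prev(nullptr), next(nullptr) {}
};

// Connect `node` after the current end `tail` (which may be null for an
// empty list) and return the new end.
SiteNode *push_back(SiteNode *tail, SiteNode *node) {
    node->prev = tail;
    node->next = nullptr;
    if (tail != nullptr) tail->next = node;
    return node;
}

// Disconnect the first node `head` and return its follower
// (null if the list becomes empty).
SiteNode *pop_front(SiteNode *head) {
    SiteNode *follower = head->next;
    if (follower != nullptr) follower->prev = nullptr;
    head->next = nullptr;
    return follower;
}
```

Returning the new end (or the new start) lets the caller keep track of the list boundaries without a separate container object.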

Others

GFF3 class

class

GFF3 parser

Read GFF3-formatted genome annotation data from a file specified by name or from an open stream.

The description of the GFF3 format:

http://www.sequenceontology.org/gff3.shtml

This class supports segmented features but only if they are consecutive in the file. All features are loaded into memory and can be processed iteratively. Two sets of accessors are provided. The first, feature() and num_features(), allows processing all imported features in the order in which they were loaded; each feature provides access to its own parents and parts. The second, gene() and num_genes(), allows processing the subset of features that are of type gene.

Header: <egglib-cpp/GFF3.hpp>

Public Functions

egglib::GFF3::GFF3()

Build an empty object.

virtual egglib::GFF3::~GFF3()

Destructor.

void egglib::GFF3::clear()

Like reset() but actually free memory.

Feature &egglib::GFF3::feature(unsigned int i)

Get a feature.

Feature &egglib::GFF3::gene(unsigned int i)

Get a gene feature.

void egglib::GFF3::liberal(bool flag)

Set the liberal flag.

The liberal flag allows a few violations of the format as specified in the standard (more violations might be added in the future):

  • CDS features may lack a phase (then the no_phase value is used).

By default, the parsers are strict. The new value affects all subsequent parsing operations.
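The effect of the liberal flag on the phase column can be sketched as below. This is a standalone illustration rather than the parser's code: the Phase enum and its value names are hypothetical (modelled on Feature::PHASE), and std::invalid_argument stands in for whatever exception the library throws.

```cpp
#include <stdexcept>
#include <string>

// Hypothetical enum modelled on egglib::Feature::PHASE.
enum class Phase { zero, one, two, no_phase };

// Interpret the phase column of a GFF3 line for a feature of the given type.
// In strict mode, a CDS feature without a phase is rejected.
Phase parse_phase(const std::string &type, const std::string &column, bool liberal) {
    if (column == "0") return Phase::zero;
    if (column == "1") return Phase::one;
    if (column == "2") return Phase::two;
    if (column == ".") {
        if (type == "CDS" && !liberal) {
            throw std::invalid_argument("CDS feature requires a phase");
        }
        return Phase::no_phase;
    }
    throw std::invalid_argument("invalid phase: " + column);
}
```

For features other than CDS, a missing phase is accepted in both modes, since the phase is only required for CDS features.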

const char *egglib::GFF3::metadata_key(unsigned int i)
const

Get a meta-data key.

const char *egglib::GFF3::metadata_value(unsigned int i)
const

Get a meta-data value.

unsigned int egglib::GFF3::num_features()
const

Total number of features.

unsigned int egglib::GFF3::num_genes()
const

Number of gene features.

unsigned int egglib::GFF3::num_metadata()
const

Get the number of defined meta-data.

void egglib::GFF3::parse(const char *fname)

Parse a GFF3-formatted file.

void egglib::GFF3::parse(std::istream &stream)

Parse an open GFF3-formatted stream.

void egglib::GFF3::parse_string(std::string &string)

Parse a GFF3-formatted string.

void egglib::GFF3::reset()

Clear data stored in the object (but don’t free memory)

const DataHolder &egglib::GFF3::sequences()
const

Get sequences (in case they were present in file) as a non-matrix object.

class

Annotation feature.

Objects of this class are used to describe annotation features in the GFF3 format read by the GFF3 class.

Header: <egglib-cpp/GFF3.hpp>

Public Types

enum egglib::Feature::PHASE

Enum for reading frame specification.

Values:

Codon starts at first base.

Codon starts at second base.

Codon starts at third base.

Not defined (irrelevant)

enum egglib::Feature::STRAND

Enum for strand specification.

Values:

Forward strand.

Reverse strand.

Not defined (irrelevant)

Public Functions

egglib::Feature::Feature()

Constructor.

egglib::Feature::Feature(const Feature &src)

Copy constructor.

Note that “parent” and “parts” members (which are links to other Feature objects) are shallow-copied. As a result, the copy object will point to the same parents and/or parts as the original.
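The consequence of shallow-copying links can be seen in a minimal sketch. Node is an invented type that only mimics the parent and part links of a Feature, and the copy shown is the compiler-generated one, which copies the pointers and not the objects they designate.

```cpp
#include <string>
#include <vector>

// Invented type mimicking the links held by a Feature.
struct Node {
    std::string id;
    std::vector<Node *> parents;  // non-owning links to other nodes
    std::vector<Node *> parts;    // non-owning links to other nodes
};

// Return true if a copy of `original` designates the very same first parent
// object as `original` itself (false if there is no parent).
bool copy_shares_parent(const Node &original) {
    Node copy = original;  // member-wise copy: pointers are copied, not pointees
    return !copy.parents.empty() && copy.parents[0] == original.parents[0];
}
```

A change made to a parent through the original is therefore visible through the copy as well, and the copy becomes invalid if the parent is destroyed.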

virtual egglib::Feature::~Feature()

Destructor.

void egglib::Feature::clear()

Actually release memory of the instance.

const char *egglib::Feature::get_Alias(unsigned int i)
const

Get the “Alias” attribute at the specified index.

const char *egglib::Feature::get_attribute_key(unsigned int i)
const

Get a custom attribute key.

const char *egglib::Feature::get_attribute_value(unsigned int attr, unsigned int item)
const

Get the value of a custom attribute item.

const char *egglib::Feature::get_Dbxref(unsigned int i)
const

Get the “Dbxref” attribute at the specified index.

const char *egglib::Feature::get_Derives_from()
const

Get “Derives_from” attribute (empty string if missing)

unsigned int egglib::Feature::get_end(unsigned int i)
const

Get “end” for a given fragment.

const char *egglib::Feature::get_Gap()
const

Get “Gap” attribute (empty string if missing)

const char *egglib::Feature::get_ID()
const

Get “ID” attribute (empty string if missing)

bool egglib::Feature::get_Is_circular()
const

Get “Is_circular” attribute (true if present)

const char *egglib::Feature::get_Name()
const

Get “Name” attribute (empty string if missing)

const char *egglib::Feature::get_Note(unsigned int i)
const

Get the “Note” attribute at the specified index.

unsigned int egglib::Feature::get_num_Alias()
const

Get the number of “Alias” attributes.

unsigned int egglib::Feature::get_num_attributes()
const

Number of custom attributes.

Does not take into account pre-defined attributes which are identified by a first capital letter, and which are accessible through specific methods.

unsigned int egglib::Feature::get_num_Dbxref()
const

Get the number of “Dbxref” attributes.

unsigned int egglib::Feature::get_num_fragments()
const

Get number of fragments (number of “start” and “end” fields; 0 by default)

unsigned int egglib::Feature::get_num_items_attribute(unsigned int i)
const

Number of items of a custom attribute.

unsigned int egglib::Feature::get_num_Note()
const

Get the number of “Note” attributes.

unsigned int egglib::Feature::get_num_Ontology_term()
const

Get the number of “Ontology_term” attributes.

unsigned int egglib::Feature::get_num_Parent() const

Get the number of “Parent” attributes.

unsigned int egglib::Feature::get_num_parents() const

Get the number of parents.

unsigned int egglib::Feature::get_num_parts() const

Get the number of parts.

const char *egglib::Feature::get_Ontology_term(unsigned int i) const

Get the “Ontology_term” attribute at the specified index.

const char *egglib::Feature::get_Parent(unsigned int i) const

Get the “Parent” attribute at the specified index.

Feature *egglib::Feature::get_parent(unsigned int i) const

Get a parent.

Feature *egglib::Feature::get_part(unsigned int i) const

Get a part.

Feature::PHASE egglib::Feature::get_phase() const

Get “phase” field (no_phase by default)

double egglib::Feature::get_score() const

Get “score” field (UNDEF by default)

egglib::UNDEF stands for “undetermined” and is equal to a very large negative value. UNDEF is also returned if the score in the file is “NaN” (in any combination of upper and lower case).
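The sentinel convention described above can be sketched in isolation. This is a standalone illustration, not egglib code: UNDEF_SKETCH, parse_score and has_score are hypothetical names, and the sentinel value used here is only an assumption standing in for egglib::UNDEF, whose actual value is defined by the library.

```cpp
#include <cctype>
#include <cstdlib>
#include <limits>
#include <string>

// Hypothetical stand-in for egglib::UNDEF (assumed value, for illustration only).
const double UNDEF_SKETCH = -std::numeric_limits<double>::max();

// Convert a score field to a double, mapping "NaN" in any letter case
// to the sentinel value.
double parse_score(const std::string &field) {
    std::string lower;
    for (unsigned char c : field) {
        lower += static_cast<char>(std::tolower(c));
    }
    if (lower == "nan") {
        return UNDEF_SKETCH;
    }
    return std::strtod(field.c_str(), nullptr);
}

// Callers compare against the sentinel before using a score.
bool has_score(double score) {
    return score != UNDEF_SKETCH;
}
```

With the real class, the equivalent check would presumably compare the value returned by get_score() with egglib::UNDEF before using it in a computation.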

const char *egglib::Feature::get_seqid() const

Get “seqid” field (empty string by default)

const char *egglib::Feature::get_source() const

Get “source” field (empty string by default)

unsigned int egglib::Feature::get_start(unsigned int i) const

Get “start” for a given fragment.

Feature::STRAND egglib::Feature::get_strand() const

Get “strand” field (no_strand by default)

const char *egglib::Feature::get_Target() const

Get “Target” attribute (empty string if missing)

const char *egglib::Feature::get_type() const

Get “type” field (empty string by default)

Feature &egglib::Feature::operator=(const Feature &src)

Copy assignment operator.

Note that “parent” and “parts” members (which are links to other Feature objects) are shallow-copied. As a result, the copy object will point to the same parents and/or parts as the original.
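The consequence of a shallow copy of links can be illustrated with a miniature stand-in. Node and shallow_copy below are hypothetical and do not reproduce the Feature implementation; they only show that copying pointer members duplicates the pointer values, so the original and the copy refer to the same linked objects.

```cpp
#include <vector>

// Hypothetical miniature of an object that links to other objects.
struct Node {
    int id;
    std::vector<Node *> parents;  // links only; the pointed-to nodes are not owned
};

// Copy a node shallowly: the pointer values are duplicated,
// not the objects they point to.
Node shallow_copy(const Node &src) {
    Node dst;
    dst.id = src.id;
    dst.parents = src.parents;
    return dst;
}
```

A change made to a parent through the original is therefore visible through the copy as well, and the linked objects must outlive both.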

void egglib::Feature::reset()

Reset instance (but retain allocated memory)

void egglib::Feature::set_Alias(unsigned int i, const char *s)

Set the “Alias” attribute at the specified index.

void egglib::Feature::set_attribute_key(unsigned int i, const char *str)

Set a custom attribute key.

A legal key must not start with a capital letter: keys with an initial capital are reserved for the pre-defined attributes, which must be set through their own methods.
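The key rule can be checked before calling set_attribute_key(). The helper below is a hypothetical sketch, not part of the library; rejecting empty keys is an additional assumption made here, since the documentation only states the rule about the initial capital letter.

```cpp
#include <cctype>

// Hypothetical helper: true if `key` may be used as a custom attribute key,
// i.e. it is non-empty (assumption) and does not start with a capital letter.
bool is_custom_key(const char *key) {
    if (key == nullptr || key[0] == '\0') {
        return false;
    }
    return !std::isupper(static_cast<unsigned char>(key[0]));
}
```

Under this rule, a key such as "locus_tag" is accepted, whereas "Name" is rejected because it belongs to the pre-defined attributes.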

void egglib::Feature::set_attribute_value(unsigned int attr, unsigned int item, const char *str)

Set a custom attribute value.

void egglib::Feature::set_Dbxref(unsigned int i, const char *s)

Set the “Dbxref” attribute at the specified index.

void egglib::Feature::set_Derives_from(const char *str)

Set “Derives_from” attribute (use an empty string to skip)

void egglib::Feature::set_end(unsigned int i, unsigned int val)

Set “end” for a given fragment.

void egglib::Feature::set_Gap(const char *str)

Set “Gap” attribute (use an empty string to skip)

void egglib::Feature::set_ID(const char *str)

Set “ID” attribute (use an empty string to skip)

void egglib::Feature::set_Is_circular(bool b)

Set “Is_circular” attribute (false to skip)

void egglib::Feature::set_Name(const char *str)

Set “Name” attribute (use an empty string to skip)

void egglib::Feature::set_Note(unsigned int i, const char *s)

Set the “Note” attribute at the specified index.

void egglib::Feature::set_num_Alias(unsigned int num)

Set the number of “Alias” attributes.

If the new value is larger, the new attributes are set to empty strings. If the new value is smaller, the last attributes are lost.

void egglib::Feature::set_num_attributes(unsigned int num)

Set the number of attributes.

Pre-defined attributes (which are identified by an initial capital letter) are not taken into account. If the new value is larger, each new attribute gets an empty string as key and 0 items. If the new value is smaller, the last attributes are lost.

void egglib::Feature::set_num_Dbxref(unsigned int num)

Set the number of “Dbxref” attributes.

If the new value is larger, the new attributes are set to empty strings. If the new value is smaller, the last attributes are lost.

void egglib::Feature::set_num_fragments(unsigned int num)

Set number of fragments (number of “start” and “end” fields)

If the new value is larger, the new “start” and “end” values are both initialized to 0. If the new value is smaller, the last values are lost.
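The grow-and-shrink behaviour shared by the set_num_* methods can be sketched with standard containers. The Fragments struct below is a hypothetical stand-in, not the library's internal storage; it only mirrors the documented semantics for fragments.

```cpp
#include <vector>

// Hypothetical stand-in for the fragment storage of a feature.
struct Fragments {
    std::vector<unsigned int> start;
    std::vector<unsigned int> end;

    // Growing appends (0, 0) pairs; shrinking drops pairs from the back.
    // Values that remain are left untouched in both cases.
    void set_num(unsigned int num) {
        start.resize(num, 0);
        end.resize(num, 0);
    }
};
```

With the real class, the usual pattern would presumably be to call set_num_fragments() first and then fill each index with set_start() and set_end().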

void egglib::Feature::set_num_items_attribute(unsigned int i, unsigned int num)

Set the number of items of a custom attribute.

If the new value is larger, the new items are set to empty strings. If the new value is smaller, the last items are lost.

void egglib::Feature::set_num_Note(unsigned int num)

Set the number of “Note” attributes.

If the new value is larger, the new attributes are set to empty strings. If the new value is smaller, the last attributes are lost.

void egglib::Feature::set_num_Ontology_term(unsigned int num)

Set the number of “Ontology_term” attributes.

If the new value is larger, the new attributes are set to empty strings. If the new value is smaller, the last attributes are lost.

void egglib::Feature::set_num_Parent(unsigned int num)

Set the number of “Parent” attributes.

If the new value is larger, the new attributes are set to empty strings. If the new value is smaller, the last attributes are lost.

void egglib::Feature::set_num_parents(unsigned int num)

Set the number of parents.

If the new value is larger, the new items are set to NULL. If the new value is smaller, the last items are lost.

void egglib::Feature::set_num_parts(unsigned int num)

Set the number of parts.

If the new value is larger, the new items are set to NULL. If the new value is smaller, the last items are lost.
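Because newly created parent and part slots hold NULL until they are assigned, code that walks these links should test each pointer before dereferencing it. The sketch below uses a hypothetical Item struct in place of Feature to show the pattern.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for an object holding links to its parts.
struct Item {
    std::vector<Item *> parts;
};

// Count the slots that have actually been assigned, skipping NULL entries.
std::size_t count_assigned_parts(const Item &item) {
    std::size_t n = 0;
    for (const Item *p : item.parts) {
        if (p != nullptr) {
            ++n;
        }
    }
    return n;
}
```

The same guard would apply to a loop over get_num_parts() and get_part() on a real Feature whose slots have been resized but not yet all filled.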

void egglib::Feature::set_Ontology_term(unsigned int i, const char *s)

Set the “Ontology_term” attribute at the specified index.

void egglib::Feature::set_Parent(unsigned int i, const char *s)

Set the “Parent” attribute at the specified index.

void egglib::Feature::set_parent(unsigned int i, Feature *feat)

Set a parent.

void egglib::Feature::set_part(unsigned int i, Feature *feat)

Set a part.

void egglib::Feature::set_phase(PHASE p)

Set “phase” field.

void egglib::Feature::set_score(double d)

Set “score” field.

void egglib::Feature::set_seqid(const char *str)

Set “seqid” field.

void egglib::Feature::set_source(const char *str)

Set “source” field.

void egglib::Feature::set_start(unsigned int i, unsigned int val)

Set “start” for a given fragment.

void egglib::Feature::set_strand(STRAND s)

Set “strand” field.

void egglib::Feature::set_Target(const char *str)

Set “Target” attribute (use an empty string to skip)

void egglib::Feature::set_type(const char *str)

Set “type” field.