EggLib

Genetic data holder types

The classes Align and Container allow storing and manipulating genetic data without restriction on data encoding. Tree allows storing and manipulating trees.

Align

class egglib.Align(nsam=0, nout=0, nsit=0, init=0)

Bases: egglib._interface.DataBase

Holds a data set with associated sample names and group information. The data consists of given numbers of ingroup and outgroup samples, each with the same number of sites. There can be any number of group levels (but this number must be the same for all samples), meaning that samples can be described by several group labels in addition to their name. Group labels are not group indices (they do not need to be consecutive). There is a separate data set for samples belonging to the outgroup. There can be any number of outgroup samples. Outgroup samples always have one level of group labels that should be used to specify individuals (when appropriate). All data are represented by signed integers.

By default, the constructor generates an empty instance (0 samples and 0 sites).

By default, samples have empty names and no group levels are defined.

Parameters:
  • nsam – number of ingroup samples.
  • nout – number of outgroup samples.
  • nsit – number of sites.
  • init – initial values for all data entries (may be a signed integer or a single-character string; ignored if nsam or nsit is 0).

New in version 3.0.0: Reimplementation of the Align class.

add_outgroup(name, data, group=None)

Add an outgroup sample to the instance.

Parameters:
  • name – name of the new sample.
  • data – an iterable (may be a string or a list) containing the data to set for the new sample. For an Align instance and if the sample is not the first, the number of data entries must match that of all previous samples (ingroup or outgroup).
  • group – if not None, must be an unsigned integer (the default value is 0).
add_sample(name, data, groups=None)

Add a sample to the instance.

Parameters:
  • name – name of the new sample.
  • data – an iterable (may be a string or a list) containing the data to set for the new sample. For an Align instance and if the sample is not the first, the number of data entries must match that of all previous samples (including outgroup samples, if any).
  • groups – if not None, must be an iterable with an unsigned integer for each group level of the instance (if None, all group labels are set to 0), or a single integer value if only one level needs to be specified.
add_samples(items)

Add several samples at the end of the instance.

Parameters:items – an iterable that has a length (for example a list, an Align instance or a Container instance). Each item of items must be of length 2 or 3. The first element must be the sample name string, the second element is the data values (as a list of signed integers or a single string) and the third element, if provided, is the list of group labels (one unsigned integer for each level). See the method add_sample() for more details about each added item. If the current instance is an Align, all added items must have the same length, which must be the same as that of any items currently present in the instance. For a Container, the items may have different lengths. The number of group levels is set by the sample with the largest number of group labels, and all samples with fewer labels are completed with the default value (0).

New in version 2.0.1: Original name is addSequences().

Changed in version 3.0.0: Renamed as add_samples. All items are added in one shot (rather than calling the one-sample add method iteratively).
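
The padding rule for group labels can be sketched in plain Python. This is only an illustration of the rule as described above, not EggLib's implementation: normalise_items is a hypothetical helper, and the tuples stand in for the items that would be passed to add_samples().

```python
def normalise_items(items):
    """Pad group labels so every sample has the same number of levels."""
    parsed = []
    for item in items:
        if len(item) not in (2, 3):
            raise ValueError("each item must have 2 or 3 elements")
        name, data = item[0], item[1]
        groups = list(item[2]) if len(item) == 3 else []
        parsed.append((name, data, groups))
    # The sample with the most labels sets the number of group levels.
    nlevels = max((len(g) for _, _, g in parsed), default=0)
    # Samples with fewer labels are completed with the default value 0.
    return [(n, d, g + [0] * (nlevels - len(g))) for n, d, g in parsed]


items = [("s1", "ACGT", [1, 4]), ("s2", "ACGA"), ("s3", "ACGG", [2])]
padded = normalise_items(items)
```

Here the first sample carries two labels, so the other two are padded to two levels.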

clear()

Clear the instance and release all memory. In most cases, it is preferable to use the method reset().

column(index, ingroup=True, outgroup=True)

Extract the allele values of a site at a given position.

Parameters:
  • index – the index of a site within the alignment.
  • ingroup – A boolean indicating whether ingroup samples should be extracted.
  • outgroup – A boolean indicating whether outgroup samples should be extracted.

Returns one or two lists of integers providing the allele for all samples. If both ingroup and outgroup are True, two lists are returned, respectively for the ingroup and for the outgroup. If only one of ingroup and outgroup is True, a single list is returned with data for the selected group. If neither is True, the returned value is None (no error).
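
The shape of the return value depends on the two flags. The stand-alone sketch below mimics that logic on plain lists of integers; the function and its first two arguments are hypothetical and only serve to illustrate the four cases.

```python
def column(ingroup_rows, outgroup_rows, index, ingroup=True, outgroup=True):
    """Return the alleles at one site, following the flag rules above."""
    ing = [row[index] for row in ingroup_rows]
    out = [row[index] for row in outgroup_rows]
    if ingroup and outgroup:
        return ing, out      # two lists: ingroup first, then outgroup
    if ingroup:
        return ing           # a single list
    if outgroup:
        return out           # a single list
    return None              # neither requested: None, no error


# Alleles are stored as integers (here, character codes).
ingroup_rows = [[ord(c) for c in s] for s in ("ACG", "ATG")]
outgroup_rows = [[ord(c) for c in "AGG"]]
both = column(ingroup_rows, outgroup_rows, 1)
only_out = column(ingroup_rows, outgroup_rows, 1, ingroup=False)
```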

consensus(ingroup=True, outgroup=False, ignore=False)

Generates the consensus of the object, assuming nucleotide sequences. The consensus is generated based on standard ambiguity (IUPAC) codes. The consensus is returned as a string object, of length matching the alignment length. The input alignment can contain nucleotide bases (A, C, G and T), and all ambiguity codes (V, H, M, D, R, W, B, S, Y, K and N). N stands for any of A, C, G and T. B stands for any of C, G and T and so on. Case is ignored. U is treated exactly as T. Gaps are represented by - and missing data by ?. Any other value will result in a ValueError. If a site is not variable, the fixed value is incorporated in the consensus in all cases.

Parameters:
  • ingroup – A boolean indicating whether the ingroup must be exported.
  • outgroup – A boolean indicating whether the outgroup must be exported.
  • ignore – a boolean indicating whether missing data should be ignored. If missing data are not ignored (by default), any missing data (including gaps) cause the whole site to have ? as consensus. If ignore is True, missing data are skipped and the consensus is based on the other (non-missing) data. If there is no non-missing data, the consensus is N.

Changed in version 3.0.0: Options are added, and implementation is modified.
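
The consensus rules can be sketched in plain Python as follows. This is an independent illustration based on the description above and on the standard IUPAC codes, not EggLib's code. In particular, applying the "fixed site" rule before the missing-data rule is an assumption about how the two interact.

```python
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "M": "AC", "R": "AG", "W": "AT", "S": "CG", "Y": "CT", "K": "GT",
    "V": "ACG", "H": "ACT", "D": "AGT", "B": "CGT", "N": "ACGT",
}
CODE_FOR = {frozenset(bases): code for code, bases in IUPAC.items()}
MISSING = {"-", "?"}


def consensus_site(column, ignore=False):
    # Case is ignored and U is treated as T.
    values = [c.upper().replace("U", "T") for c in column]
    for v in values:
        if v not in IUPAC and v not in MISSING:
            raise ValueError("invalid character: %r" % v)
    if len(set(values)) == 1:
        return values[0]          # non-variable site (assumed to take priority)
    present = [v for v in values if v not in MISSING]
    if len(present) < len(values) and not ignore:
        return "?"                # any gap or missing data spoils the site
    if not present:
        return "N"                # nothing left after skipping missing data
    bases = set()
    for v in present:
        bases.update(IUPAC[v])    # expand ambiguity codes to plain bases
    return CODE_FOR[frozenset(bases)]


def consensus(sequences, ignore=False):
    return "".join(consensus_site(col, ignore) for col in zip(*sequences))


seqs = ["ACGT-A", "ACGA??", "AuGC-G"]
strict = consensus(seqs)
lenient = consensus(seqs, ignore=True)
```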

classmethod create(obj)

Create a new instance by copying data from the data container passed as obj. The object obj can be:

  • an Align,
  • a Container (but all sequences are required to have the same length if the target type is Align),
  • any iterable type yielding acceptable items (see the documentation for add_samples() for more details).

New in version 2.0.1.

del_columns(site, num=1)

Delete full columns for all ingroup and outgroup samples.

By default (if num=1), remove a single site. If num is larger than 1, remove a range of sites.

Parameters:
  • site – index of the (first) site to remove. This site must be a valid index.
  • num – maximal number of sites to remove. The value cannot be negative.
del_outgroup(index)

Delete an outgroup sample from the instance.

Parameters:index – index of the outgroup sample to delete.
del_sample(index)

Delete an item of the instance.

Parameters:index – index of the sample to delete.
encode(nbits=10, random=None, include_outgroup=False)

Renames all sequences using a random mapping of names to unique keys of length nbits.

Parameters:
  • nbits – length of the keys (encoded names). This value must be >= 4 and <= 63.
  • random – a Random instance to be used as random generator. By default, use a default instance.
  • include_outgroup – a bool indicating whether the outgroup samples should also be considered (in this case, a single mapping is returned including both ingroup and outgroup samples).
Returns:

A dictionary mapping all the generated keys to the actual sequence names. The keys are case-dependent and guaranteed not to start with a number.

The returned mapping can be used to restore the original names using rename(). This method is not affected by the presence of sequences with identical names in the original instance (and rename() will also work properly in that case).

New in version 2.0.1.

Changed in version 3.0.0: Keys are forced to start with a capital letter. Now take a Random instance. Use library’s own random number generator. Added an option for the outgroup.
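
The encode/rename round trip can be pictured with the stand-alone sketch below. It uses Python's random module rather than EggLib's generator, and the key alphabet (ASCII letters and digits after a leading capital letter) is an assumption; encode_names is a hypothetical helper.

```python
import random
import string


def encode_names(names, nbits=10, rng=None):
    """Return (encoded names, mapping of key -> original name)."""
    if not 4 <= nbits <= 63:
        raise ValueError("nbits must be >= 4 and <= 63")
    rng = rng or random.Random()
    alphabet = string.ascii_letters + string.digits
    mapping = {}
    encoded = []
    for name in names:
        while True:
            # Keys start with a capital letter, hence never with a digit.
            key = rng.choice(string.ascii_uppercase) + "".join(
                rng.choice(alphabet) for _ in range(nbits - 1))
            if key not in mapping:   # keys must be unique
                break
        mapping[key] = name
        encoded.append(key)
    return encoded, mapping


names = ["sample A", "sample B", "sample A"]   # duplicated names are fine
encoded, mapping = encode_names(names, nbits=8, rng=random.Random(1))
restored = [mapping[key] for key in encoded]
```

Because every sample receives its own key, the original names can be restored even when several samples share a name.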

extract(*args)

Extract given positions (or columns) of the alignment and returns a new alignment.

The two possible ways to call this method are:

extract(start, stop) to extract a continuous range of sites,

and

extract(indexes) to extract an arbitrary list of positions (in any order).
Parameters:
  • start – first position to extract. This position must be a valid index for this alignment.
  • stop – stop position for the range to extract. This position is not extracted. If this position is equal to or smaller than start, empty sequences are extracted. If this position is equal to or larger than the length of the alignment, or if it is equal to None, all positions until the end of the alignment are extracted.
  • indexes – a list (or other iterable type with a length) of alignment positions (or column indexes). This list may contain repetitions and does not need to be sorted. The positions will be extracted in the specified order.

Keyword arguments are not supported.

New in version 2.0.1.
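
The two call forms and their boundary rules can be sketched on plain strings. This is an illustration of the rules stated above, not EggLib's code; the function takes the alignment rows as a hypothetical first argument.

```python
def extract(rows, *args):
    """Extract columns from equal-length strings, in either call form."""
    length = len(rows[0]) if rows else 0
    if len(args) == 2:
        start, stop = args
        if not 0 <= start < length:
            raise IndexError("start must be a valid index")
        if stop is None or stop > length:
            stop = length                  # extract until the end
        if stop <= start:
            return ["" for _ in rows]      # empty sequences
        return [row[start:stop] for row in rows]
    if len(args) == 1:
        (indexes,) = args
        # Any order, repetitions allowed; the requested order is kept.
        return ["".join(row[i] for i in indexes) for row in rows]
    raise TypeError("expected extract(start, stop) or extract(indexes)")


rows = ["ACGTAC", "TTGTAA"]
window = extract(rows, 1, 4)
tail = extract(rows, 2, None)
picked = extract(rows, [5, 0, 0])
```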

filter(ratio, valid='ACGTacgt', ingroup=True, outgroup=True, relative=True)

Removes the sequences with too few valid sites. This method modifies the current instance and returns None.

Parameters:
  • ratio – limit threshold, expressed as a proportion of either the maximum number of valid data over all processed samples (if the relative argument is True) or the alignment length (otherwise).
  • valid – a string or an iterable of one-character strings or integers giving the allelic values considered to be valid (note that the comparisons are case-dependent).
  • ingroup – A boolean indicating whether ingroup samples must be processed.
  • outgroup – A boolean indicating whether outgroup samples must be processed.

If the length of the alignment is 0, or if both ingroup and outgroup are False, nothing is done.
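
The effect of the relative argument on the threshold can be illustrated as below. This sketch returns a new list instead of modifying an instance, and it assumes that a sample is kept when its number of valid sites is at least ratio times the reference value; the documentation above does not say whether the comparison is strict.

```python
def filter_samples(samples, ratio, valid="ACGTacgt", relative=True):
    """Keep (name, sequence) pairs with enough valid sites."""
    if not samples or len(samples[0][1]) == 0:
        return list(samples)               # nothing to do
    counts = [sum(1 for c in seq if c in valid) for _, seq in samples]
    # Reference: best sample (relative) or full alignment length (absolute).
    reference = max(counts) if relative else len(samples[0][1])
    threshold = ratio * reference
    return [s for s, n in zip(samples, counts) if n >= threshold]


samples = [("a", "ACGT--NN"), ("b", "AC------")]
kept_relative = filter_samples(samples, 0.5)                  # threshold 2
kept_absolute = filter_samples(samples, 0.5, relative=False)  # threshold 4
```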

find(name, include_outgroup=False, regex=False, multi=False, flags=None, index=False)

Find a sample by its name.

Parameters:
  • name – name of sample to identify.
  • include_outgroup – a boolean indicating whether the outgroup should be considered.
  • regex – a boolean indicating whether the value passed as name is a regular expression. If so, the string is passed as is to the re module (using function re.search()). Otherwise, only exact matches will be considered.
  • multi – a boolean indicating whether all hits should be returned. If so, a list of SampleView instances is always returned (the list will be empty in case of no hits). Otherwise, a single SampleView instance (or its index) will be returned for the first hit, or None in case of no hits.
  • flags – list of flags to be passed to re.search() (ignored if regex is False). For example, when looking for samples containing the term “sample”, for being case insensitive, use the following syntax: align.find("sample", regex=True, flags=[re.I]). By default (None) no further argument is passed.
  • index – boolean indicating whether the index of the sample should be returned. In that case return values for hits are int (by default, SampleView) instances. Warning: it is not allowed to set both include_outgroup and index to True as there would be no way to distinguish between ingroup and outgroup indexes.
Returns:

If multi is True, a (potentially empty) list of SampleView instances or int indexes. Otherwise, a single SampleView or int, or None if no hit was found.

find_motif(sample, motif, start=0, stop=None)

Locate the first instance of a motif for an ingroup sample.

Returns the index of the first exact hit to a given substring. The returned value is the position of the first base of the hit. Only exact matches are implemented. To use regular expressions (for example to find degenerate motifs), one should extract the string for the sequence and use a tool such as the regular expression module (re).

Parameters:
  • sample – sample index.
  • motif – a list of integers or one-character strings (or a mix of both), or a string, constituting the motif to search for.
  • start – position at which to start searching. The method will never return a value smaller than start. By default, search from the start of the sequence.
  • stop – position at which to stop searching (the motif cannot overlap this position). No returned value will be larger than stop minus the length of the motif. By default, or if stop is equal to or larger than the length of the sequence, search until the end of the sequence.
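
The start and stop bounds behave like the bounds of Python's str.find(), which the sketch below uses on a plain string. It is an illustration only: the text above does not say what is returned when there is no hit, so returning None is an assumption.

```python
def find_motif(sequence, motif, start=0, stop=None):
    """Index of the first exact hit lying entirely within [start, stop)."""
    if stop is None or stop > len(sequence):
        stop = len(sequence)
    # str.find only reports hits that end at or before `stop`.
    hit = sequence.find(motif, start, stop)
    return None if hit == -1 else hit


seq = "AACGTACGT"
first = find_motif(seq, "CGT")
second = find_motif(seq, "CGT", start=3)
bounded = find_motif(seq, "CGT", start=3, stop=8)   # second hit overlaps stop
```
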
find_motif_o(sample, motif, start=0, stop=None)

Locate the first instance of a motif for an outgroup sample.

Returns the index of the first exact hit to a given substring. The returned value is the position of the first base of the hit. Only exact matches are implemented. To use regular expressions (for example to find degenerate motifs), one should extract the string for the sequence and use a tool such as the regular expression module (re).

Parameters:
  • sample – outgroup sample index.
  • motif – a list of integers or one-character strings (or a mix of both), or a string, constituting the motif to search for.
  • start – position at which to start searching. The method will never return a value smaller than start. By default, search from the start of the sequence.
  • stop – position at which to stop search (the motif cannot overlap this position). No returned value will be larger than stop minus the length of the motif. By default, or if stop is equal to or larger than the length of the sequence, search until the end of the sequence.
fix_ends()

Designed for nucleotide sequence alignments. Replaces all leading and trailing occurrences of the alignment gap symbol (the numeric equivalent of -) by missing data symbols (?). Internal alignment gaps (those having at least one character other than - and ? at each side) are left unchanged.

Changed in version 3.0.0: Renamed from fix_gap_ends() to fix_ends().
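
The distinction between terminal and internal gaps can be sketched on a plain string. This stand-alone function illustrates the rule described above and is not EggLib's implementation.

```python
def fix_ends(sequence):
    """Turn leading and trailing '-' into '?', leaving internal gaps alone."""
    chars = list(sequence)
    informative = [i for i, c in enumerate(chars) if c not in "-?"]
    if informative:
        first, last = informative[0], informative[-1]
    else:
        first, last = len(chars), -1     # no informative site at all
    for i, c in enumerate(chars):
        # Only gaps outside the informative range are terminal.
        if (i < first or i > last) and c == "-":
            chars[i] = "?"
    return "".join(chars)


fixed = [fix_ends(s) for s in ("--A-C?-", "-?-AC", "ACGT")]
```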

get_i(sample, site)

Get a data entry.

Parameters:
  • sample – sample index.
  • site – site index.
get_label(sample, level)

Get a group label.

Parameters:
  • sample – sample index.
  • level – level index.
get_label_o(sample)

Get an outgroup label.

Parameters:sample – sample index.
get_name(index)

Get the name of a sample.

Parameters:index – sample index.
get_name_o(index)

Get the name of an outgroup sample.

Parameters:index – outgroup sample index.
get_o(sample, site)

Get an outgroup data entry.

Parameters:
  • sample – outgroup sample index.
  • site – site index.
get_outgroup(index)

Get the SampleView instance corresponding to the requested index for the outgroup. The returned object allows modifying the underlying data.

Parameters:index – index of the outgroup sample to access.
get_sample(index)

Get the SampleView instance corresponding to the requested index. The returned object allows modifying the underlying data.

Parameters:index – index of the sample to access.
get_sequence(index)

Access the data entries of a given ingroup sample. Returns a SequenceView instance which allows modifying the underlying sequence container object.

Parameters:index – ingroup sample index.
get_sequence_o(index)

Access the data entries of a given outgroup sample. Returns a SequenceView instance which allows modifying the underlying sequence container object.

Parameters:index – outgroup sample index.
group_mapping(level=0, indices=False)

Generates a dictionary mapping group labels to either SampleView instances (by default) or their indexes, representing all samples of this instance. It can process an ingroup label level or the outgroup.

Parameters:
  • level – index of the level to consider. If None, processes the outgroup.
  • indices – if True, represent samples by their index (within the ingroup or the outgroup depending on level being non-None or None) instead of SampleView instances.
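
The structure of the returned dictionary, in the indices=True case, can be pictured with the sketch below. It works on a plain list of per-sample label lists (a hypothetical stand-in for an instance) and shows that labels need not be consecutive.

```python
def group_mapping(labels, level=0):
    """Map each group label at `level` to the indexes of its samples."""
    mapping = {}
    for index, sample_labels in enumerate(labels):
        mapping.setdefault(sample_labels[level], []).append(index)
    return mapping


# One list of labels per sample, two group levels each.
labels = [[10, 1], [10, 2], [30, 1], [10, 1]]
by_first_level = group_mapping(labels)
by_second_level = group_mapping(labels, level=1)
```
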
insert_columns(position, values)

Insert sites at a given position to an alignment.

Parameters:
  • position – the position at which to insert sites. Sites are inserted before the specified position, so 0 inserts sites at the beginning of the sequence. If position is None, equal to the current length of the alignment, or larger, new sites are inserted at the end of the alignment. The position may be negative to count from the end. Warning: the position -1 means before the last position.
  • values – a list of signed integers, or a string, providing data to insert into the instance. The same sequence will be inserted for all ingroup and outgroup samples.
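
The position rules mirror Python slicing, as the sketch below shows on plain strings. It is only an illustration with the rows passed as a hypothetical first argument; how EggLib treats negative positions beyond the start of the alignment is not stated above and is not modelled here.

```python
def insert_columns(rows, position, values):
    """Insert `values` before `position` in every row."""
    length = len(rows[0]) if rows else 0
    if position is None or position > length:
        position = length                 # append at the end
    # A negative position counts from the end: -1 is before the last site.
    return [row[:position] + values + row[position:] for row in rows]


rows = ["ACGT", "TTTT"]
at_start = insert_columns(rows, 0, "NN")
at_end = insert_columns(rows, None, "NN")
before_last = insert_columns(rows, -1, "NN")
```
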
intersperse(length, positions=None, alleles='A', random=None)

Insert non-varying sites within the alignment. The current object is permanently modified.

Parameters:
  • length – Desired length of the final alignment. If the value is smaller than the original (current) alignment length, nothing is done and the alignment is unchanged.
  • positions – List of positions of sites of this alignment. The number of positions must be equal to the number of sites of the alignment (before interspersing). The argument value must be either a sequence of positive integers or a sequence of real numbers between 0 and 1. In either case, values must be in increasing order. In the former case, the last (maximal) value must be smaller than the desired length of the final alignment. In the latter case, values are expressed relatively, and they will be converted to integer indexes by the method. In that case, if site positioning is non-trivial (typically, if conversion of positions to integers yields identical positions for different consecutive sites), it will be resolved randomly. By default (if None), sites are placed regularly along the final alignment length. If int and float types are mixed, the first occurring type determines how the values are interpreted.
  • alleles – String (or any sequence of one-character strings) providing the alleles to be used to fill non-varying positions of the resulting alignment. If there is more than one allele, the allele will be picked randomly, independently for each inserted site.
  • random – A Random instance to be used as random generator. By default, use a default instance.
is_matrix

True if the instance is an Align, and False if it is a Container.

iter_outgroup()

Iterator over the outgroup.

iter_samples(ingrp, outgrp)

Iterator over either ingroup samples or outgroup samples, or both ingroup and outgroup (in that order) samples. Provided for flexibility.

Parameters:
  • ingrp – include ingroup samples.
  • outgrp – include outgroup samples.
ls

Alignment length. This value cannot be set or modified directly. There is no separate accessor for the number of sites of an individual ingroup or outgroup sample, as they are all equal to this value.

name_mapping()

Generates a dictionary mapping names to SampleView instances representing all samples of this instance. This method is most useful when several sequences have the same name. It may be used to detect and process duplicates. It processes the ingroup only.

names()

Generate the list of ingroup sample names.

names_outgroup()

Generate the list of outgroup sample names.

nexus(prot=False)

Generates a simple nexus-formatted string. If prot is True, adds datatype=protein in the file, allowing it to be imported as proteins (but doesn’t perform further checking).

Returns a nexus-formatted string. Note: any spaces and tabs in sequence names are replaced by underscores. This nexus implementation is minimal but will normally suffice to export sequences to programs expecting nexus.

Note: only the ingroup is exported. The data must be exportable as strings.

ng

Number of group levels for the ingroup (for the outgroup, this number is always one). It is possible to change the value directly.

no

Current number of outgroup samples (cannot be modified directly).

ns

Current number of samples, not considering the outgroup (cannot be modified directly).

phylip(format='I', ingroup=True, outgroup=False)

Returns a phyml-formatted string representing the content of the instance. The phyml format is suitable as input data for PhyML and PAML software. Raises a ValueError if any name of the instance contains at least one character of the following list: ()[]{},; or a space, tab, newline or linefeed. Group labels are never exported. Sequence names cannot be longer than 10 characters. A ValueError will be raised if a longer name is met. format must be ‘I’ or ‘S’ (case-independent), indicating whether the data should be formatted in the sequential (S) or interleaved (I) format (see PHYLIP’s documentation for definitions). The user is responsible for ensuring that all names are unique. If they are not, the exported file may cause subsequent programs to fail.

The sequences must all be convertible into characters.

Parameters:
  • ingroup – A boolean indicating whether the ingroup must be exported.
  • outgroup – A boolean indicating whether the outgroup must be exported.

Changed in version 3.0.0: Added the ingroup and outgroup options.

phyml(ingroup=True, outgroup=False, strict=True, dtype=None)

Returns a phyml-formatted string representing the content of the instance. The phyml format is suitable as input data for the PhyML and PAML programmes. Raises a ValueError if any name of the instance contains at least one character in the following list: ()[]{},; or a space, tab, newline or linefeed. Group information is never exported.

The sequences must all be convertible into characters.

Parameters:
  • ingroup – A boolean indicating whether the ingroup must be exported.
  • outgroup – A boolean indicating whether the outgroup must be exported.
  • strict – enforce that all characters within the instance are valid (assuming either nucleotide or amino acid sequences, see dtype). If False, characters are not checked. Also, enforce that there is no blank character, round bracket, colon, or comma in sequence names. When checking, both ingroup and outgroup are always checked.
  • dtype – one of None (default), nt (nucleotides), or aa (amino acids). Type of data assumed (only if strict is set to True). By default, allow for either nucleotides or amino acids (but not a combination of both).

Changed in version 3.0.0: Added the ingroup, outgroup, strict, and dtype options.

random_missing(rate, ch='N', valid='ACGTacgt', ingroup=True, outgroup=True, random=None)

Randomly introduces missing data in the current instance. Random positions of the alignment are changed to missing data. Only data that are currently non-missing data are considered.

Parameters:
  • rate – probability that a non-missing data entry (as defined by the valid argument) is turned into missing data.
  • ch – missing data character, to be used for all replacements (as a single-character string or an integer).
  • valid – a string or list of integers (or any iterable, possibly mixing integers and one-character strings) giving the allele values that may be turned into missing data.
  • ingroup – A boolean indicating whether the ingroup must be processed.
  • outgroup – A boolean indicating whether the outgroup must be processed.
  • random – a Random instance. If None, the method will use a default instance.

Changed in version 2.1.0: Restricted to Align instances.

Changed in version 3.0.0: Takes the ch, valid, ingroup, outgroup and random arguments. Reimplementation dropping dependency on the non-default numpy package.
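
The roles of rate, ch and valid can be illustrated with the stand-alone sketch below. It uses Python's random module rather than EggLib's generator and returns new strings instead of modifying an instance, so it should be read as a sketch of the idea only.

```python
import random


def random_missing(rows, rate, ch="N", valid="ACGTacgt", rng=None):
    """Replace each valid allele by `ch` with probability `rate`."""
    rng = rng or random.Random()
    return [
        # Alleles outside `valid` (already missing) are never touched.
        "".join(ch if c in valid and rng.random() < rate else c for c in row)
        for row in rows
    ]


rows = ["ACGT-?", "acgtNN"]
untouched = random_missing(rows, 0.0)
all_missing = random_missing(rows, 1.0)
partial = random_missing(rows, 0.5, rng=random.Random(42))
```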

remove_duplicates()

Remove all duplicates, based on name exact matching. For all pairs of samples with identical name, only the one occurring first is conserved. The current instance is modified and this method returns None.

rename(mapping, liberal=False, include_outgroup=False)

Rename sequences of the instance using the provided mapping.

Parameters:
  • mapping – a dict providing the mapping of old names (as keys) to new names (which may, if needed, contain duplicates).
  • liberal – if this argument is False and a name does not appear in mapping, a ValueError is raised. If liberal is True, names that don’t appear in mapping are left unchanged.
  • include_outgroup – if True, consider both ingroup and outgroup samples together (they should be provided together in the same mapping).
Returns:

The number of samples that have been actually renamed, overall.

New in version 2.0.1.

Changed in version 3.0.0: Added an option for the outgroup. Added return value.

reserve(nsam=0, nout=0, lnames=0, ngrp=0, nsit=0)

Pre-allocate memory. This method can be used when the size of arrays is known a priori, in order to speed up memory allocations. It is not necessary to set all values. Values less than 0 are ignored.

Parameters:
  • nsam – number of samples in the ingroup.
  • nout – number of samples in the outgroup.
  • lnames – length of sample names.
  • ngrp – number of group labels.
  • nsit – number of sites.
reset()

Reset the instance.

set_i(sample, site, value)

Set an ingroup data entry. The value must be a signed integer or a one-character string. In the latter case, ord() is called on the behalf of the user.

Parameters:
  • sample – sample index.
  • site – site index.
  • value – allele value.
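
Since all data are stored as signed integers, a one-character string is converted with ord(). The helper below is a hypothetical illustration of that conversion, not part of the library.

```python
def to_allele(value):
    """Convert a one-character string or an integer to an integer allele."""
    if isinstance(value, str):
        if len(value) != 1:
            raise ValueError("string alleles must be a single character")
        return ord(value)
    return int(value)


row = [to_allele(v) for v in ["A", 67, "-", -1]]
# Only non-negative values can be converted back to characters.
decoded = "".join(chr(v) for v in row if v >= 0)
```
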
set_label(sample, level, value)

Set a group label.

Parameters:
  • sample – sample index.
  • level – level index.
  • value – new group value (unsigned integer).
set_label_o(sample, value)

Set an outgroup label.

Parameters:
  • sample – sample index.
  • value – new group value (unsigned integer).
set_name(index, name)

Set the name of a sample.

Parameters:
  • index – index of the sample
  • name – new name value.
set_name_o(index, name)

Set the name of an outgroup sample.

Parameters:
  • index – index of the outgroup sample
  • name – new name value.
set_o(sample, site, value)

Set an outgroup data entry. The value must be a signed integer or a one-character string. In the latter case, ord() is called on the behalf of the user.

Parameters:
  • sample – sample index.
  • site – site index.
  • value – allele value.
set_outgroup(index, name, data, group=None)

Set the values for a sample of the outgroup.

Parameters:
  • index – index of the sample to access (slices are not permitted).
  • name – new name of the sample.
  • data – string or list of integers giving the new values to set. In case of an Align, it is required to pass a sequence with length matching the number of sites of the instance.
  • group – a positive integer to serve as group label (the default value is 0).
set_sample(index, name, data, groups=None)

Set the values for the sample corresponding to the requested index.

Parameters:
  • index – index of the sample to access (slices are not permitted).
  • name – new name of the sample.
  • data – string or list of integers giving the new values to set. In case of an Align, it is required to pass a sequence with length matching the number of sites of the instance.
  • groups – a list of integer label values, or a single integer value, to set as group labels. The class will ensure that all samples have the same number of group labels, padding with 0 as necessary. The default corresponds to an empty list.
set_sequence(index, value)

Replace all data entries for a given ingroup sample by new values.

Parameters:
  • index – ingroup sample index.
  • value – can be a SequenceView instance, a string or a list of integers (or any iterable with a length which may mix single-character strings and integers). When modifying an Align instance, all sequences must have the same length as the current alignment length.
set_sequence_o(index, value)

Replace all data entries for a given outgroup sample by new values.

Parameters:
  • index – outgroup sample index.
  • value – can be a SequenceView instance, a string or a list of integers (or any iterable with a length which may mix single-character strings and integers). When modifying an Align instance, all sequences must have the same length as the current alignment length.
shuffle(level=0, random=None)

Shuffle group labels.

Randomly reassigns group labels. Modifies the current instance and returns None. Only the specified level is affected, and only the group labels are modified (the order of samples is not changed).

Parameters:
  • level – index of the group level to shuffle. To shuffle the outgroup’s labels, set level to None.
  • random – A Random instance to be used as random generator. By default, use a default instance.

Changed in version 3.0.0: The outgroup is necessarily processed separately from the rest of the instance. Allows passing a Random instance. Uses the library’s own random number generator.

slider(wwidth, wstep)

Provides a means to perform sliding-window analysis over the alignment. This method returns a generator that can be used as in for window in align.slider(wwidth, wstep), where each window of the iteration is a reference to an Align instance of length wwidth (or less if not enough sequence is available near the end of the alignment). Each step moves forward by wstep sites.

Changed in version 3.0.0: The returned Align is actually a reference to the same object, which is stored within the instance and repeatedly returned.
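
Because the same object is returned at every step, windows must be processed or copied inside the loop rather than collected for later use. The sketch below reproduces this behaviour with a plain list used as a reused buffer. It is not EggLib's code, and the assumption that windows keep starting every wstep sites until the end of the sequence is only one reading of the description above.

```python
def slider(sequence, wwidth, wstep):
    """Yield sliding windows, reusing a single buffer object."""
    window = []                             # one object for all steps
    for start in range(0, len(sequence), wstep):
        window[:] = sequence[start:start + wwidth]
        yield window                        # always the same list


seq = list("ACGTACGTAC")

# Collecting the yielded objects keeps only references to the same buffer.
naive = list(slider(seq, 4, 3))

# Processing (or copying) each window inside the loop works as expected.
copies = ["".join(w) for w in slider(seq, 4, 3)]
```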

subgroup(groups, outgroup_all=False)

Generate and return a copy of the instance with only samples from the specified groups (identified by their group labels, including outgroup).

Parameters:
  • groups – an integer or a dictionary providing the labels of the groups that are selected. If an integer, it is understood as an ingroup label corresponding to the first grouping level (if more than one). If a dictionary, the keys of this dictionary must be integers corresponding to grouping levels (or None for the outgroup) and the values must be lists containing the requested labels. It is not required to include all levels and the outgroup in the dictionary. It is allowed to specify labels that are not actually represented in the data.
  • outgroup_all – if True, always include all outgroup samples in the returned instance, regardless of whether their group labels are specified in the first argument.

New in version 3.0.0.

subset(samples, outgroup=None)

Generate and return a copy of the instance with only a specified list of samples. It is possible to select ingroup and/or outgroup samples and the sample indexes are not required to be consecutive.

Parameters:
  • samples – a list (or other iterable type with a length) of sample indexes giving the list of ingroup samples that must be exported to the return value object. If None, do not export any ingroup samples.
  • outgroup – a list (or other iterable type with a length) of sample indexes giving the list of outgroup samples that must be exported to the return value object. If None, do not export any outgroup samples.

New in version 3.0.0: Established as a method for Align and Container.

to_fasta(fname=None, first=0, last=4294967295, mapping=None, groups=False, shift_labels=False, include_outgroup=True, linelength=50)

Export alignment in the fasta format.

Parameters:
  • fname – Name of the file to export data to. By default, the file is created (or overwritten if it already exists). If the option append is True, data is appended at the end of the file (and it must exist). If fname is None (default), no file is created and the formatted data is returned as a str. In the alternative case, nothing is returned.
  • first – If only part of the sequences should be exported: index of the first sequence to export.
  • last – If only part of the sequences should be exported: index of the last sequence to export. If the value is larger than the index of the last sequence, all sequences are exported until the last (this is the default). If last < first, no sequences are exported.
  • mapping – A string providing the character mapping. Use the specified list of characters to map integer allelic values. If a non-empty string is provided, the length of the string must be larger than the largest possible allele value. In that case, the allele values will be used as indexes in order to determine which character from this string must be used for outputting. If this method is used with an empty string, the mapping will not be used and the allele values will be cast directly to characters.
  • groups – A boolean indicating whether group labels should be exported, or ignored.
  • shift_labels – A boolean indicating whether group labels should be incremented by one unit when exporting.
  • include_outgroup – A boolean indicating whether the outgroup should be exported. It will be exported at the end of the ingroup, and without discriminating label if include_labels is False.
  • linelength – The length of lines for internal breaking of sequences.
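
The mapping rule described for the mapping argument can be sketched in plain Python. This is an illustration only, not EggLib's implementation: the helper name alleles_to_text is hypothetical, and raising ValueError for out-of-range values is an assumption.

```python
def alleles_to_text(alleles, mapping=""):
    """Convert integer allele values to a string (hypothetical helper)."""
    if mapping:
        # Each allele value is used as an index into the mapping string,
        # so the string must be longer than the largest allele value.
        for allele in alleles:
            if allele < 0 or allele >= len(mapping):
                # Assumption: out-of-range values are treated as an error.
                raise ValueError("allele value outside of the mapping string")
        return "".join(mapping[allele] for allele in alleles)
    # Without a mapping, allele values are cast directly to characters.
    return "".join(chr(allele) for allele in alleles)


with_mapping = alleles_to_text([0, 1, 2, 3, 0], mapping="ACGT")
without_mapping = alleles_to_text([65, 67, 71, 84])
```

In the first call the values 0 to 3 index into "ACGT"; in the second the values are interpreted as character codes.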
to_outgroup(index, label=0)

Transfer a sample to the outgroup.

Parameters:
  • index – index of the sample to move to the outgroup.
  • label – label to assign to the sample (usually, this label assigns the sample to an individual).

Container

class egglib.Container

Bases: egglib._interface.DataBase

Holds a data set with associated sample names and group information. The data consists of given numbers of ingroup and outgroup samples, and each of those may have a different number of sites. There can be any number of group levels (but this number must be the same for all samples), meaning that samples can be described by several group labels in addition to their name. Group labels are not group indices (they do not need to be consecutive). There is a separate data set for samples belonging to the outgroup. There can be any number of outgroup samples. Outgroup samples always have one level of group labels that should be used to specify individuals (when appropriate). All data are represented by signed integers.

Default instance is empty (0 samples and 0 sites).

New in version 3.0.0: Reimplementation of the Container class.

add_outgroup(name, data, group=None)

Add an outgroup sample to the instance.

Parameters:
  • name – name of the new sample.
  • data – an iterable (may be a string or a list) containing the data to set for the new sample. For an Align instance, if the sample is not the first, the number of data entries must match that of any previous samples (ingroup or outgroup).
  • group – if not None, must be an unsigned integer (the default value is 0).
add_sample(name, data, groups=None)

Add a sample to the instance.

Parameters:
  • name – name of the new sample.
  • data – an iterable (may be a string or a list) containing the data to set for the new sample. For an Align instance, if the sample is not the first, the number of data entries must match that of any previous samples (including outgroup samples, if any).
  • groups – if not None, must be an iterable with an unsigned integer for each group level of the instance (if None, all group labels are set to 0), or a single integer value if only one level needs to be specified.
add_samples(items)

Add several samples at the end of the instance.

Parameters:items – items must be an iterable that has a length (for example a list, an Align instance or a Container instance). Each item of items must be of length 2 or 3. The first element must be the sample name string, the second element is the data values (as a list of signed integers or a single string) and the third element, if provided, is the list of group labels (one unsigned integer for each level). See the method add_sample() for more details about each added item. If the current instance is an Align, all added items must have the same length, which must be the same as that of any items currently present in the instance. For a Container, the items may have different lengths. The number of group levels is set by the sample with the largest number of group labels, and all samples with fewer labels are completed with the default value (0).

New in version 2.0.1: Original name is addSequences().

Changed in version 3.0.0: Renamed as add_samples. All items are added in one shot (rather than calling the one-sample add method iteratively).
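
The way group labels are completed in add_samples() can be sketched in plain Python. This is only an illustration of the rule stated above (items of length 2 or 3, missing labels filled with 0); pad_group_labels is a hypothetical helper, not part of EggLib.

```python
def pad_group_labels(items):
    """Return (name, data, labels) triples, labels padded with 0 (hypothetical)."""
    normalized = []
    for item in items:
        if len(item) == 2:
            name, data = item
            labels = []
        elif len(item) == 3:
            name, data, labels = item
            labels = list(labels)
        else:
            raise ValueError("each item must have 2 or 3 elements")
        normalized.append((name, data, labels))
    # The number of levels is set by the sample with the most labels.
    num_levels = max((len(labels) for _, _, labels in normalized), default=0)
    return [
        (name, data, labels + [0] * (num_levels - len(labels)))
        for name, data, labels in normalized
    ]


padded = pad_group_labels([
    ("s1", "ACGT", [1, 4]),  # two group levels
    ("s2", "ACGA", [2]),     # one level, completed with 0
    ("s3", "ACCA"),          # no labels, completed with 0 twice
])
```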

clear()

Clear the instance and release all memory. In most cases, it is preferable to use the method reset().

classmethod create(obj)

Create a new instance by copying data from the data container passed as obj. The object obj can be:

  • an Align,
  • a Container (but all sequences are required to have the same length if the target type is Align),
  • any iterable type yielding acceptable items (see the documentation for add_samples() for more details).

New in version 2.0.1.

del_outgroup(index)

Delete an outgroup sample of the instance.

Parameters:index – index of the outgroup sample to delete.
del_sample(index)

Delete an item of the instance.

Parameters:index – index of the sample to delete.
del_sites(sample, site, num=1)

Delete data entries from an ingroup sample.

By default (if num=1), remove a single site. If num is larger than 1, remove a range of sites.

Parameters:
  • sample – ingroup sample index.
  • site – index of the (first) site to remove. This site must be a valid index.
  • num – maximal number of sites to remove. The value cannot be negative.
del_sites_o(sample, site, num=1)

Delete data entries from an outgroup sample.

By default (if num=1), remove a single site. If num is larger than 1, remove a range of sites.

Parameters:
  • sample – outgroup sample index.
  • site – index of the (first) site to remove. This site must be a valid index.
  • num – maximal number of sites to remove. The value cannot be negative.
encode(nbits=10, random=None, include_outgroup=False)

Renames all sequences using a random mapping of names to unique keys of length nbits.

Parameters:
  • nbits – length of the keys (encoded names). This value must be >= 4 and <= 63.
  • random – a Random instance to be used as a random generator. By default, use a default instance.
  • include_outgroup – a bool indicating whether the outgroup samples should also be considered (in this case, a single mapping is returned including both ingroup and outgroup samples).
Returns:

A dictionary mapping all the generated keys to the actual sequence names. The keys are case-sensitive and guaranteed not to start with a number.

The returned mapping can be used to restore the original names using rename(). This method is not affected by the presence of sequences with identical names in the original instance (and rename() will also work properly in that case).

New in version 2.0.1.

Changed in version 3.0.0: Keys are forced to start with a capital letter. Now take a Random instance. Use library’s own random number generator. Added an option for the outgroup.
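
The encode/restore round trip can be pictured with the following plain-Python sketch. It does not reproduce EggLib's key generation: the helper encode_names, the key alphabet, and the use of the standard library's random.Random in place of EggLib's Random are all assumptions made for illustration. It only shows why unique keys allow restoring names even when the original names contain duplicates.

```python
import random
import string


def encode_names(names, nbits=10, rng=None):
    """Replace names by unique random keys of length nbits (hypothetical)."""
    if not 4 <= nbits <= 63:
        raise ValueError("nbits must be >= 4 and <= 63")
    rng = rng or random.Random()
    alphabet = string.ascii_letters + string.digits  # assumed alphabet
    mapping = {}
    encoded = []
    for name in names:
        while True:
            # Keys start with a capital letter, so never with a number.
            key = rng.choice(string.ascii_uppercase) + "".join(
                rng.choice(alphabet) for _ in range(nbits - 1)
            )
            if key not in mapping:  # keys must be unique
                break
        mapping[key] = name
        encoded.append(key)
    return encoded, mapping


original = ["seq1", "seq2", "seq1"]  # duplicated names are not a problem
encoded, mapping = encode_names(original, nbits=8, rng=random.Random(42))
restored = [mapping[key] for key in encoded]
```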

equalize(value='?')

Extend sequences so that they all have the length of the longest sequence (over both ingroup and outgroup).

Parameters:value – the value to use to extend sequences, as an integer or a single-character string.
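
The effect of equalize() can be sketched on plain strings as follows. The helper pad_to_longest is hypothetical and only mirrors the behaviour described above (every sequence is extended to the length of the longest one using the given value).

```python
def pad_to_longest(sequences, value="?"):
    """Extend all sequences to the length of the longest one (hypothetical)."""
    longest = max((len(seq) for seq in sequences), default=0)
    return [seq + value * (longest - len(seq)) for seq in sequences]


padded = pad_to_longest(["ACGT", "AC", "ACGTTG"])
```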
find(name, include_outgroup=False, regex=False, multi=False, flags=None, index=False)

Find a sample by its name.

Parameters:
  • name – name of sample to identify.
  • include_outgroup – a boolean indicating whether the outgroup should be considered.
  • regex – a boolean indicating whether the value passed as name is a regular expression. If so, the string is passed as is to the re module (using function re.search()). Otherwise, only exact matches will be considered.
  • multi – a boolean indicating whether all hits should be returned. If so, a list of SampleView instances is always returned (the list will be empty in case of no hits). Otherwise, a single SampleView instance (or its index) will be returned for the first hit, or None in case of no hits.
  • flags – list of flags to be passed to re.search() (ignored if regex is False). For example, when looking for samples containing the term “sample”, for being case insensitive, use the following syntax: align.find("sample", regex=True, flags=[re.I]). By default (None) no further argument is passed.
  • index – boolean indicating whether the index of the sample should be returned. In that case return values for hits are int (by default, SampleView) instances. Warning: it is not allowed to set both include_outgroup and index to True as there would be no way to distinguish between ingroup and outgroup indexes.
Returns:

If multi is True, a (potentially empty) list of SampleView instances or int values. Otherwise, a single SampleView or int, or None if no hits were found.
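
The difference between exact and regular expression matching can be sketched with the standard re module on a plain list of names. The helper find_names is hypothetical, and combining the list of flags with a bitwise OR is an assumption about how such a list would be handed to re.search().

```python
import re


def find_names(names, pattern, regex=False, flags=None):
    """Return the indexes of matching names (hypothetical helper)."""
    if not regex:
        # Without regex, only exact matches are considered.
        return [i for i, name in enumerate(names) if name == pattern]
    combined = 0
    for flag in flags or []:
        combined |= flag  # assumption: flags are OR-ed together
    return [
        i for i, name in enumerate(names)
        if re.search(pattern, name, combined)
    ]


names = ["Sample_01", "control", "SAMPLE_02", "sample"]
exact_hits = find_names(names, "sample")
regex_hits = find_names(names, "sample", regex=True, flags=[re.I])
```

With exact matching only the last name is a hit; with a case-insensitive regular expression every name containing "sample" in any case is returned.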

find_motif(sample, motif, start=0, stop=None)

Locate the first instance of a motif for an ingroup sample.

Returns the index of the first exact hit to a given substring. The returned value is the position of the first base of the hit. Only exact matches are implemented. To use regular expressions (for example to find degenerate motifs), one should extract the string for the sequence and use a tool such as the regular expression module (re).

Parameters:
  • sample – sample index.
  • motif – a list of integers or one-character strings (or a mix of both), or a string, constituting the motif to search.
  • start – position at which to start searching. The method will never return a value smaller than start. By default, search from the start of the sequence.
  • stop – position at which to stop search (the motif cannot overlap this position). No returned value will be larger than stop minus the length of the motif. By default, or if stop is equal to or larger than the length of the sequence, search until the end of the sequence.
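
The start and stop semantics can be illustrated with a plain-Python search over a list of integers. The helper search_motif is hypothetical; in particular, returning None when there is no hit is an assumption, since the return value for that case is not stated above.

```python
def search_motif(sequence, motif, start=0, stop=None):
    """Return the position of the first exact hit, or None (hypothetical)."""
    # Mixed integers and one-character strings are normalized to integers.
    seq = [ord(x) if isinstance(x, str) else x for x in sequence]
    mot = [ord(x) if isinstance(x, str) else x for x in motif]
    if stop is None or stop > len(seq):
        stop = len(seq)
    # The motif may not overlap stop: the last candidate is stop - len(mot).
    for pos in range(start, stop - len(mot) + 1):
        if seq[pos:pos + len(mot)] == mot:
            return pos
    return None  # assumption: no hit is reported as None


first_hit = search_motif("AACGTACGT", "CGT")
later_hit = search_motif("AACGTACGT", ["C", 71, "T"], start=3)
no_hit = search_motif("AACGTACGT", "CGT", start=3, stop=8)
```

In the last call the only remaining occurrence starts at position 6 and would overlap position 8, so it is not reported.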
find_motif_o(sample, motif, start=0, stop=None)

Locate the first instance of a motif for an outgroup sample.

Returns the index of the first exact hit to a given substring. The returned value is the position of the first base of the hit. Only exact matches are implemented. To use regular expressions (for example to find degenerate motifs), one should extract the string for the sequence and use a tool such as the regular expression module (re).

Parameters:
  • sample – outgroup sample index.
  • motif – a list of integers or one-character strings (or a mix of both), or a string, constituting the motif to search.
  • start – position at which to start searching. The method will never return a value smaller than start. By default, search from the start of the sequence.
  • stop – position at which to stop search (the motif cannot overlap this position). No returned value will be larger than stop minus the length of the motif. By default, or if stop is equal to or larger than the length of the sequence, search until the end of the sequence.
get_i(sample, site)

Get a data entry.

Parameters:
  • sample – sample index.
  • site – site index.
get_label(sample, level)

Get a group label.

Parameters:
  • sample – sample index.
  • level – level index.
get_label_o(sample)

Get an outgroup label.

Parameters:sample – sample index.
get_name(index)

Get the name of a sample.

Parameters:index – sample index.
get_name_o(index)

Get the name of an outgroup sample.

Parameters:index – outgroup sample index.
get_o(sample, site)

Get an outgroup data entry.

Parameters:
  • sample – outgroup sample index.
  • site – site index.
get_outgroup(index)

Get the SampleView instance corresponding to the requested index for the outgroup. The returned object allows modifying the underlying data.

Parameters:index – index of the outgroup sample to access.
get_sample(index)

Get the SampleView instance corresponding to the requested index. The returned object allows modifying the underlying data.

Parameters:index – index of the sample to access.
get_sequence(index)

Access the data entries of a given ingroup sample. Returns a SequenceView instance which allows modifying the underlying sequence container object.

Parameters:index – ingroup sample index.
get_sequence_o(index)

Access the data entries of a given outgroup sample. Returns a SequenceView instance which allows modifying the underlying sequence container object.

Parameters:index – outgroup sample index.
group_mapping(level=0, indices=False)

Generates a dictionary mapping group labels to either SampleView instances (by default) or their indexes representing all samples of this instance. It can process an ingroup label level or the outgroup.

Parameters:
  • level – index of the level to consider. If None, processes the outgroup.
  • indices – if True, represent samples by their index (within the ingroup or the outgroup depending on level being non-None or None) instead of SampleView instances.
insert_sites(sample, position, values)

Insert sites at a given position for a given ingroup sample.

Parameters:
  • sample – index of the ingroup sample into which to insert sites.
  • position – the position at which to insert sites. Sites are inserted before the specified position, so the user can use 0 to insert sites at the beginning of the sequence. To insert sites at the end of the sequence, pass the current length of the sequence. If position is larger than the length of the sequence or None, new sites are inserted at the end of the sequence. The position may be negative. Warning: the position -1 means before the last position.
  • values – a list of signed integers, or a string, providing data to insert into the instance.
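
The position rules above behave like slice insertion on a Python list, which the following sketch uses to illustrate them. The helper insert_before is hypothetical and works on plain strings rather than on an Align or Container.

```python
def insert_before(sequence, position, values):
    """Insert values before position in a copy of sequence (hypothetical)."""
    result = list(sequence)
    if position is None or position > len(result):
        position = len(result)  # insert at the end
    # Slice assignment inserts before the given position; a negative
    # position counts from the end, so -1 means before the last site.
    result[position:position] = list(values)
    return "".join(result)


at_start = insert_before("ACGT", 0, "NN")
at_end = insert_before("ACGT", 4, "NN")
past_end = insert_before("ACGT", 10, "NN")
before_last = insert_before("ACGT", -1, "NN")
```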
insert_sites_o(sample, position, values)

Insert sites at a given position for a given outgroup sample.

Parameters:
  • sample – index of the outgroup sample into which to insert sites.
  • position – the position at which to insert sites. Sites are inserted before the specified position, so the user can use 0 to insert sites at the beginning of the sequence. To insert sites at the end of the sequence, pass the current length of the sequence. If position is larger than the length of the sequence or None, new sites are inserted at the end of the sequence. The position may be negative. Warning: the position -1 means before the last position.
  • values – a list of signed integers, or a string, providing data to insert into the instance.
is_matrix

True if the instance is an Align, and False if it is a Container.

iter_outgroup()

Iterator over the outgroup.

iter_samples(ingrp, outgrp)

Iterator over either ingroup samples or outgroup samples, or both ingroup and outgroup (in that order) samples. Provided for flexibility.

Parameters:
  • ingrp – include ingroup samples.
  • outgrp – include outgroup samples.
lo(index)

Get the number of sites of an outgroup sample.

Parameters:index – sample index.
ls(index)

Get the number of sites of an ingroup sample.

Parameters:index – sample index.
name_mapping()

Generates a dictionary mapping names to SampleView instances representing all samples of this instance. This method is most useful when several sequences have the same name. It may be used to detect and process duplicates. It processes the ingroup only.

names()

Generate the list of ingroup sample names.

names_outgroup()

Generate the list of outgroup sample names.

ng

Number of group levels for the ingroup (for the outgroup, this number is always one). It is possible to change the value directly.

no

Current number of outgroup samples (cannot be modified directly).

ns

Current number of samples, not considering the outgroup (cannot be modified directly).

remove_duplicates()

Remove all duplicates, based on name exact matching. For all pairs of samples with identical name, only the one occurring first is conserved. The current instance is modified and this method returns None.
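
The keep-first rule can be sketched on a plain list of (name, data) pairs. The helper drop_duplicate_names is hypothetical and, unlike remove_duplicates(), returns a new list instead of modifying an instance in place.

```python
def drop_duplicate_names(samples):
    """Keep only the first sample for each exact name (hypothetical)."""
    seen = set()
    kept = []
    for name, data in samples:
        if name not in seen:  # exact name matching
            seen.add(name)
            kept.append((name, data))
    return kept


unique = drop_duplicate_names([
    ("seq1", "ACGT"),
    ("seq2", "ACGA"),
    ("seq1", "TTTT"),  # same name as the first sample: dropped
])
```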

rename(mapping, liberal=False, include_outgroup=False)

Rename sequences of the instance using the provided mapping.

Parameters:
  • mapping – a dict providing the mapping of old names (as keys) to new names (which may, if needed, contain duplicates).
  • liberal – if this argument is False and a name does not appear in mapping, a ValueError is raised. If liberal is True, names that don’t appear in mapping are left unchanged.
  • include_outgroup – if True, consider both ingroup and outgroup samples together (they should be provided together in the same mapping).
Returns:

The number of samples that have been actually renamed, overall.

New in version 2.0.1.

Changed in version 3.0.0: Added an option for the outgroup. Added return value.
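
The effect of the liberal option can be sketched on a plain list of names. The helper rename_names is hypothetical; counting every name found in the mapping as renamed is an assumption about how the return value is computed.

```python
def rename_names(names, mapping, liberal=False):
    """Return (new names, number of renamed samples) (hypothetical)."""
    renamed = []
    count = 0
    for name in names:
        if name in mapping:
            renamed.append(mapping[name])
            count += 1  # assumption: every mapped name counts as renamed
        elif liberal:
            renamed.append(name)  # unknown names are left unchanged
        else:
            raise ValueError("name not found in mapping: {}".format(name))
    return renamed, count


names = ["a", "b", "c"]
mapping = {"a": "x", "b": "y"}
new_names, num_renamed = rename_names(names, mapping, liberal=True)

try:
    rename_names(names, mapping)  # "c" is missing and liberal is False
    strict_failed = False
except ValueError:
    strict_failed = True
```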

reserve(nsam=0, nout=0, lnames=0, ngrp=0, nsit=0)

Pre-allocate memory. This method can be used when the size of arrays is known a priori, in order to speed up memory allocations. It is not necessary to set all values. Values less than 0 are ignored.

Parameters:
  • nsam – number of samples in the ingroup.
  • nout – number of samples in the outgroup.
  • lnames – length of sample names.
  • ngrp – number of group labels.
  • nsit – number of sites.
reset()

Reset the instance.

set_i(sample, site, value)

Set an ingroup data entry. The value must be a signed integer or a one-character string. In the latter case, ord() is called on the behalf of the user.

Parameters:
  • sample – sample index.
  • site – site index.
  • value – allele value.
set_label(sample, level, value)

Set a group label.

Parameters:
  • sample – sample index.
  • level – level index.
  • value – new group value (unsigned integer).
set_label_o(sample, value)

Set an outgroup label.

Parameters:
  • sample – sample index.
  • value – new group value (unsigned integer).
set_name(index, name)

Set the name of a sample.

Parameters:
  • index – index of the sample
  • name – new name value.
set_name_o(index, name)

Set the name of an outgroup sample.

Parameters:
  • index – index of the outgroup sample
  • name – new name value.
set_o(sample, site, value)

Set an outgroup data entry. The value must be a signed integer or a one-character string. In the latter case, ord() is called on the behalf of the user.

Parameters:
  • sample – sample index.
  • site – site index.
  • value – allele value.
set_outgroup(index, name, data, group=None)

Set the values for a sample of the outgroup.

Parameters:
  • index – index of the sample to access (slices are not permitted).
  • name – new name of the sample.
  • data – string or list of integers giving the new values to set. In case of an Align, it is required to pass a sequence with length matching the number of sites of the instance.
  • group – a positive integer to serve as group label (the default value is 0).
set_sample(index, name, data, groups=None)

Set the values for the sample corresponding to the requested index.

Parameters:
  • index – index of the sample to access (slices are not permitted).
  • name – new name of the sample.
  • data – string or list of integers giving the new values to set. In case of an Align, it is required to pass a sequence with length matching the number of sites of the instance.
  • groups – a list of integer label values, or a single integer value, to set as group labels. The class will ensure that all samples have the same number of group labels, padding with 0 as necessary. The default corresponds to an empty list.
set_sequence(index, value)

Replace all data entries for a given ingroup sample by new values.

Parameters:
  • index – ingroup sample index.
  • value – can be a SequenceView instance, a string or a list of integers (or any iterable with a length which may mix single-character strings and integers). When modifying an Align instance, all sequences must have the same length as the current alignment length.
set_sequence_o(index, value)

Replace all data entries for a given outgroup sample by new values.

Parameters:
  • index – outgroup sample index.
  • value – can be a SequenceView instance, a string or a list of integers (or any iterable with a length which may mix single-character strings and integers). When modifying an Align instance, all sequences must have the same length as the current alignment length.
shuffle(level=0, random=None)

Shuffle group labels.

Randomly reassigns group labels. Modifies the current instance and returns None. Only the specified level is affected, and only the group labels are modified (the order of samples is not changed).

Parameters:
  • level – index of the group level to shuffle. To shuffle the outgroup’s labels, set level to None.
  • random – A Random instance to be used as a random generator. By default, use a default instance.

Changed in version 3.0.0: The outgroup is necessarily processed separately from the rest of the instance. Allows passing a Random instance. Uses the library’s own random number generator.

subgroup(groups, outgroup_all=False)

Generate and return a copy of the instance with only samples from the specified groups (identified by their group labels, including outgroup).

Parameters:
  • groups – an integer or a dictionary providing the labels of the groups that are selected. If an integer, it is understood as an ingroup label corresponding to the first grouping level (if more than one). If a dictionary, the keys of this dictionary must be integers corresponding to grouping levels (or None for the outgroup) and the values must be lists containing the requested labels. It is not required to include all levels and the outgroup in the dictionary. It is allowed to specify labels that are actually not represented in the data.
  • outgroup_all – if True, always include all outgroup samples in the returned instance, regardless of whether their group labels are specified in the first argument.

New in version 3.0.0.

subset(samples, outgroup=None)

Generate and return a copy of the instance with only a specified list of samples. It is possible to select ingroup and/or outgroup samples and the sample indexes are not required to be consecutive.

Parameters:
  • samples – a list (or other iterable type with a length) of sample indexes giving the list of ingroup samples that must be exported to the return value object. If None, do not export any ingroup samples.
  • outgroup – a list (or other iterable type with a length) of sample indexes giving the list of outgroup samples that must be exported to the return value object. If None, do not export any outgroup samples.

New in version 3.0.0: Established as a method for Align and Container.

to_fasta(fname=None, first=0, last=4294967295, mapping=None, groups=False, shift_labels=False, include_outgroup=True, linelength=50)

Export alignment in the fasta format.

Parameters:
  • fname – Name of the file to export data to. By default, the file is created (or overwritten if it already exists). If the option append is True, data is appended at the end of the file (and it must exist). If fname is None (default), no file is created and the formatted data is returned as a str. In the alternative case, nothing is returned.
  • first – If only part of the sequences should be exported: index of the first sequence to export.
  • last – If only part of the sequences should be exported: index of the last sequence to export. If the value is larger than the index of the last sequence, all sequences are exported until the last (this is the default). If last < first, no sequences are exported.
  • mapping – A string providing the character mapping. Use the specified list of characters to map integer allelic values. If a non-empty string is provided, the length of the string must be larger than the largest possible allele value. In that case, the allele values will be used as indexes in order to determine which character from this string must be used for outputting. If this method is used with an empty string, the mapping will not be used and the allele values will be cast directly to characters.
  • groups – A boolean indicating whether group labels should be exported, or ignored.
  • shift_labels – A boolean indicating whether group labels should be incremented by one unit when exporting.
  • include_outgroup – A boolean indicating whether the outgroup should be exported. It will be exported at the end of the ingroup, and without discriminating label if include_labels is False.
  • linelength – The length of lines for internal breaking of sequences.
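
The effect of linelength can be pictured with the following plain-Python sketch, which breaks a sequence into lines of at most linelength characters under a header line. The helper fasta_record is hypothetical and ignores group labels and the outgroup; it is not the code used by to_fasta().

```python
def fasta_record(name, sequence, linelength=50):
    """Format one sample as a fasta record (hypothetical helper)."""
    lines = [">" + name]
    # Break the sequence into chunks of at most linelength characters.
    for i in range(0, len(sequence), linelength):
        lines.append(sequence[i:i + linelength])
    return "\n".join(lines) + "\n"


record = fasta_record("sample1", "ACGT" * 30, linelength=50)
line_lengths = [len(line) for line in record.splitlines()[1:]]
```

A sequence of 120 sites is written on three lines of 50, 50 and 20 characters.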
to_outgroup(index, label=0)

Transfer a sample to the outgroup.

Parameters:
  • index – index of the sample to move to the outgroup.
  • label – label to assign to the sample (usually, this label assigns the sample to an individual).

Tree

class egglib.Tree(fname=None, string=None)

This class handles trees. A tree is a linked collection of nodes which all have one parent (except the ultimate base node) and any number of children.

Nodes are implemented as Node instances. A node without children is a leaf (others are internal). A node with exactly one child is generally meaningless, but is allowed. All nodes (internal nodes as well as leaves) have a label which in the case of leaves can be used as leaf name. It is not possible to apply both a name and a label to a leaf node, in agreement with the newick format. All connections between nodes (branches) are oriented and can have a length (although the lengths can be omitted) but note that labels are applied to nodes, not branches. All Tree instances have at least one base node which is the only one allowed not to have a parent. Network-like structures are not allowed (because nodes must have exactly one parent).

Import and export to/from strings and files are in the bracket-based newick format (the parser treats terminal node labels as strings, and internal node labels as integers or, by default, floats). Tree instances can be exported using the built-in str function, and the method newick(). Nodes also have a newick() method.

Tree instances are iterable. Three iterators are provided: one is depth-first (depth_iter()), another is breadth-first (breadth_iter()) and one iterates on terminal nodes (leaves) only (iter_leaves()).

The instance can be initialized as an empty tree (with only a root node), or from a newick-formatted string. By default, the string is read from the file name passed as the fname argument to the constructor, but it can be passed directly through the constructor argument string. It is not allowed to set both fname and string at the same time. The newick parser expects a well-formed newick string (including the trailing semicolon).

Changed in version 2.0.1: (This change concerns the constructor.) Imports directly from a file. If a string is passed, it is interpreted as a file name by default.

Changed in version 3.0.0: Several interface changes. Trees are not allowed to be networks at all anymore.

add_node(parent, label=None, brlen=None)

Add a node to the tree.

Parameters:
  • parent – one of the nodes of this instance, as a Node reference.
  • label – node label which will be internal node label or leaf name according to the final structure of the tree. The new node has initially no children and is therefore a leaf until it is itself connected to a child (if ever).
  • brlen – length of the branch connecting parent to the new node.
Returns:

The new node as a Node reference.

base

Basal node of the tree (if the tree is unrooted, it is a trifurcation whose location should be considered as arbitrary; if the tree is rooted, it is the root). This attribute is a Node instance which can be modified, but it cannot be replaced.

breadth_iter(start=None)

Return a breadth-first iterator. Iterate over the Node instances of the tree, starting from the base and then following a breadth-first order.

Parameters:start – start point of the iteration, as a Node instance of this tree. By default, start from the base of the tree.
clean_branch_lengths()

Remove all branch lengths. In practice, they are set to None.

clean_internal_labels()

Remove all internal node labels, including the base of the tree. In practice, they are set to None.

collapse(node, ignore_len=False, ignore_label=False)

Collapse a branch of the tree.

Parameters:
  • node – Node representing the branch to remove.
  • ignore_len – don’t try to transfer branch lengths to children.
  • ignore_label – don’t transfer label of the destroyed node to its parent.

node represents the branch that must be removed from the tree (this node is destroyed in the process). It must be one of the nodes contained in the tree (as a Node instance), but not the base of the tree. It cannot be a terminal node (leaf).

If ignore_label is not set to True, the label of the destroyed node is transferred to the parent based on the following procedure: (1) if the destroyed node’s label is None, nothing is done; (2) if the destroyed node’s parent’s label is None, the destroyed node’s label is copied to its parent as is; (3) otherwise both labels are converted to strings (if they are not yet) and concatenated as in the string a;b (where a is the parent’s label and b is the destroyed node’s label), even if the two labels are identical.
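
The three-step label transfer rule can be written as a small plain-Python function. The helper merged_label is hypothetical; it only restates the procedure above and returns the label that the parent would carry after the collapse.

```python
def merged_label(parent_label, node_label):
    """Label of the parent after collapsing a node (hypothetical helper)."""
    if node_label is None:
        return parent_label  # (1) nothing is done
    if parent_label is None:
        return node_label    # (2) the label is copied as is
    # (3) both labels are converted to strings and joined by a semicolon,
    # even if they are identical.
    return "{};{}".format(parent_label, node_label)


case_1 = merged_label(0.8, None)
case_2 = merged_label(None, 0.9)
case_3 = merged_label(0.8, 0.9)
case_3_identical = merged_label("a", "a")
```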

If ignore_len is not set to True, and if the length of the removed branch (branch from the specified node to its parent) is specified, it will be spread equally among the branches to its children (see example below). This requires that the branch length to all children are specified. If the removed branch has no specified length, nothing is done.

Collapsing node [4] on the following tree:

 /------------------------------------------->[1]
 |
 |             /----------------------------->[3]
 |             |
 |----------->[2]             /-------------->[5]
 |             |              |
 |             \------------>[4]
[0]                           |
 |                            \-------------->[6]
 |
 |              /---------------------------->[8]
 |              | 
 \------------>[7]            /------------->[10]
                |             |
                \----------->[9]
                              |
                              \------------->[11]

will generate the following tree, with the correction of edge lengths as depicted:

 /------------------------------------------->[1]
 |
 |             /----------------------------->[3]
 |             |
 |----------->[2]
 |             |
 |             |-------------------->[5]        L5 = L5+L4/2
[0]            |
 |             \-------------------->[6]        L6 = L6+L4/2
 |
 |              /---------------------------->[8]
 |              | 
 \------------>[7]            /------------->[10]
                |             |
                \----------->[9]
                              |
                              \------------->[11]

Although the total edge length of the tree is not modified, the relationships will be altered: the distance between the descendants of the collapsed node (nodes 5 and 6 in the example above) will be artificially increased.
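
The branch length correction can be checked with a short plain-Python sketch. The helper spread_branch_length is hypothetical, and raising ValueError when a child branch length is missing is an assumption. The numbers stand for L4, L5 and L6 of the example above.

```python
def spread_branch_length(removed_length, child_lengths):
    """Spread the removed branch equally among children (hypothetical)."""
    if removed_length is None:
        return list(child_lengths)  # nothing is done
    if any(length is None for length in child_lengths):
        # Assumption: missing child branch lengths are treated as an error.
        raise ValueError("all child branch lengths must be specified")
    share = removed_length / len(child_lengths)
    return [length + share for length in child_lengths]


L4, L5, L6 = 1.0, 2.0, 3.5
new_L5, new_L6 = spread_branch_length(L4, [L5, L6])

total_before = L4 + L5 + L6
total_after = new_L5 + new_L6
distance_before = L5 + L6          # path between nodes 5 and 6
distance_after = new_L5 + new_L6   # same path after collapsing node 4
```

The total length is unchanged (6.5 in both cases), while the distance between nodes 5 and 6 grows from 5.5 to 6.5, that is, by the length of the removed branch.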

copy(node=None)

Create a new instance of Tree that is a deep copy of a subtree of the current tree.

Parameters:node – a Node instance (one of the nodes of the current tree) at the base of the subtree that should be copied. By default, or if None or the base of the tree is passed, the whole tree is copied. It is not allowed to pass a leaf.
depth_iter(start=None)

Return a depth-first iterator. Iterate over the Node instances of the tree, starting from the base and then following a depth-first order.

Parameters:start – start point of the iteration, as a Node instance of this tree. By default, start from the base of the tree.
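
The difference between depth-first and breadth-first orders can be illustrated on the tree drawn in the documentation of collapse(), represented here as a plain dictionary of children. This sketch does not use Tree or Node; the pre-order convention and the order in which children are visited are assumptions.

```python
from collections import deque

# Children of each internal node of the example tree shown for collapse().
CHILDREN = {0: [1, 2, 7], 2: [3, 4], 4: [5, 6], 7: [8, 9], 9: [10, 11]}


def depth_first(start=0):
    """Visit a node, then each of its subtrees in turn."""
    stack = [start]
    while stack:
        node = stack.pop()
        yield node
        # Push children in reverse so the first child is visited first.
        stack.extend(reversed(CHILDREN.get(node, [])))


def breadth_first(start=0):
    """Visit nodes level by level, starting from the given node."""
    queue = deque([start])
    while queue:
        node = queue.popleft()
        yield node
        queue.extend(CHILDREN.get(node, []))


depth_order = list(depth_first())
breadth_order = list(breadth_first())
subtree_order = list(depth_first(start=7))
```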
extract(node, label=None)

Remove a subtree of the current tree and return it as a new instance of Tree. All nodes of the subtree descending from the requested node will now belong to the new tree. The label of the node at the base of the selected clade is deleted. In the original tree, the extracted clade is replaced by a terminal node which has, by default, the label of the node at the base of the extracted clade, or the label passed as argument.

Parameters:
  • node – a Node instance (one of the nodes of the current tree) at the base of the subtree that should be extracted. It is not allowed to pass a leaf or the base of the tree.
  • label – label to assign to the terminal node that is introduced to replace the extracted clade in the original tree. By default (or if None), use the label of the node passed as first argument (note that this label should in principle be a number).
find_clade(names, ancestral=False, both_sides=False)

Check whether a group is one of the clades defined by the tree.

The leaf names must be provided as a list, or any other iterable (ideally, a set). Leaf names are normally str instances. All leaves must be present in the tree.

With the option ancestral=False, search for the clade whose descendants are exactly the provided list of names: there must not be any other name among the descendants. If the tree is unrooted and the representation uses a base within the clade, the clade will not be detected by default. It is possible (with the option both_sides=True) to also search for the complement of the clade, thereby detecting the right clade even if it spans the base of the tree. Returns None if the clade is not found.

With the option ancestral=True, search for the most recent common ancestor of all leaves specified in names. Use of this option necessarily supposes that the tree is rooted (however, there are no requirements regarding its shape, such as bifurcation at the base) and it is not allowed to specify both_sides=True. With this option, it is not possible to have a None return value (since it is required that all leaves are present in the tree, in the worst case the base of the tree is returned).

Parameters:
  • names – a set (or compatible) specifying the requested leaves (as node labels, normally str instances).
  • ancestral – whether to look for the most recent common ancestral clade containing requested leaves (by default, looking for the clade containing the exact same list of leaves).
  • both_sides – only allowed with ancestral=False. Look for both the requested list of leaves and its complement, which allows detecting a clade even if it spans the base of the tree.
Returns:

The Node instance, if it exists, which has exactly the same list of descendants as names. If no such clade is found, returns None.

Warning

This method assumes that all leaf names of the tree are unique, as well as the list of names provided as argument. If this condition is not fulfilled, the right clade might not be found even if it exists.

Changed in version 3.0.0: Replaces previous methods findGroup(), findMonophyleticGroup(), smallest_group() and smallest_monophyleticGroup() with a modification of the underlying algorithm.
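
The difference between the three search modes can be sketched on leaf sets alone. The code below is not EggLib code: it represents a small rooted tree by the leaf sets of its internal nodes (a simplification chosen for illustration), and the helper names exact_clade and ancestral_clade are hypothetical. The real method returns a Node instance rather than a set of names.

```python
# Leaf sets for the toy rooted tree ((A,B),(C,(D,E))).
all_leaves = frozenset("ABCDE")
# Clades defined by the internal nodes, excluding the base of the tree.
clades = {frozenset("AB"), frozenset("CDE"), frozenset("DE")}


def exact_clade(names, both_sides=False):
    """Mimic ancestral=False: the clade must match the names exactly."""
    target = frozenset(names)
    if target in clades:
        return target
    if both_sides:
        # Also accept the clade formed by all the other leaves.
        complement = all_leaves - target
        if complement in clades:
            return complement
    return None


def ancestral_clade(names):
    """Mimic ancestral=True: smallest clade containing all the names."""
    target = frozenset(names)
    candidates = [clade for clade in clades if target <= clade]
    # If no internal clade fits, fall back to the base (all leaves).
    return min(candidates, key=len, default=all_leaves)


print(sorted(exact_clade("DE")))        # ['D', 'E']
print(exact_clade("CD"))                # None
print(sorted(ancestral_clade("CD")))    # ['C', 'D', 'E']
```

In this sketch, requesting "ABC" with both_sides=True succeeds because its complement ("DE") is a clade, while ancestral_clade("AC") falls back to the full leaf set, which corresponds to the base of the tree.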

frequency_nodes(trees, relative=False)

Labels all nodes of the current instance with integers counting the number of trees, among those in the iterable trees, in which the same node exists. It is required that all leaf labels are unique.

Parameters:
  • trees – an iterable containing Tree instances with exactly the same set of leaf labels.
  • relative – node frequencies are expressed as fractions. The use of this option requires that at least one tree is provided.
Returns:

Nothing (result is available as node labels).

Warning

With the exception of the base of the tree (which is ignored by this function) and leaf labels, all previously set labels are erased.

Changed in version 3.0.0: Labels are not converted to strings anymore.
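
The counting logic can be sketched without EggLib by describing each tree as the set of leaf sets of its internal nodes. This is a simplified illustration under that assumption (it ignores how unrooted trees are compared), and clade_counts is a hypothetical helper, not a library function.

```python
def clade_counts(reference_clades, trees, relative=False):
    """Count in how many trees each reference clade is found."""
    if relative and not trees:
        raise ValueError("relative frequencies require at least one tree")
    counts = {}
    for clade in reference_clades:
        n = sum(1 for tree in trees if clade in tree)
        counts[clade] = n / len(trees) if relative else n
    return counts


AB, CD = frozenset("AB"), frozenset("CD")
reference = [AB, CD]

# Four toy trees over the same leaves A..E, each given by its clades.
trees = [
    {frozenset("AB"), frozenset("CD")},
    {frozenset("AB"), frozenset("DE")},
    {frozenset("CD"), frozenset("AE")},
    {frozenset("AB"), frozenset("CE")},
]

absolute = clade_counts(reference, trees)
fractions = clade_counts(reference, trees, relative=True)
print(absolute[AB], absolute[CD])    # 3 2
print(fractions[AB], fractions[CD])  # 0.75 0.5
```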

get_leaf(label)

Return the terminal node (as a Node instance) that has the requested leaf label. If several nodes have this label, returns the first one. If no nodes have this label, returns None.

iter_leaves()

Return an iterator over the leaf nodes.

New in version 3.0.0: Replaces the method get_terminal_nodes() (the user needs to get labels of those to replace the method all_leaves()).

lateralize(reverse=False)

Modify the order of the children of all nodes of the tree such that the children are sorted from the smallest to the largest number of descending leaves.

Parameters:reverse – sort from the largest to the smallest number of descending leaves instead.
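
As an illustration of the ordering rule, here is a minimal sketch on a tree written as nested lists (leaves are strings). It is not EggLib code: the nested-list representation and the helper names are assumptions made for the example, and the sketch returns a new structure whereas the method modifies the tree in place.

```python
def count_leaves(node):
    """Number of leaves below a node (a leaf counts as 1)."""
    if isinstance(node, str):
        return 1
    return sum(count_leaves(child) for child in node)


def lateralize_sketch(node, reverse=False):
    """Return a copy with children sorted by number of descending leaves."""
    if isinstance(node, str):
        return node
    children = [lateralize_sketch(child, reverse) for child in node]
    return sorted(children, key=count_leaves, reverse=reverse)


toy = [["a", "b", "c"], "d", ["e", "f"]]
ascending = lateralize_sketch(toy)
descending = lateralize_sketch(toy, reverse=True)
print(ascending)   # ['d', ['e', 'f'], ['a', 'b', 'c']]
print(descending)  # [['a', 'b', 'c'], ['e', 'f'], 'd']
```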
map_descendants()

Generate and return a dict where keys are all nodes (as Node instances) of the tree except its base and all terminal nodes, and values are the tuple of leaf labels that descend from this node.

New in version 3.0.0.

midroot()

Automatic midpoint rooting of the tree. The tree must be initially unrooted (trifurcation at the root). This method identifies the most distant pair of terminal nodes (in case of a draw, one is picked randomly) and places the root of the tree (as a new node) at the midpoint of the path between them.

newick(labels=True, brlens=True)

Return the newick-formatted string representing the instance.

Parameters:
  • labels – if False, omit the internal branch labels.
  • brlens – if False, omit the branch lengths.
num_leaves

Number of terminal nodes in the tree. If the tree is empty (only the default base node), the number of leaves is 0.

num_nodes

Total number of nodes in the tree (the number is never smaller than 1, even for empty trees).

remove_node(node, drop_parent=True)

Remove a node from the tree, as well as all its descendants. Since this operation may create a node with a single child, this method may remove the parent or the brother of the removed node depending on the structure of the tree (see drop_parent), unless specified otherwise.

Parameters:
  • node – the node to remove, as a Node instance belonging to the current tree. Terminal nodes can be removed, but not the base of the tree.
  • drop_parent – if True, remove the parent of the removed node if it is left with only one child. If the parent is the base of the tree, remove the other descendant if it is not terminal (see below).

Assume we remove node [3] from the tree with this structure:

               /---------------------------->[2]
               |
 /----------->[1]           /--------------->[4]
 |             |            |
 |             \---------->[3]
 |                          |
[0]                         \--------------->[5]
 |
 |              /--------------------------->[7]
 |              | 
 \------------>[6]            /------------->[9]
                |             |
                \----------->[8]
                              |
                              \------------>[10]

Then, we would end up with the following tree:

 /----------->[1]--------------------------->[2]
 |
 |
[0]
 |              /--------------------------->[7]
 |              | 
 \------------>[6]            /------------->[9]
                |             |
                \----------->[8]
                              |
                              \------------>[10]

The default behaviour is then to remove node [1] (and delete its label if it exists) and to set the length of the branch from [0] to [2] to the sum of the [0] to [1] and [1] to [2]. But, with drop_parent=False, the tree is left as is.

There is a special case with the base of the tree. Assume that we remove node [1] from the original tree above. We then would have a non-standard structure with a single child to the base of the tree:

                /--------------------------->[7]
                | 
[0]----------->[6]            /------------->[9]
                |             |
                \----------->[8]
                              |
                              \------------>[10]

In that case, the base is not removed, but node [6] is removed using the collapse() method (using options ignore_len=False but ignore_label=True since the base is not supposed to bear a label). We end up with the following structure:

 /--------------------------------->[7]
 | 
[0]                  /------------->[9]
 |                   |
 \----------------->[8]
                     |
                     \------------>[10]

The length of the branch from [0] to [6] is spread equally between the branch from [0] to [7] and the branch from [0] to [8] (and so on if there are more than two descendants). If drop_parent=False or if [6] is a terminal node, it is not removed.
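
The two branch-length adjustments described above amount to simple arithmetic, sketched below with hypothetical helper names and made-up lengths. The second function reads "spread equally" as giving each child an equal share of the collapsed branch; this is an interpretation of the text, not a statement about the library's implementation.

```python
def merge_lengths(base_to_parent, parent_to_child):
    """Default case: the dropped parent's branch is added to its child's."""
    return base_to_parent + parent_to_child


def spread_length(collapsed, child_lengths):
    """Base case: share the collapsed branch equally among the children."""
    share = collapsed / len(child_lengths)
    return [length + share for length in child_lengths]


# Removing [3]: [1] is dropped, so [0]->[2] = [0]->[1] + [1]->[2].
merged = merge_lengths(0.5, 1.5)

# Removing [1]: [6] is collapsed, its branch is shared by [7] and [8].
spread = spread_length(0.5, [1.0, 2.0])

print(merged)  # 2.0
print(spread)  # [1.25, 2.25]
```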

root(outgroup, branch_split=0.5, reoriente=False)

Roots or reorients the tree. By default, a new node is created to represent the root and is placed on the branch leading to the provided outgroup node (the second argument determines where the new node is placed on this branch). Otherwise, the tree is reoriented such that its base is placed at the location of the provided outgroup. In the former case, the tree ends with a bifurcation at the root; in the latter case, a trifurcation.

Parameters:
  • outgroup – Node instance contained in this tree. It can be a leaf or any internal node, but not the current base of the tree (unless reoriente is True: in that case, it may be the base of the tree [which changes nothing] but it cannot be a leaf).
  • branch_split – where to cut the branch leading to the outgroup.
  • reoriente – don’t create any root node (branch_split is therefore not considered) and only place the node provided as the outgroup argument at the base of the tree, thereby merely changing the representation of the tree.

The information below describes the case where reoriente=False (proper rooting).

If the branch to the provided outgroup doesn’t have a branch length, the branch_split argument is ignored. Otherwise, branch_split must be a real number between 0 and 1 and gives the proportion of the branch that must be allocated to the basal branch leading to the outgroup, the complement being allocated to the branch leading to the rest of the tree. If branch_split is either 0 or 1, one of the branches will have a length of 0, but it will exist anyway.

It is illegal to call this method on trees that are already rooted (have a bifurcation at the root).

If the original tree has this structure:

 /------------------------------------------->[1]
 |
 |             /----------------------------->[3]
 |             |
 |----------->[2]             /-------------->[5]
 |             |              |
 |             \------------>[4]
[0]                           |
 |                            \-------------->[6]
 |
 |              /---------------------------->[8]
 |              | 
 \------------>[7]            /------------->[10]
                |             |
                \---[ROOT]-->[9]
                              |
                              \------------->[11]

And rooting is requested at node [9], the root will be placed on the branch marked by [ROOT]. The outcome will be as depicted below, with the introduction of a new node (marked [ROOT]) and the reorientation of the tree to place it at the base:

                     /-------------------------[1]
                     |
              /-----[0]     /------------------[3]
              |      |      |
              |      \-----[2]      /----------[5]
              |             |       |
  /---[E2]---[7]            \------[4]
  |           |                     |
  |           |                     \----------[6]
  |           |
[ROOT]        \--------------------------------[8]
  |
  |                     /---------------------[10]
  |                     |
  \--------[E1]--------[9]
                        |
                        \---------------------[11]

In this example, the relationship between nodes [7] and [0] (the previous base of the tree) is reversed. The label of node [7] is automatically transferred to node [0]. This is consistent with the idea that internal node labels describe a property of the branch. The original label of the base, if it exists, is discarded. Since the branch between [7] and [9] is cut in two, the original label of node [9] is copied to node [7], leaving them both with the same label. However, if the outgroup is a terminal node, the label is not copied and the other basal branch is left without label.

Let L be the length of the branch from [7] to [9] in the original tree, and r the value of the parameter branch_split. The length of the branch [E1] will be set to rL, and the branch [E2] to (1-r)L. Overall, the length of the tree will not be modified.
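
The split of the cut branch can be written out as a short sketch. split_branch is a hypothetical helper (not part of EggLib) that applies the rL and (1-r)L rule given above and returns the two new lengths.

```python
def split_branch(length, branch_split=0.5):
    """Return (length of [E1], length of [E2]) for a branch cut by the root."""
    if length is None:
        # No branch length available: branch_split is ignored.
        return None, None
    if not 0 <= branch_split <= 1:
        raise ValueError("branch_split must lie between 0 and 1")
    to_outgroup = branch_split * length      # r * L
    to_rest = (1 - branch_split) * length    # (1 - r) * L
    return to_outgroup, to_rest


e1, e2 = split_branch(2.0, branch_split=0.25)
print(e1, e2)  # 0.5 1.5
```

The two values always sum to the original length, which is why the total length of the tree is unchanged; with branch_split set to 0 or 1, one of the two branches has length 0.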

In the case that reoriente=True, the final tree is rather:

 /------------------------------------------->[10]
 |
 |------------------------------------------->[11]
 |
[9]       /----------------------------------->[8]
 |        |
 \------>[7]       /-------------------------->[1]
          |        |
          \------>[0]      /------------------>[3]
                   |       |
                   \----->[2]        /-------->[5]
                           |         |
                           \------->[4]
                                     |
                                     \-------->[6]

The topology of the tree is the same as the initial one, except that the base is now [9]. The lengths of all branches are conserved. However, node labels between the old and the new base are shifted: the node label of the new base ([9] in the example) is assigned to the next node ([7] in the example) and so on until the old base ([0] in the example), whose label, if it exists, is discarded.

Changed in version 3.0.0: Merged with reoriente().

total_length()

Compute the sum of all branch lengths across all nodes of the tree. All branch lengths must be defined (non-None), otherwise a ValueError will be raised.
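
The behaviour with undefined lengths can be sketched as follows. The function below works on a plain list of branch lengths and is a hypothetical illustration, not the library's code.

```python
def total_length_sketch(branch_lengths):
    """Sum branch lengths, refusing undefined (None) values."""
    lengths = list(branch_lengths)
    if any(length is None for length in lengths):
        raise ValueError("all branch lengths must be defined")
    return sum(lengths)


total = total_length_sketch([0.5, 1.25, 0.25])
print(total)  # 2.0
```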

unroot(reverse=False)

Remove the root. The tree must be initially rooted (bifurcation at the root). This method removes the root node and places the base of the tree at one of the two basal nodes (the nodes that are ancestral to the two basal groups). This method does not change the total length of the tree. An error is raised if only one of the two basal branches has a length. If the initial basal node has a label, it is lost. If the node that becomes the base has a label, it is left there (it will appear at the base of the tree).

Parameters:reverse – if True, place the base of the tree at the second basal node (by default, the first basal node is used).

Helpers

class egglib._interface.DataBase

Base class for Align and Container.

This base class cannot be instantiated. Attempting to do so will raise an exception.

add_outgroup(name, data, group=None)

Add an outgroup sample to the instance.

Parameters:
  • name – name of the new sample.
  • data – an iterable (may be a string or a list) containing the data to set for the new sample. For an Align instance and if the sample is not the first, the number of data entries must match that of any previous samples (ingroup or outgroup).
  • group – if not None, must be an unsigned integer (the default value is 0).
add_sample(name, data, groups=None)

Add a sample to the instance.

Parameters:
  • name – name of the new sample.
  • data – an iterable (may be a string or a list) containing the data to set for the new sample. For an Align instance and if the sample is not the first, the number of data entries must match that of any previous samples (including outgroup samples, if any).
  • groups – if not None, must be an iterable with an unsigned integer for each group level of the instance (if None, all group labels are set to 0), or a single integer value if only one level needs to be specified.
add_samples(items)

Add several samples at the end of the instance.

Parameters:items – items must be an iterable that has a length (for example a list, an Align instance or a Container instance). Each item of items must be of length 2 or 3. The first element must be the sample name string, the second is the data values (as a list of signed integers or a single string) and the third, if provided, is the list of group labels (one unsigned integer for each level). See the method add_sample() for more details about each added item. If the current instance is an Align, all added items must have the same length, which must be the same as that of any items currently present in the instance. For a Container, the items may have different lengths. The number of group levels is set by the sample with the largest number of group labels, and all samples with fewer labels are completed with the default value (0).

New in version 2.0.1: Original name is addSequences().

Changed in version 3.0.0: Renamed as add_samples. All items are added in one shot (rather than calling the one-sample add method iteratively).
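
The padding rule for group labels can be illustrated with plain tuples. This sketch does not call EggLib; pad_groups is a hypothetical helper that only mirrors the rule stated above (the number of levels is set by the item with the most labels, and missing labels default to 0).

```python
def pad_groups(items):
    """Return one list of group labels per item, padded with 0."""
    groups = [list(item[2]) if len(item) == 3 else [] for item in items]
    levels = max((len(g) for g in groups), default=0)
    return [g + [0] * (levels - len(g)) for g in groups]


items = [
    ("sample1", "ACGT"),          # no group labels
    ("sample2", "ACGT", [1]),     # one level
    ("sample3", "ACGT", [2, 7]),  # two levels
]

padded = pad_groups(items)
print(padded)  # [[0, 0], [1, 0], [2, 7]]
```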

clear()

Clear the instance and release all memory. In most cases, it is preferable to use the method reset().

classmethod create(obj)

Create a new instance by copying data from the data container passed as obj. The object obj can be:

  • an Align,
  • a Container (but all sequences are required to have the same length if the target type is Align),
  • any iterable type yielding acceptable items (see the documentation for add_samples() for more details).

New in version 2.0.1.

del_outgroup(index)

Delete an outgroup item of the instance.

Parameters:index – index of the outgroup sample to delete.
del_sample(index)

Delete an item of the instance.

Parameters:index – index of the sample to delete.
encode(nbits=10, random=None, include_outgroup=False)

Renames all sequences using a random mapping of names to unique keys of length nbits.

Parameters:
  • nbits – length of the keys (encoded names). This value must be >= 4 and <= 63.
  • random – a Random instance to be used as a random generator. By default, use a default instance.
  • include_outgroup – a bool indicating whether the outgroup samples should also be considered (in this case, a single mapping is returned including both ingroup and outgroup samples).
Returns:

A dictionary mapping all the generated keys to the actual sequence names. The keys are case-dependent and guaranteed not to start with a number.

The returned mapping can be used to restore the original names using rename(). This method is not affected by the presence of sequences with identical names in the original instance (and rename() will also work properly in that case).

New in version 2.0.1.

Changed in version 3.0.0: Keys are forced to start with a capital letter. Now take a Random instance. Use library’s own random number generator. Added an option for the outgroup.
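
The idea behind the key mapping can be sketched with the standard library. This is not EggLib's algorithm: the alphabet (ASCII letters and digits), the use of Python's random module instead of the library's own generator, and the name encode_names are all assumptions made for illustration.

```python
import random
import string


def encode_names(names, nbits=10, rng=None):
    """Map unique random keys (starting with a capital letter) to names."""
    if not 4 <= nbits <= 63:
        raise ValueError("nbits must be >= 4 and <= 63")
    rng = rng or random.Random()
    alphabet = string.ascii_letters + string.digits
    mapping = {}
    for name in names:
        while True:
            key = rng.choice(string.ascii_uppercase) + "".join(
                rng.choice(alphabet) for _ in range(nbits - 1))
            if key not in mapping:  # keys must be unique
                break
        mapping[key] = name
    return mapping


names = ["seq1", "seq1", "seq2"]  # duplicated names are allowed
mapping = encode_names(names, rng=random.Random(1))

# The keys, in insertion order, are the encoded names; the mapping
# restores the original names, duplicates included.
restored = [mapping[key] for key in mapping]
```

Because each sample receives its own key, two samples sharing a name are still mapped back correctly, which is the property described above for rename().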

find(name, include_outgroup=False, regex=False, multi=False, flags=None, index=False)

Find a sample by its name.

Parameters:
  • name – name of sample to identify.
  • include_outgroup – a boolean indicating whether the outgroup should be considered.
  • regex – a boolean indicating whether the value passed as name is a regular expression. If so, the string is passed as is to the re module (using function re.search()). Otherwise, only exact matches will be considered.
  • multi – a boolean indicating whether all hits should be returned. If so, a list of SampleView instances is always returned (the list will be empty in case of no hits). Otherwise, a single SampleView instance (or its index) will be returned for the first hit, or None in case of no hits.
  • flags – list of flags to be passed to re.search() (ignored if regex is False). For example, when looking for samples containing the term “sample”, for being case insensitive, use the following syntax: align.find("sample", regex=True, flags=[re.I]). By default (None) no further argument is passed.
  • index – boolean indicating whether the index of the sample should be returned. In that case return values for hits are int (by default, SampleView) instances. Warning: it is not allowed to set both include_outgroup and index to True as there would be no way to distinguish between ingroup and outgroup indexes.
Returns:

If multi is True, a (potentially empty) list of SampleView instances (or int values if index is True). Otherwise, a single SampleView or int, or None if no hits were found.
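
The interplay of regex, multi and flags can be sketched on a plain list of names. The code below is an illustration only: find_names is a hypothetical helper that returns indexes (as with index=True), and combining the list of flags with a bitwise OR is an assumption about how such a list would be handed to re.search().

```python
import re
from functools import reduce
from operator import or_


def find_names(names, name, regex=False, multi=False, flags=None):
    """Return the index (or list of indexes) of names matching `name`."""
    if regex:
        combined = reduce(or_, flags or [], 0)  # e.g. [re.I] -> re.I
        hits = [i for i, candidate in enumerate(names)
                if re.search(name, candidate, combined)]
    else:
        hits = [i for i, candidate in enumerate(names) if candidate == name]
    if multi:
        return hits                    # possibly empty list
    return hits[0] if hits else None   # first hit, or None


names = ["Sample_1", "control", "sample_2"]
print(find_names(names, "sample", regex=True, multi=True, flags=[re.I]))  # [0, 2]
print(find_names(names, "sample", regex=True))  # 2
print(find_names(names, "sample"))              # None
```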

find_motif(sample, motif, start=0, stop=None)

Locate the first instance of a motif for an ingroup sample.

Returns the index of the first exact hit to a given substring. The returned value is the position of the first base of the hit. Only exact matches are implemented. To use regular expression (for example to find degenerated motifs), one should extract the string for the sequence and use a tool such as the regular expression module (re).

Parameters:
  • sample – sample index.
  • motif – a list of integers or one-character strings (or a mix of both), or a string, constituting the motif to search.
  • start – position at which to start searching. The method will never return a value smaller than start. By default, search from the start of the sequence.
  • stop – position at which to stop searching (the motif cannot overlap this position). No returned value will be larger than stop minus the length of the motif. By default, or if stop is equal to or larger than the length of the sequence, search until the end of the sequence.
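
The start and stop semantics can be sketched on a plain Python sequence. This is not the library's implementation: find_motif_sketch is a hypothetical helper, and returning None when there is no hit is an assumption, since the return value in that case is not stated above.

```python
def find_motif_sketch(sequence, motif, start=0, stop=None):
    """Return the position of the first exact hit, or None (assumption)."""
    # Accept integers and one-character strings, as for the real method.
    seq = [ord(x) if isinstance(x, str) else x for x in sequence]
    mot = [ord(x) if isinstance(x, str) else x for x in motif]
    if stop is None or stop > len(seq):
        stop = len(seq)
    # The last admissible position is stop minus the motif length.
    for pos in range(start, stop - len(mot) + 1):
        if seq[pos:pos + len(mot)] == mot:
            return pos
    return None


seq = "ACGTACGT"
print(find_motif_sketch(seq, "GT"))                   # 2
print(find_motif_sketch(seq, "GT", start=3))          # 6
print(find_motif_sketch(seq, "GT", start=3, stop=7))  # None
print(find_motif_sketch(seq, [71, "T"]))              # 2 (71 is ord("G"))
```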
find_motif_o(sample, motif, start=0, stop=None)

Locate the first instance of a motif for an outgroup sample.

Returns the index of the first exact hit to a given substring. The returned value is the position of the first base of the hit. Only exact matches are implemented. To use regular expression (for example to find degenerated motifs), one should extract the string for the sequence and use a tool such as the regular expression module (re).

Parameters:
  • sample – outgroup sample index.
  • motif – a list of integers or one-character strings (or a mix of both), or a string, constituting the motif to search.
  • start – position at which to start searching. The method will never return a value smaller than start. By default, search from the start of the sequence.
  • stop – position at which to stop search (the motif cannot overlap this position). No returned value will be larger than stop minus the length of the motif. By default, or if stop is equal to or larger than the length of the sequence, search until the end of the sequence.
get_i(sample, site)

Get a data entry.

Parameters:
  • sample – sample index.
  • site – site index.
get_label(sample, level)

Get a group label.

Parameters:
  • sample – sample index.
  • level – level index.
get_label_o(sample)

Get an outgroup label.

Parameters:sample – sample index.
get_name(index)

Get the name of a sample.

Parameters:index – sample index.
get_name_o(index)

Get the name of an outgroup sample.

Parameters:index – outgroup sample index.
get_o(sample, site)

Get an outgroup data entry.

Parameters:
  • sample – outgroup sample index.
  • site – site index.
get_outgroup(index)

Get the SampleView instance corresponding to the requested index for the outgroup. The returned object allows modifying the underlying data.

Parameters:index – index of the outgroup sample to access.
get_sample(index)

Get the SampleView instance corresponding to the requested index. The returned object allows modifying the underlying data.

Parameters:index – index of the sample to access.
get_sequence(index)

Access to the data entries of a given ingroup sample. Returns a SequenceView instance which allows modifying the underlying sequence container object.

Parameters:index – ingroup sample index.
get_sequence_o(index)

Access to the data entries of a given outgroup sample. Returns a SequenceView instance which allows modifying the underlying sequence container object.

Parameters:index – outgroup sample index.
group_mapping(level=0, indices=False)

Generates a dictionary mapping group labels to either SampleView instances (by default) or their indexes, representing all samples of this instance. It can process an ingroup label level or the outgroup.

Parameters:
  • level – index of the level to consider. If None, processes the outgroup.
  • indices – if True, represent samples by their index (within the ingroup or the outgroup depending on level being non-None or None) instead of SampleView instances.
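
The shape of the returned dictionary (with indices=True) can be sketched from a plain list of labels for one level. The helper below is hypothetical and does not use EggLib; the use of lists as values is an assumption about the container type.

```python
from collections import defaultdict


def group_mapping_sketch(labels):
    """Map each group label to the list of sample indexes bearing it."""
    mapping = defaultdict(list)
    for index, label in enumerate(labels):
        mapping[label].append(index)
    return dict(mapping)


# Labels of five samples at a given level (labels need not be consecutive).
labels = [0, 1, 0, 5, 1]
by_group = group_mapping_sketch(labels)
print(by_group)  # {0: [0, 2], 1: [1, 4], 5: [3]}
```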
is_matrix

True if the instance is an Align, and False if it is a Container.

iter_outgroup()

Iterator over the outgroup.

iter_samples(ingrp, outgrp)

Iterator over either ingroup samples or outgroup samples, or both ingroup and outgroup (in that order) samples. Provided for flexibility.

Parameters:
  • ingrp – include ingroup samples.
  • outgrp – include outgroup samples.
name_mapping()

Generates a dictionary mapping names to SampleView instances representing all samples of this instance. This method is most useful when several sequences have the same name. It may be used to detect and process duplicates. It processes the ingroup only.

names()

Generate the list of ingroup sample names.

names_outgroup()

Generate the list of outgroup sample names.

ng

Number of group levels for the ingroup (for the outgroup, this number is always one). It is possible to change the value directly.

no

Current number of outgroup samples (cannot be modified directly).

ns

Current number of samples, not considering the outgroup (cannot be modified directly).

remove_duplicates()

Remove all duplicates, based on exact name matching. Among samples with identical names, only the one occurring first is conserved. The current instance is modified and this method returns None.
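
The "first occurrence wins" rule can be sketched on a list of names. first_occurrences is a hypothetical helper that returns the indexes of the samples that would be kept; the real method modifies the instance in place instead.

```python
def first_occurrences(names):
    """Indexes of the samples kept when removing duplicated names."""
    seen = set()
    kept = []
    for index, name in enumerate(names):
        if name not in seen:  # exact string comparison
            seen.add(name)
            kept.append(index)
    return kept


kept = first_occurrences(["a", "b", "a", "c", "b"])
print(kept)  # [0, 1, 3]
```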

rename(mapping, liberal=False, include_outgroup=False)

Rename sequences of the instance using the provided mapping.

Parameters:
  • mapping – a dict providing the mapping of old names (as keys) to new names (which may, if needed, contain duplicates).
  • liberal – if this argument is False and a name does not appear in mapping, a ValueError is raised. If liberal is True, names that don’t appear in mapping are left unchanged.
  • include_outgroup – if True, consider both ingroup and outgroup samples together (they should be provided together in the same mapping).
Returns:

The number of samples that have been actually renamed, overall.

New in version 2.0.1.

Changed in version 3.0.0: Added an option for the outgroup. Added return value.
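
The effect of the liberal option can be sketched on a list of names. rename_sketch is a hypothetical helper, not the library method: it returns both the new names and the count of renamed samples, whereas rename() modifies the instance and returns only the count.

```python
def rename_sketch(names, mapping, liberal=False):
    """Return (new names, number of samples actually renamed)."""
    new_names = []
    renamed = 0
    for name in names:
        if name in mapping:
            new_names.append(mapping[name])
            renamed += 1
        elif liberal:
            new_names.append(name)  # left unchanged
        else:
            raise ValueError("name not found in mapping: " + name)
    return new_names, renamed


result = rename_sketch(["a", "b", "c"], {"a": "x", "c": "z"}, liberal=True)
print(result)  # (['x', 'b', 'z'], 2)
```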

reserve(nsam=0, nout=0, lnames=0, ngrp=0, nsit=0)

Pre-allocate memory. This method can be used when the size of arrays is known a priori, in order to speed up memory allocations. It is not necessary to set all values. Values less than 0 are ignored.

Parameters:
  • nsam – number of samples in the ingroup.
  • nout – number of samples in the outgroup.
  • lnames – length of sample names.
  • ngrp – number of group labels.
  • nsit – number of sites.
reset()

Reset the instance.

set_i(sample, site, value)

Set an ingroup data entry. The value must be a signed integer or a one-character string. In the latter case, ord() is called on behalf of the user.

Parameters:
  • sample – sample index.
  • site – site index.
  • value – allele value.
set_label(sample, level, value)

Set a group label.

Parameters:
  • sample – sample index.
  • level – level index.
  • value – new group value (unsigned integer).
set_label_o(sample, value)

Set an outgroup label.

Parameters:
  • sample – sample index.
  • value – new group value (unsigned integer).
set_name(index, name)

Set the name of a sample.

Parameters:
  • index – index of the sample
  • name – new name value.
set_name_o(index, name)

Set the name of an outgroup sample.

Parameters:
  • index – index of the outgroup sample
  • name – new name value.
set_o(sample, site, value)

Set an outgroup data entry. The value must be a signed integer or a one-character string. In the latter case, ord() is called on behalf of the user.

Parameters:
  • sample – sample index.
  • site – site index.
  • value – allele value.
set_outgroup(index, name, data, group=None)

Set the values for a sample of the outgroup.

Parameters:
  • index – index of the sample to access (slices are not permitted).
  • name – new name of the sample.
  • data – string or list of integers giving the new values to set. In case of an Align, it is required to pass a sequence with length matching the number of sites of the instance.
  • group – an unsigned integer to serve as group label (the default value is 0).
set_sample(index, name, data, groups=None)

Set the values for the sample corresponding to the requested index.

Parameters:
  • index – index of the sample to access (slices are not permitted).
  • name – new name of the sample.
  • data – string or list of integers giving the new values to set. In case of an Align, it is required to pass a sequence with length matching the number of sites of the instance.
  • groups – a list of integer label values, or a single integer value, to set as group labels. The class will ensure that all samples have the same number of group labels, padding with 0 as necessary. The default corresponds to an empty list.
set_sequence(index, value)

Replace all data entries for a given ingroup sample by new values.

Parameters:
  • index – ingroup sample index.
  • value – can be a SequenceView instance, a string or a list of integers (or any iterable with a length which may mix single-character strings and integers). When modifying an Align instance, all sequences must have the same length as the current alignment length.
set_sequence_o(index, value)

Replace all data entries for a given outgroup sample by new values.

Parameters:
  • index – outgroup sample index.
  • value – can be a SequenceView instance, a string or a list of integers (or any iterable with a length which may mix single-character strings and integers). When modifying an Align instance, all sequences must have the same length as the current alignment length.
shuffle(level=0, random=None)

Shuffle group labels.

Randomly reassigns group labels. Modifies the current instance and returns None. Only the specified level is affected, and only the group labels are modified (the order of samples is not changed).

Parameters:
  • level – index of the group level to shuffle. To shuffle the outgroup’s labels, set level to None.
  • random – a Random instance to be used as a random generator. By default, use a default instance.

Changed in version 3.0.0: The outgroup is necessarily processed separately from the rest of the instance. Allows passing a Random instance. Use library’s own random number generator.

subgroup(groups, outgroup_all=False)

Generate and return a copy of the instance with only samples from the specified groups (identified by their group labels, including outgroup).

Parameters:
  • groups – an integer or a dictionary providing the labels of the groups that are selected. If an integer, it is understood as an ingroup label corresponding to the first grouping level (if more than one). If a dictionary, the keys of this dictionary must be integers corresponding to grouping levels (or None for the outgroup) and the values must be lists containing the requested labels. It is not required to include all levels and the outgroup in the dictionary. It is allowed to specify labels that are actually not represented in the data.
  • outgroup_all – if True, always include all outgroup samples in the returned instance, regardless of whether their group labels are specified in the first argument.

New in version 3.0.0.

subset(samples, outgroup=None)

Generate and return a copy of the instance with only a specified list of samples. It is possible to select ingroup and/or outgroup samples and the sample indexes are not required to be consecutive.

Parameters:
  • samples – a list (or other iterable type with a length) of sample indexes giving the list of ingroup samples that must be exported to the return value object. If None, do not export any ingroup samples.
  • outgroup – a list (or other iterable type with a length) of sample indexes giving the list of outgroup samples that must be exported to the return value object. If None, do not export any outgroup samples.

New in version 3.0.0: Established as a method for Align and Container.

to_fasta(fname=None, first=0, last=4294967295, mapping=None, groups=False, shift_labels=False, include_outgroup=True, linelength=50)

Export alignment in the fasta format.

Parameters:
  • fname – Name of the file to export data to. By default, the file is created (or overwritten if it already exists). If the option append is True, data is appended at the end of the file (and it must exist). If fname is None (default), no file is created and the formatted data is returned as a str. In the alternative case, nothing is returned.
  • first – If only part of the sequences should be exported: index of the first sequence to export.
  • last – If only part of the sequences should be exported: index of the last sequence to export. If the value is larger than the index of the last sequence, all sequences are exported until the last (this is the default). If last < first, no sequences are exported.
  • mapping – A string providing the character mapping. Use the specified list of characters to map integer allelic values. If a non-empty string is provided, the length of the string must be larger than the largest possible allele value. In that case, the allele values will be used as indexes to determine which character from this string must be used for output. If this method is used with an empty string, the mapping will not be used and the allele values will be cast directly to characters.
  • groups – A boolean indicating whether group labels should be exported, or ignored.
  • shift_labels – A boolean indicating whether group labels should be incremented by one when exporting.
  • include_outgroup – A boolean indicating whether the outgroup should be exported. It will be exported at the end of the ingroup, and without discriminating label if include_labels is False.
  • linelength – The length of lines for internal breaking of sequences.
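
How mapping and linelength interact can be illustrated with a minimal, library-independent sketch. The function format_fasta below is hypothetical and is not EggLib's implementation; it only mimics the behaviour described above for these two options, taking plain (name, alleles) pairs as input.

```python
def format_fasta(samples, mapping=None, linelength=50):
    """Format (name, alleles) pairs as fasta text.

    Hypothetical helper written for illustration; not part of EggLib.
    """
    lines = []
    for name, alleles in samples:
        if mapping:
            # Allele values are used as indexes into the mapping string.
            seq = "".join(mapping[a] for a in alleles)
        else:
            # No mapping: allele values are converted directly to characters.
            seq = "".join(chr(a) for a in alleles)
        lines.append(">" + name)
        # Break the sequence into lines of at most `linelength` characters.
        for i in range(0, len(seq), linelength):
            lines.append(seq[i:i + linelength])
    return "\n".join(lines) + "\n"


text = format_fasta([("s1", [0, 1, 2, 3, 0, 1])], mapping="ACGT", linelength=4)
```

In this sketch, the six alleles are written on two lines (ACGT, then AC) because linelength is 4.
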
to_outgroup(index, label=0)

Transfer a sample to the outgroup.

Parameters:
  • index – index of the sample to move to the outgroup.
  • label – label to assign to the sample (usually, this label assigns the sample to an individual).
class egglib.SampleView(parent, index, outgroup)

This class manages the name, sequence and groups of an item of the Align and Container classes. SampleView objects allow iteration and general manipulation of (large) data sets without unnecessary extraction of full sequences. Modifications of SampleView objects are immediately applied to the underlying data holder object. SampleView objects are iterable and allow indexing (the values are: name, sequence, group, in that order).

In principle, only Align and Container instances are supposed to build SampleView instances.

Parameters:
  • parent – an Align or Container instance.
  • index – an index within the parent instance.
  • outgroup – a boolean indicating whether the instance represents an outgroup (rather than ingroup) sample.

New in version 2.0.1.

Changed in version 3.0.0: Renamed from SequenceItem. No more index check at construction. No more string formatting. Supports multiple groups. Sequence managed by SequenceView and groups by GroupView.

group

Access to group labels. Returns a GroupView instance that can be modified. The values used for assignment may be a GroupView, a list of integers or a single integer. If fewer values are provided than the number of group levels, the labels after the last provided value are left unchanged.
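
The partial-assignment rule can be sketched on a plain list of labels. The helper assign_groups is hypothetical and not part of EggLib; rejecting a surplus of values is an assumption of this sketch, since the behaviour in that case is not stated here.

```python
def assign_groups(current, new):
    """Overwrite the leading group labels of `current` in place.

    Hypothetical helper written for illustration; not part of EggLib.
    """
    # A single integer is treated as a one-item list.
    values = [new] if isinstance(new, int) else list(new)
    if len(values) > len(current):
        # Assumption of this sketch: extra values are rejected.
        raise ValueError("more values than group levels")
    # Labels beyond the supplied values keep their previous value.
    current[:len(values)] = values


labels = [1, 2, 3]
assign_groups(labels, [7])  # only the first level is changed
```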

index

Index of the sample in the parent (Align or Container instance containing this sample).

ls

Length of the sequence for this sample.

name

Get or set the sample name.

outgroup

True if the sample is part of the outgroup. This value cannot be modified.

parent

Reference of the parent instance (Align or Container instance containing this sample).

sequence

Access to data entries. This attribute is represented by a SequenceView instance, which allows modifying the underlying sequence container object. It is possible to set this attribute using either a string or a list of integers (it can be any iterable with a length, and it may mix single-character strings and integers to represent individual data entries). When modifying an Align instance, the new sequence must have the same length as the current alignment length.

class egglib.SequenceView(parent, index, outgroup)

This class manages the sequence of an item of the Align and Container classes. Supports iteration and random access (including with slices) which is the preferred way if sequences are large because it prevents extraction of the full sequence.

In principle, only SampleView, Align and Container instances are supposed to build SequenceView instances.

Parameters:
  • parent – an Align or Container instance.
  • index – an index within the parent instance.
  • outgroup – a boolean indicating whether the instance represents an outgroup (rather than ingroup) sample.

New in version 3.0.0.

find(motif, start=0, stop=None)

Locate the first instance of a motif.

Returns the index of the first exact hit to a given substring. The returned value is the position of the first base of the hit. Only exact matches are implemented. To use regular expressions (for example to find degenerate motifs), one should extract the string for the sequence and use a tool such as the regular expression module (re).

Parameters:
  • motif – the motif to search for, given as a string or as a list of integers, one-character strings, or a mix of both.
  • start – position at which to start searching. The method will never return a value smaller than start. By default, search from the start of the sequence.
  • stop – position at which to stop searching (the motif cannot overlap this position). No returned value will be larger than stop-n, where n is the length of the motif. By default, or if stop is equal to or larger than the length of the sequence, search until the end of the sequence.
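
The start and stop semantics described above can be sketched in pure Python. The function find_motif is hypothetical and not EggLib's implementation; returning None when there is no hit is an assumption of this sketch.

```python
def to_ints(values):
    """Convert a string, or a list mixing integers and one-character
    strings, to a list of integers."""
    return [ord(v) if isinstance(v, str) else v for v in values]


def find_motif(seq, motif, start=0, stop=None):
    """Return the index of the first exact hit of `motif` in `seq`.

    Hypothetical helper written for illustration; not part of EggLib.
    Returns None if there is no hit (an assumption of this sketch).
    """
    seq = to_ints(seq)
    motif = to_ints(motif)
    n = len(motif)
    if stop is None or stop > len(seq):
        stop = len(seq)
    # A hit starting at i covers positions i .. i+n-1, which must all lie
    # before `stop`, so the last candidate start position is stop - n.
    for i in range(start, stop - n + 1):
        if seq[i:i + n] == motif:
            return i
    return None


first = find_motif("AACGTACGT", "ACG")            # hit at the first occurrence
second = find_motif("AACGTACGT", "ACG", start=2)  # skips the first occurrence
```
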
insert(position, values)

Insert data entries.

This method is only available for samples belonging to a Container instance. For Align instances, (for which it is possible to insert data entries to all ingroup and outgroup samples) use the method insert_columns().

Parameters:
  • position – the position at which to insert sites. The new sites are inserted before the specified index. Use 0 to add sites at the beginning of the sequence, and the current number of sites for this sample to add sites at the end. If the value is larger than the current number of sites for this sample, or if None is provided, new sites are added at the end of the sequence.
  • values – a list of integers or a string containing the data entries to insert in the sequence. It is allowed to mix integers and one-character strings in a list or use other types provided that they are iterable and have a length
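
The handling of position can be sketched on a plain list of integers. The function insert_values is hypothetical and not part of EggLib; it only reproduces the rules stated above (insertion before the given index, and appending when position is None or too large).

```python
def insert_values(seq, position, values):
    """Insert `values` into the list `seq` before index `position`.

    Hypothetical helper written for illustration; not part of EggLib.
    """
    # Accept a string or a list mixing integers and one-character strings.
    values = [ord(v) if isinstance(v, str) else v for v in values]
    # None or an out-of-range position means "append at the end".
    if position is None or position > len(seq):
        position = len(seq)
    seq[position:position] = values


seq = [ord(c) for c in "AAA"]
insert_values(seq, 0, "CG")     # insert at the beginning
insert_values(seq, None, [84])  # append at the end (84 is "T")
```
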
string()

Generate a string from all data entries. All allele values must be >= 0. Even if this condition is met, the generated string might be non printable.
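
As a minimal sketch of why the result may be non-printable, assume that each allele value is converted to the character with that code. The function to_string is hypothetical and not part of EggLib.

```python
def to_string(alleles):
    """Build a string from non-negative integer allele values.

    Hypothetical helper written for illustration; not part of EggLib.
    """
    if any(a < 0 for a in alleles):
        raise ValueError("all allele values must be >= 0")
    return "".join(chr(a) for a in alleles)


dna = to_string([65, 67, 71, 84])  # character codes of A, C, G, T
raw = to_string([0, 1])            # valid values, but control characters
```

Here dna is an ordinary string, whereas raw contains only control characters and str.isprintable() returns False for it.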

strip(values, left=True, right=True)

Delete leading and/or trailing occurrences of any characters given in the values argument. The underlying object is modified and this method returns None.

Parameters:
  • values – the values to strip. Should be a list of integers but can also be a string.
  • left – A boolean indicating whether left-side characters should be stripped.
  • right – A boolean indicating whether right-side characters should be stripped.
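
The effect of the left and right flags can be sketched on a plain list of integers. The function strip_values is hypothetical and not EggLib's implementation; like the method described above, it modifies its argument in place and returns None.

```python
def strip_values(seq, values, left=True, right=True):
    """Remove leading and/or trailing entries of `seq` found in `values`.

    Hypothetical helper written for illustration; not part of EggLib.
    """
    # Accept a string or a list mixing integers and one-character strings.
    drop = {ord(v) if isinstance(v, str) else v for v in values}
    begin, end = 0, len(seq)
    if left:
        while begin < end and seq[begin] in drop:
            begin += 1
    if right:
        while end > begin and seq[end - 1] in drop:
            end -= 1
    # Internal occurrences are kept; only the two ends are trimmed.
    seq[:] = seq[begin:end]


seq = [ord(c) for c in "--N-ACG-T-N"]
strip_values(seq, "-N")
```
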
to_lower()

Converts all allele values of this sample to lower case. More specifically, transforms all values lying in the range A-Z to their equivalent in the range a-z. All other allele values are ignored. The underlying object data is modified and this method returns None.

to_upper()

Converts all allele values of this sample to upper case. More specifically, transforms all values lying in the range a-z to their equivalent in the range A-Z. All other allele values are ignored. The underlying object data is modified and this method returns None.
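
Because data are stored as integers, case conversion amounts to shifting values within fixed ranges. The sketch below assumes that letters are stored as their ASCII codes; the functions to_upper and to_lower are hypothetical stand-ins operating on a plain list, not EggLib's implementation.

```python
def to_upper(alleles):
    """Shift values in the range a-z to A-Z, in place; leave others as is.

    Hypothetical helper written for illustration; not part of EggLib.
    """
    for i, a in enumerate(alleles):
        if ord("a") <= a <= ord("z"):
            alleles[i] = a - 32  # ASCII distance between "a" and "A"


def to_lower(alleles):
    """Shift values in the range A-Z to a-z, in place; leave others as is."""
    for i, a in enumerate(alleles):
        if ord("A") <= a <= ord("Z"):
            alleles[i] = a + 32


data = [ord(c) for c in "acGt-n"] + [-1, 200]
to_upper(data)  # -1, 200 and "-" fall outside a-z and are not changed
```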

class egglib.GroupView(parent, index, outgroup)

This class manages the list of group labels of an item of the Align and Container classes. This class can be represented as a list of unsigned integers, is iterable, and allows random access (read and write) but not slices.

In principle, only SampleView, Align and Container instances are supposed to build GroupView instances.

Parameters:
  • parent – an Align or Container instance.
  • index – an index within the parent instance.
  • outgroup – a boolean indicating whether the instance represents an outgroup (rather than ingroup) sample.

New in version 3.0.0.

outgroup

True if the sample is part of the outgroup.

class egglib.Node(label=None)

This class provides an interface to a Tree instance’s nodes and allows access and modification of data attached to a given node as well as the tree descending from that node. A node must be understood as the point below a branch. Branches (connections between nodes) have a direction: they go from a node to another node. Nodes therefore have children and a parent (a given node has one parent, or possibly none). Connecting a node to itself, creating a two-way branch (two branches connecting the same two nodes in opposite directions), or creating duplicate branches (between the same two nodes and in the same direction) is illegal.

Parameters:label – node label (in case of a terminal node, its leaf label), if needed. Labels, if provided, are expected to be strings for terminal nodes, and numeric values for internal nodes, but technically, all user-supplied values are accepted (however, some Tree methods require proper types).
branch_to(child)

Get the length of the branch to one of this node’s children. Non-specified branch lengths are represented by None.

Parameters:child – node whose branch length should be returned. It can be represented by a direct reference (as a Node instance), or by its index in the children list. In the latter case, ensure that the index is currently valid.
child(idx)

Return a given child, as a Node instance.

children()

Return an iterator over this node’s children.

Changed in version 3.0.0: Returns an iterator instead of a list.

has_descendant(node)

Return a boolean indicating whether the Node passed as argument can be found amongst the descendants (children, children of children, and so on down to the leaves) of this node.

is_child(node)

Return True if the Node instance passed as argument is one of the children of the current node.

is_parent(node)

Return True if the Node instance passed as argument is the parent of the current node. As a side effect, passing None to an instance that has no parent will return True.
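
The relationships tested by has_descendant(), is_child() and is_parent() can be sketched with a toy node class. ToyNode and its add_child method are hypothetical and are not EggLib's Node; the sketch also shows why passing None to is_parent() on a parentless node gives True.

```python
class ToyNode:
    """Minimal stand-in for a tree node (illustration only)."""

    def __init__(self, label=None):
        self.label = label
        self.parent = None
        self.children = []

    def add_child(self, child):
        child.parent = self
        self.children.append(child)
        return child

    def is_child(self, node):
        return any(c is node for c in self.children)

    def is_parent(self, node):
        # A parentless node stores None, so is_parent(None) is True for it.
        return self.parent is node

    def has_descendant(self, node):
        # Check the children first, then recurse down to the leaves.
        return any(c is node or c.has_descendant(node) for c in self.children)


root = ToyNode()
a = root.add_child(ToyNode("a"))
inner = root.add_child(ToyNode())
b = inner.add_child(ToyNode("b"))
```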

label

Node’s label (modifiable).

leaves_down()

Recursively gets all leaf labels descending from that node. If this is a terminal node, returns its label in a one-item list.

leaves_up()

Recursively gets all leaf labels contained on the other side of the tree (that is, all leaves except those descending from this node). If this is the root node, returns an empty list.
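
The two traversals can be sketched with a toy node class. ToyNode is hypothetical and is not EggLib's Node; the order in which leaves are returned is a choice of this sketch.

```python
class ToyNode:
    """Minimal stand-in for a tree node (illustration only)."""

    def __init__(self, label=None):
        self.label = label
        self.parent = None
        self.children = []

    def add_child(self, child):
        child.parent = self
        self.children.append(child)
        return child

    def leaves_down(self):
        # A terminal node returns its own label in a one-item list.
        if not self.children:
            return [self.label]
        return [leaf for c in self.children for leaf in c.leaves_down()]

    def leaves_up(self):
        # Climb to the root, then collect all leaves while skipping the
        # subtree that descends from this node.
        root = self
        while root.parent is not None:
            root = root.parent

        def collect(node):
            if node is self:
                return []
            if not node.children:
                return [node.label]
            return [leaf for c in node.children for leaf in collect(c)]

        return collect(root)


# Tree ((a,b),c)
root = ToyNode()
inner = root.add_child(ToyNode())
a = inner.add_child(ToyNode("a"))
b = inner.add_child(ToyNode("b"))
c = root.add_child(ToyNode("c"))
```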

newick(labels=True, brlens=True)

Formats the node and the subtree descending from it as a newick string. If labels is False, omit internal branch labels. If brlens is False, omit branch lengths.
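
The effect of the labels and brlens options can be sketched with a toy node class. ToyNode is hypothetical and is not EggLib's Node; storing branch lengths next to each child and ending the string with a semicolon are assumptions of this sketch.

```python
class ToyNode:
    """Minimal stand-in for a tree node (illustration only)."""

    def __init__(self, label=None):
        self.label = label
        self.children = []  # list of (child, branch length) pairs

    def add_child(self, child, brlen=None):
        self.children.append((child, brlen))
        return child

    def newick(self, labels=True, brlens=True):
        return self._format(labels, brlens) + ";"

    def _format(self, labels, brlens):
        if not self.children:
            return str(self.label)  # leaf labels are always written
        parts = []
        for child, brlen in self.children:
            text = child._format(labels, brlens)
            if brlens and brlen is not None:
                text += ":" + str(brlen)
            parts.append(text)
        text = "(" + ",".join(parts) + ")"
        if labels and self.label is not None:
            text += str(self.label)  # internal label, e.g. a support value
        return text


root = ToyNode()
inner = root.add_child(ToyNode(95), 0.1)
inner.add_child(ToyNode("a"), 0.2)
inner.add_child(ToyNode("b"), 0.3)
root.add_child(ToyNode("c"), 0.5)
full = root.newick()
```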

num_children

Number of children connected to this node.

parent

Reference to the parent of the current node (None if the node has no parent).

parent_branch

Length of the branch to this node’s parent. Non-specified branch lengths are represented by None. An exception is thrown if this node has no parent. This attribute can be modified.

set_branch_to(child, brlen)

Set the length of the branch to one of this node’s children.

Parameters:
  • child – node whose branch should be resized. It can be represented by a direct reference (as a Node instance), or by its index in the children list. In the latter case, ensure that the index is currently valid.
  • brlen – new branch length (it is allowed to pass None).
siblings()

List of other children of this node’s parent. It is required that this node has a parent.