Generic tools

Discretize

class egglib.tools.Discretize(*args, **kwargs)

This class discretizes a continuous distribution based on a given number of samples with an arbitrary number of dimensions.

The default constructor builds an empty instance which can process data using process(). It is possible to process data directly at object construction time by passing arguments to the constructor. In that case, arguments must match the syntax of the process() method.

New in version 3.0.0.

bounds

Bounds of the final distribution. By default (if no data have been processed), returns None. If data have been processed, returns a list of length matching ndim containing (min, max) tuples giving the extreme values for each dimension. If bounds have been set, returns a copy of the passed value regardless of whether data have been processed or not.

get(*idx)

Get a frequency value from discretized data. Positional arguments should be the category indexes (one index per dimension). Passing invalid indexes or an invalid number of index values, or calling this method on an instance that does not contain data, results in a ValueError. Indexes can be negative (to count from the end). Slicing is not supported.

left(dim)

Get the list of category left points (minimum of each category) for a given dimension. Raises a ValueError if no data are present in the instance or if the index is out of bounds.

mget(*idx)

Get marginal values. This method is equivalent to get() except that it can integrate values over any dimension. To obtain marginal sums for a given set of dimensions (one or more), replace all other indexes by None. This method requires, as get() does, one argument per dimension. If arguments are provided, they must be proper indexes. It is possible to replace any index by None, including several at the same time, all of them (equivalent to ndata) and none of them (equivalent to get()).
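The marginal sums described for mget() can be illustrated without EggLib. The sketch below is not the library's implementation: the function name marginal and the dictionary layout of the counts are assumptions made for this example only.

```python
from itertools import product

def marginal(freq, shape, idx):
    """Sum a table of counts over every dimension whose index is None.

    freq maps full index tuples to counts (hypothetical layout, chosen for
    this sketch only); shape gives the number of categories per dimension.
    """
    if len(idx) != len(shape):
        raise ValueError("one index (or None) is required per dimension")
    # A fixed index selects one category; None ranges over all of them.
    ranges = [range(n) if i is None else [i] for i, n in zip(idx, shape)]
    return sum(freq.get(key, 0) for key in product(*ranges))

# A 2 x 3 table of counts (6 items in total).
shape = (2, 3)
freq = {(0, 0): 1, (0, 1): 2, (0, 2): 0,
        (1, 0): 1, (1, 1): 0, (1, 2): 2}

print(marginal(freq, shape, (0, None)))     # first category of dimension 0
print(marginal(freq, shape, (None, None)))  # grand total, like ndata
print(marginal(freq, shape, (1, 2)))        # single cell, like get()
```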

mid(dim)

Get the list of category midpoints for a given dimension. Raises a ValueError if no data are present in the instance or if the index is out of bounds.

ncat

Number of categories of the processed data set. The value is None if no data set has been processed. Otherwise the value is necessarily at least 1.

ndata

Number of data set items of the processed data set. The value is None if no data set has been processed. If out of bounds values were allowed, and did actually occur, this value will be the number of data set items that were included and used.

ndim

Number of dimensions of the processed data set. The value is None if no data set has been processed.

nskipped

Number of data set items that were skipped because they were out of bounds (only if non-default bounds were used). The value is None if no data set has been processed.

process(data, ncat, bounds=None, allow_oob=False)

Process a data set and generate a discretized distribution.

Parameters:
  • data – a sequence of fixed-length sequences of numeric values. Typically, one will pass a list of d-length lists or tuples containing floats (but integers are also supported), where d is the number of dimensions of the distribution. It is required that all items of data have the same length, and that this length is at least 1. The number of dimensions is read from the first item of data.
  • ncat – number of categories used for discretization. The value can be a single integer or a sequence of integers whose length matches the number of dimensions of the data set. If an integer is passed, it must be a strictly positive integer and it will be used as the number of categories for all dimensions. Otherwise, each value gives the number of categories for the corresponding dimension. There is no default; this argument is required.
  • bounds – bounds of the distribution given as minimum and maximum values for all dimensions. The positions of category limits are determined by these bounds. The value can either be (1) None, (2) a sequence of two integers, or (3) a sequence of sequences of two integers. In case (1), the actual minimum and maximum values of all dimensions are used as distribution bounds. In case (2), the values are used as minimum and maximum (respectively) for all dimensions. In case (3), the length of the sequence must be equal to the number of dimensions. Each item is used to define the minimum and maximum (respectively) for the corresponding dimension. In addition, it is possible to replace any item of this sequence by None to specify that the actual minimum and maximum values must be used as distribution bounds for the corresponding dimension. By default, the actual extreme values of the data are used. In that case, it is guaranteed that there will be no out of bounds values.
  • allow_oob – a boolean indicating what to do if a value is out of bounds for any of the dimensions. This can happen only if the argument bounds has been set to a non-default value. If out of bounds values are allowed, they will be ignored and a counter will be incremented. Otherwise, a ValueError will be raised on the first occurrence of an out of bound value.
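To make the roles of ncat, bounds and allow_oob concrete, here is a minimal one-dimensional sketch. It does not use EggLib; the function name discretize_1d is hypothetical, and equal-width categories are an assumption of this example rather than something stated above.

```python
def discretize_1d(values, ncat, bounds=None, allow_oob=False):
    """Count values in ncat equal-width categories (one dimension only)."""
    if ncat < 1:
        raise ValueError("ncat must be strictly positive")
    # Default bounds are the actual extreme values of the data.
    lo, hi = (min(values), max(values)) if bounds is None else bounds
    width = (hi - lo) / float(ncat)
    counts = [0] * ncat
    skipped = 0
    for x in values:
        if x < lo or x > hi:
            if not allow_oob:
                raise ValueError("out of bounds value: %r" % (x,))
            skipped += 1  # ignored, but counted
            continue
        if x == hi:
            i = ncat - 1  # the maximum belongs to the last category
        else:
            i = min(int((x - lo) / width), ncat - 1)
        counts[i] += 1
    left = [lo + i * width for i in range(ncat)]
    return counts, left, skipped

data = [0.0, 0.1, 0.5, 0.9, 1.0]
print(discretize_1d(data, 2))
print(discretize_1d(data, 2, bounds=(0, 0.5), allow_oob=True))
```

With the default bounds no value can be out of bounds; with the narrower bounds of the second call, the two largest values are skipped and counted.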
reset()

Clear all data and reset the object to default settings.

right(dim)

Get the list of category right points (maximum of each category) for a given dimension. Raises a ValueError if no data are present in the instance or if the index is out of bounds.

Random

class egglib.tools.Random(seed=None)

Pseudo-random number generator.

This class implements the Mersenne Twister algorithm for pseudo-random number generation. It is based on work by Makoto Matsumoto and Takuji Nishimura and Jasper Bedaux for the core generator, and the Random class of Egglib up to 2.2 for conversion to other laws than uniform.

Note that different instances of the class have independent chains of pseudo-random numbers. If several instances have the same seed, they will generate the exact same chain of pseudo-random numbers. This also applies when the default constructor is used and the instances are created within the same second.

All non-uniform distribution laws generators are based either on the integer_32bit() or the standard (half-open, 32 bit) uniform() methods.

Parameters:seed – The constructor accepts an optional integer value to seed the pseudo-random number generator sequence. By default, the current system clock value will be used, which means that all instances created within the same second will generate strictly identical sequences. Favor large, high-complexity seeds. When using different instances of this class in a program, or different instances of the same program launched simultaneously, ensure they are all seeded using different seeds.
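The effect of seeding can be demonstrated with Python's standard random module, which is also a Mersenne Twister. It is used here only as a stand-in: the numbers it produces are not expected to match those of egglib.tools.Random.

```python
import random

# Stand-in generators (not EggLib's Random class).
a = random.Random(12345)
b = random.Random(12345)
c = random.Random(67890)

chain_a = [a.random() for _ in range(5)]
chain_b = [b.random() for _ in range(5)]
chain_c = [c.random() for _ in range(5)]

print(chain_a == chain_b)  # same seed: identical chains
print(chain_a == chain_c)  # different seeds: different chains
```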
bernoulli(p)

Draw a boolean with given probability.

Parameters:p – probability of returning True.
Returns:A boolean.
binomial(n, p)

Draw a value from a binomial distribution.

Parameters:
  • n – Number of tests (>=0).
  • p – Test probability (>=0 and <=1).
Returns:

An int (number of successes).

boolean()

Draw a boolean with equal probabilities (p = 0.5).

Returns:A boolean.
exponential(expectation)

Draw a value from an exponential distribution.

Parameters:expectation – Distribution’s mean (equal to 1/\lambda , if \lambda is the rate parameter). Required to be >0.
Returns:A long.
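The relation between the expectation and the rate parameter can be checked with the textbook inverse-transform method. This is a sketch under the assumption that a uniform draw in [0, 1) is available; it is not necessarily how EggLib generates the value, and Python's random module stands in for the generator.

```python
import math
import random

def exponential(expectation, u):
    """Map a uniform draw u in [0, 1) to an exponential variate."""
    if expectation <= 0:
        raise ValueError("expectation must be > 0")
    # 1 - u lies in (0, 1], so the logarithm is always defined.
    return -expectation * math.log(1.0 - u)

rng = random.Random(1)  # stand-in generator
draws = [exponential(2.0, rng.random()) for _ in range(100000)]
print(sum(draws) / len(draws))  # close to the requested expectation, 2.0
```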
geometric(p)

Draw a value from a geometric distribution.

Parameters:p – Geometric law parameter (>0 and <=1).
Returns:A positive int.
get_seed()

Get the seed value (the value used at object creation, or the value used to reset the instance with set_seed()).

Returns:A long.
integer(n)

Draw an integer from a uniform distribution.

Parameters:n – Number of possible values. Note that this number itself is excluded and will never be returned. Required to be a strictly positive integer.
Returns:An int in range [0, n-1].
integer_32bit()

Generate a 32-bit random integer.

Returns:A long in the interval [0, 4294967295] (that is, in the interval [0, 2^32-1]).
normal()

Draw a value from the normal distribution with expectation 0 and variance 1. The expression rand.normal() * sd + m can be used to rescale the drawn value to a normal distribution with expectation m and standard deviation sd.

Returns:A float.
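The rescaling expression can be tried with any standard normal generator. In the sketch below, random.gauss(0, 1) from the Python standard library plays the role of rand.normal(); the values of m and sd are arbitrary.

```python
import random

rng = random.Random(42)  # stand-in for an egglib.tools.Random instance
m, sd = 10.0, 3.0

# Same expression as above: standard normal draw * sd + m.
sample = [rng.gauss(0.0, 1.0) * sd + m for _ in range(100000)]

mean = sum(sample) / len(sample)
var = sum((x - m) ** 2 for x in sample) / len(sample)
print(mean, var ** 0.5)  # close to m and sd
```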
poisson(p)

Draw a value from a Poisson distribution.

Parameters:p – Poisson distribution parameter (usually noted \lambda). Required to be >0.
Returns:A positive int.
set_seed(seed)

Reset the instance by providing a seed value. See the comments on the seed value in the class description.

uniform()

Generate a float in the half-open interval [0,1) with default 32-bit precision. The value 1 is not included.

uniform_53bit()

Generate a float in the half-open interval [0,1) with increased 53-bit precision. The value 1 is not included.

The increased precision increases the number of possible values (2^53 = 9007199254740992 instead of 2^32 = 4294967296). This comes with the cost of increased computing time.

uniform_closed()

Generate a float in the closed interval [0,1] with default 32-bit precision. Both limits are included.

uniform_open()

Generate a float in the open interval (0,1) with default 32-bit precision. Both limits are excluded.
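The four uniform variants differ only in how random 32-bit integers are mapped to floats. The sketch below uses the conversions of the reference Mersenne Twister code (genrand_real1, genrand_real2, genrand_real3 and genrand_res53); that EggLib uses exactly these conversions is an assumption of this example.

```python
TWO_32 = 4294967296.0        # 2^32
TWO_53 = 9007199254740992.0  # 2^53
MAX = 4294967295             # largest 32-bit integer

def uniform(x):
    """Half-open interval [0, 1)."""
    return x / TWO_32

def uniform_closed(x):
    """Closed interval [0, 1]."""
    return x / (TWO_32 - 1.0)

def uniform_open(x):
    """Open interval (0, 1)."""
    return (x + 0.5) / TWO_32

def uniform_53bit(x1, x2):
    """Half-open interval [0, 1) built from two 32-bit integers."""
    # Keep the top 27 bits of x1 and the top 26 bits of x2 (53 bits in all).
    a, b = x1 >> 5, x2 >> 6
    return (a * 67108864.0 + b) / TWO_53

for x in (0, MAX):
    print(uniform(x), uniform_closed(x), uniform_open(x), uniform_53bit(x, x))
```

The 53-bit variant consumes two 32-bit integers per float, which is consistent with the increased computing time mentioned above.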

ReadingFrame

class egglib.tools.ReadingFrame(frame=None)

Handles reading frame positions. The reading frame positions can be loaded as constructor argument or using the method process(). By default, builds an instance with no exons. If the argument is specified, it must be identical to the argument to the method process().

Changed in version 3.0.0: Previously, bases from truncated codons were discarded; they are now included as part of partial codons. Functionality extended.

codon_bases(codon)

Give the position of the three bases of a given codon. One or two positions (but never the middle one alone) will be None if the codon is truncated (beginning/end of an exon without coverage of the previous/next one).

Parameters:codon – any codon index.
Returns:A tuple with the three base positions, potentially containing one or two None, or, instead of the tuple, None if the codon index is out of range.
codon_index(base)

Find the codon in which a given base falls.

Parameters:base – any base index.
Returns:The index of the corresponding codon, or None if the base does not fall in any codon.
codon_position(base)

Tell if the given base is the 1st, 2nd or 3rd position of the codon in which it falls.

Parameters:base – any base index.
Returns:The index of the base in the codon (0, 1 or 2), or None if the base does not fall in any codon.
exon_index(base)

Find the exon in which a given base falls.

Parameters:base – any base index.
Returns:The index of the corresponding exon, or None if the base does not fall in any exon.
iter_codons(skip_partial=False)

This iterator returns (first, second, third) tuples of the positions of the three bases of each codon. If skip_partial is False, partial codons (containing one or two None for bases that are in non-covered exons) are included.

Parameters:skip_partial – tells if codons containing one or two non-represented bases should be skipped.
iter_exon_bounds()

This iterator returns (start, stop) tuples of the positions of the limits of each exon.

num_codons

Number of codons (including truncated codons).

num_exon_bases

Number of bases in exons.

num_exons

Number of exons.

num_full_codons

Number of full (non-truncated) codons.

num_needed_bases

Number of bases needed for a sequence to apply this reading frame. In practice, the value is equal to the end of the last exon plus one, or zero if there is no exon. If the reading frame is used with a shorter sequence, it can lead to errors.

num_tot_bases

Total number of bases (starting from the start of the first exon up to the end of the last one).

process(frame)

Load a reading frame. All previously loaded data are discarded.

Parameters:frame

the reading frame specification must be a sequence of (start, stop[, codon_start]) pairs or triplets where start and stop give the limits of an exon, such that sequence[start:stop] returns the exon sequence, and codon_start, if specified, can be:

  • 1 if the first position of the exon is the first position of a codon (e.g. ATG ATG),
  • 2 if the first position of the segment is the second position of a codon (e.g. TG ATG),
  • 3 if the first position of the segment is the third position of a codon (e.g. G ATG),
  • None if the reading frame is continuing the previous exon.

If codon_start of the first segment is None, 1 will be assumed. If codon_start of any non-first segment is not None, the reading frame is supposed to be interrupted. This means that if any codon was not completed at the end of the previous exon, it will remain incomplete.
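The rules above can be followed step by step in a small sketch that assigns a codon index and a codon position to each exon base. This is an illustration of the specification, not the ReadingFrame implementation; the function name codon_map is hypothetical and positions in codons are 0-based, as for codon_position().

```python
def codon_map(frame):
    """Map each exon base to a (codon index, position in codon) pair.

    frame is a sequence of (start, stop) or (start, stop, codon_start)
    items, with the same meaning as for process().
    """
    mapping = {}
    codon, pos = 0, 0
    for n, exon in enumerate(frame):
        start, stop = exon[0], exon[1]
        codon_start = exon[2] if len(exon) > 2 else None
        if n == 0 and codon_start is None:
            codon_start = 1  # default for the first segment
        if codon_start is not None:
            # Frame interrupted: a codon left unfinished stays incomplete.
            if pos != 0:
                codon += 1
            pos = codon_start - 1
        for base in range(start, stop):
            mapping[base] = (codon, pos)
            pos += 1
            if pos == 3:
                codon, pos = codon + 1, 0
    return mapping

# The second exon continues the codon started at base 3.
continued = codon_map([(0, 4, 1), (10, 15)])
print(continued[3], continued[10], continued[11])

# Here the second exon restarts the frame, so base 3 stays alone in its codon.
interrupted = codon_map([(0, 4, 1), (10, 15, 1)])
print(interrupted[3], interrupted[10])
```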

Sequence manipulation tools

IUPAC nomenclature

The nomenclature for ambiguity characters is listed in the following table:

Symbol  Possible values          Complement
A       Adenine                  T
C       Cytosine                 G
G       Guanine                  C
T       Thymine                  A
M       A or C                   K
R       A or G                   Y
W       A or T                   W
S       C or G                   S
Y       C or T                   R
K       G or T                   M
B       C, G or T                V
D       A, G or T                H
H       A, C or T                D
V       A, C or G                B
N       A, C, G or T             N
-       Alignment gap            -
?       Any of A, C, G, T, or -  ?

concat()

egglib.tools.concat(align1, align2, ..., spacer=0, ch='?', group_check=True, no_missing=False, ignore_names=False, dest=None)

Concatenates sequence alignments provided as Align instances passed as arguments to this function. A single Align is produced. All different sequences from all passed alignments are represented in the final alignment. Sequences whose names match are concatenated. In case several sequences have the same name in a given segment, the first one is considered and the others are discarded. In case a sequence is missing for a particular segment, a stretch of non-varying characters is inserted to replace the unknown sequence.

All options (excluding the alignments to be concatenated) must be specified as keyword arguments, otherwise they will be treated as alignments, which may generate an error.

Parameters:
  • align1, align2, ... – Two or more Align instances (their order is used for concatenation). It is not allowed to specify them using the keyword syntax.
  • spacer – Length of unsequenced stretches (represented by non-varying characters) between concatenated alignments. If spacer is a positive integer, the length of all stretches will be identical. spacer can also be an iterable of integers, each specifying the interval between two consecutive alignments (if n alignments are passed, spacer must be of length n-1).
  • ch – Character used for conserved stretches and for missing segments.
  • group_check – If True, an exception will be raised in case of a mismatch between group labels of different sequence segments bearing the same name. Otherwise, the group of the first segment found will be used as group label of the final sequence.
  • no_missing – If True, an exception will be raised in case the list of samples differs between Align instances. Then, the number of samples must always be the same and all samples must always be present (although it is possible that they consist of missing data only). Ignored if ignore_names is True.
  • ignore_names – Don’t consider sample names and concatenate sequences based on their order in the instance. Then the value of the option no_missing is ignored and the number of samples is required to be constant over alignments.
  • dest – An optional Align instance where to place results. This instance is automatically reset, ignoring all data previously loaded. If this argument is not None, the function returns nothing and the passed instance is modified. Allows to recycle the same object in intensive applications.
Returns:

If dest is None, a new Align instance. If dest is not None, this function returns None.
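The name-based concatenation logic can be sketched on plain Python data. The example below represents each alignment as a list of (name, sequence) pairs, which is an assumption of this sketch (it does not use Align instances), and it supports only a single integer spacer. The function name concat_sketch is hypothetical.

```python
def concat_sketch(aligns, spacer=0, ch="?"):
    """Concatenate alignments given as lists of (name, sequence) pairs."""
    # Collect all different names, in order of first appearance.
    names = []
    for aln in aligns:
        for name, _ in aln:
            if name not in names:
                names.append(name)
    result = dict((name, []) for name in names)
    for i, aln in enumerate(aligns):
        length = len(aln[0][1]) if aln else 0
        segment = {}
        for name, seq in aln:
            segment.setdefault(name, seq)  # duplicated name: first one wins
        for name in names:
            if i > 0:
                result[name].append(ch * spacer)
            # Missing sample: fill with a stretch of the ch character.
            result[name].append(segment.get(name, ch * length))
    return [(name, "".join(result[name])) for name in names]

aln1 = [("a", "AC"), ("b", "GT")]
aln2 = [("a", "TT"), ("c", "GG"), ("a", "CC")]
for name, seq in concat_sketch([aln1, aln2], spacer=1):
    print(name, seq)
```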

Note

Outgroup samples are processed like ingroup samples, but they are processed independently. In particular, they may have the same name as ingroup samples but will not be skipped.

New in version 2.0.1: The arguments allowing to customize function’s behaviour.

Changed in version 3.0.0: Major interface change: the alignments are not passed as a list as before, but as a variable-length list of positional arguments, implying that options must be specified as keyword arguments only. Besides, added options no_missing, ignore_names and dest. Renamed option groupCheck as group_check. Removed option strict (now name comparison is always strict).

ungap()

egglib.tools.ungap(align, gap='-', triplets=False)

Generate a new Container instance containing all sequences of the provided alignment with all gaps removed. See documentation for arguments below.

Returns:A Container instance.
egglib.tools.ungap(align, freq, consider_outgroup=False, gap='-', triplets=False)

Generate a new Align instance containing all sequences of the provided alignment but with only those sites for which the frequency of alignment gaps (by default, the - symbol) is less than or equal to freq.

Parameters:
  • freq – maximum gap frequency in a site (if there are more gaps, the site is not included in the returned Align instance). This argument is required (otherwise, a Container is returned; see above). This value is a relative frequency (in the [0, 1] range).
  • consider_outgroup – if True, consider the outgroup when computing the frequency of gaps. The outgroup sequences are always included in the returned Align irrespective of the value of this option.
  • gap – the character representing gaps. It is allowed to pass the allele value as a single-character str, a single-character unicode, or an int.
  • triplets – process codon sites (triplets of consecutive sites) instead of individual sites. A triplet is considered to be missing if at least one of its three bases is a gap. If a codon site has too many missing triplets (that is, at a frequency larger than freq), it is removed completely. If True, the length of the alignment is required to be a multiple of 3.
Returns:

An Align instance.
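The two behaviours of ungap() can be mimicked on a list of equal-length strings. This sketch ignores outgroups and the triplets option, and the function names ungap_all and ungap_sites are used here for illustration only.

```python
def ungap_all(seqs, gap="-"):
    """Remove every gap; sequences may end up with different lengths."""
    return [s.replace(gap, "") for s in seqs]

def ungap_sites(seqs, freq, gap="-"):
    """Keep the sites whose gap frequency is less than or equal to freq."""
    n = float(len(seqs))
    keep = [i for i in range(len(seqs[0]))
            if sum(s[i] == gap for s in seqs) / n <= freq]
    return ["".join(s[i] for i in keep) for s in seqs]

aln = ["A-C",
       "A-G",
       "AT-"]
print(ungap_all(aln))         # no longer an alignment
print(ungap_sites(aln, 0.5))  # second site dropped (2 gaps out of 3)
print(ungap_sites(aln, 0.0))  # only gap-free sites are kept
```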

Changed in version 2.1.0: Added option includeOutgroup.

Changed in version 3.0.0: Option includeOutgroup is renamed, and its default value is changed to False. Merged with previous functions ungap_all() and ungap_triplets(). If a site has a frequency of gaps equal to freq, it is kept (previously it was removed). Added gap option.

rc()

egglib.tools.rc(seq)

Reverse-complement a DNA sequence.

Parameters:seq – input nucleotide sequence, as a str or a SequenceView.
Returns:The reverse-complemented sequence (see details below).

The case of the provided sequence is preserved. Ambiguity characters are complemented as described in the IUPAC nomenclature. Invalid characters raise a ValueError.
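A minimal sketch of reverse-complementation based on the complement column of the IUPAC nomenclature table is given below. It is an illustration only; the function name rc_sketch is hypothetical and it accepts only str input.

```python
# Symbol -> complement, following the IUPAC nomenclature table.
COMPLEMENT = dict(zip("ACGTMRWSYKBDHVN-?", "TGCAKYWSRMVHDBN-?"))

def rc_sketch(seq):
    """Reverse-complement a DNA string, preserving the case of each base."""
    out = []
    for ch in reversed(seq):
        comp = COMPLEMENT.get(ch.upper())
        if comp is None:
            raise ValueError("invalid character: %r" % ch)
        out.append(comp.lower() if ch.islower() else comp)
    return "".join(out)

print(rc_sketch("ATGC"))
print(rc_sketch("aMn-"))
```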

Changed in version 2.0.1: Characters N, - and ? are correctly processed.

Changed in version 2.0.2: Reimplemented (will be faster for large sequences).

Changed in version 3.0.0: The case of the original sequence is preserved.

backalign()

egglib.tools.backalign(nucl, aln, code=1, smart=False, ignore_names=False, ignore_mismatches=False, fix_stop=False)

Align coding sequence based on the corresponding protein alignment.

Parameters:
  • nucl – a Container or Align instance containing coding DNA sequences that should be aligned. Codons containing an IUPAC ambiguity character or ? (see IUPAC nomenclature) are translated as X. All alignment gaps (character -) will be stripped from the sequences.
  • aln – an Align instance containing an alignment of the protein sequences encoded by the coding sequences provided as value for the nucl argument. If there is an outgroup in nucl, it must also be present in aln. Group labels of aln are not taken into account.
  • code – genetic code identifier (see Available genetic codes). Required to be an integer among the valid values. The default value is the standard genetic code.
  • smart – “smart” translation.
  • ignore_names – if True, ignore names for matching sequences in the protein alignment to coding sequences. Sequences will be matched using their rank and the names in the returned alignment will be taken from nucl.
  • fix_stop – if True, support a single trailing stop codon in coding sequences not represented by a * in the provided protein alignment (such as if the final stop codons have been stripped during alignment). If found, this stop codon will be flushed as left as possible (immediately after the last non-gap character) in the returned coding alignment.
  • ignore_mismatches – if True, do not generate any exception if a predicted protein does not match the provided protein sequence (if the lengths differ, an exception is always raised).
Returns:

An Align instance containing aligned coding DNA sequences (including the outgroup).

If a mismatch is detected between a protein from aln and the corresponding prediction from nucl, an instance of BackalignError (a subclass of exceptions.ValueError) is raised. The attribute BackalignError.alignment can be used to help identify the reason of the error. Mismatches (but not differences of length) can be ignored with the option ignore_mismatches.
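The core idea of back-alignment is that each gap of the protein alignment becomes a three-base gap in the coding sequence. The sketch below shows this for one sequence. It is a simplification: it checks lengths only, does not translate the codons to detect mismatches, and the function name backalign_sketch is hypothetical.

```python
def backalign_sketch(cds, aligned_protein):
    """Insert '---' in a coding sequence wherever the protein has a gap."""
    codons = [cds[i:i + 3] for i in range(0, len(cds), 3)]
    n_residues = sum(1 for aa in aligned_protein if aa != "-")
    if n_residues != len(codons):
        raise ValueError("protein and coding sequence lengths differ")
    it = iter(codons)
    # Each residue consumes the next codon; each gap yields three gaps.
    return "".join("---" if aa == "-" else next(it) for aa in aligned_protein)

print(backalign_sketch("ATGAAATGG", "M-KW"))
```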

class egglib.tools.BackalignError(name, fnuc, faln, i_nuc, i_aln, ls_aln, ls_nuc, translator)

Bases: exceptions.ValueError

Subclass of ValueError used to report errors occurring during the use of backalign() because of mismatches between the provided alignment and predicted proteins.

alignment

String representing the alignment of the provided and predicted proteins, incorporating the following codes in the middle line:

  • |: match.
  • #: mismatch.
  • ~: one protein is shorter than the other (the lengths differ).
name

Name of sequence for which the error occurred.

compare()

egglib.tools.compare(seq1, seq2)

Compare two sequences. The comparison supports ambiguity characters (see IUPAC nomenclature) such that characters with partially overlapping sets of possible values are not treated as different. For example, A and M are not treated as different, nor are M and R. Only IUPAC characters may be supplied.

Parameters:
  • seq1 – a DNA sequence as a str or a SequenceView.
  • seq2 – another DNA sequence as a str or a SequenceView.
Returns:

True if sequences have the same length and they either are identical or differ only by overlapping IUPAC characters, False otherwise.
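The overlap rule can be written with sets of possible values taken from the IUPAC nomenclature table. This is a sketch, not EggLib's code: the function name compare_sketch is hypothetical, and converting the input to upper case is an assumption of this example.

```python
# Symbol -> possible values, following the IUPAC nomenclature table.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "M": "AC", "R": "AG", "W": "AT", "S": "CG", "Y": "CT", "K": "GT",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
    "-": "-", "?": "ACGT-",
}

def compare_sketch(seq1, seq2):
    """True if both sequences have the same length and compatible sites."""
    if len(seq1) != len(seq2):
        return False
    for a, b in zip(seq1.upper(), seq2.upper()):
        # Two characters are compatible if they share a possible value.
        if not set(IUPAC[a]) & set(IUPAC[b]):
            return False
    return True

print(compare_sketch("A", "M"))  # A is one of the values of M
print(compare_sketch("M", "R"))  # M and R share A
print(compare_sketch("C", "R"))  # no shared value
```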

regex()

egglib.tools.regex(query, both_strands=False)

Turn a DNA sequence into a regular expression. The input sequence should contain IUPAC characters only (see IUPAC nomenclature). Ambiguity characters will be treated as follows: an ambiguity will match either (1) itself, (2) one of the characters it defines, or (3) one of the ambiguity characters that define a subset of the characters it defines. For example, an M in the query will match an A, a C or an M, and a D will match all the following: A, G, T, R, W, K, and D. The fully degenerate N matches all four bases and all intermediate ambiguity characters and, finally, ? matches all allowed characters (including alignment gaps). - only matches itself.

Result of this function can be used with the module re to locate occurrences of a motif or the position of a sequence as in:

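import re
import egglib
# query and subject are DNA sequences provided by the caller
regex = egglib.tools.regex(query)
for hit in re.finditer(regex, subject):
    print("%d %d %s" % (hit.start(), hit.end(), hit.group(0)))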

The returned regular expression uses upper-case characters, regardless of the case of input characters. To perform a case-insensitive search, use the re.IGNORECASE flag of the re module in regular expression searches (as in re.search(egglib.tools.regex(query), subject, re.IGNORECASE)). Note that the regular expression is contained in one group if both_strands is False, and in two groups otherwise (one for each strand).

Parameters:
  • query – a str containing IUPAC characters or a SequenceView.
  • both_strands – look for the query on both forward and reverse strands (by default, only on forward strand).
Returns:

A regular expression as a str expanding ambiguity characters to all compatible characters.

New in version 3.0.0.
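The matching rule described for regex() amounts to a subset test: a query character matches every symbol whose set of possible values is contained in its own. The sketch below builds a forward-strand pattern on that basis. It is not EggLib's implementation, the function name regex_sketch is hypothetical, and the exact layout of the returned string is an assumption.

```python
import re

# Symbol -> possible values, following the IUPAC nomenclature table.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "M": "AC", "R": "AG", "W": "AT", "S": "CG", "Y": "CT", "K": "GT",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
    "-": "-", "?": "ACGT-",
}

def regex_sketch(query):
    """Build a one-group pattern expanding ambiguity characters."""
    parts = []
    for ch in query.upper():
        allowed = set(IUPAC[ch])
        # Keep every symbol whose set of values is contained in the query's.
        chars = sorted(c for c, vals in IUPAC.items() if set(vals) <= allowed)
        parts.append("[" + "".join(re.escape(c) for c in chars) + "]")
    return "(" + "".join(parts) + ")"

pattern = regex_sketch("AMD")
print(pattern)
for subject in ("ACT", "AMK", "ANT"):
    print(subject, bool(re.fullmatch(pattern, subject)))
```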

motif_iter()

egglib.tools.motif_iter(subject, query, mismatches=0, both_strands=False, case_independent=True, only_base=True)

Return an iterator over hits of the provided query in the subject sequence. The query should only contain bases (including IUPAC ambiguity characters as listed in IUPAC nomenclature, but excluding - and ?). Ambiguity characters are treated as described in regex() (the bottom line is that ambiguities in the query can only match identical or less degenerate ambiguities in the subject). Mismatches are allowed, and the iterator will first yield identical hits (if they exist), then hits with one mismatch, then hits with two mismatches, and so on.

Parameters:
  • subject – a DNA sequence as str. The sequence may contain ambiguity characters. In principle, the subject sequence should contain IUPAC characters only, but this is not strictly enforced (non-IUPAC characters will always be treated as different from IUPAC characters).
  • query – a DNA sequence as str, also containing IUPAC characters only.
  • mismatches – maximum number of mismatches.
  • both_strands – look for the query on both forward and reverse strands (by default, only on forward strand).
  • case_independent – if True, perform case-independent searches.
  • only_base – if True, never create a hit if the putative hit sequence contains an alignment gap (-) or a ? character, irrespective of the number of allowed mismatches.
Returns:

An iterator returning, for each hit, a (start, stop, strand, num) tuple where start and stop are the position of the hit such that subject[start:stop] returns the hit sequence, strand is the strand as + or -, and num is the number of mismatches of the hit.
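A brute-force version of the search helps to see what the returned tuples mean. The sketch below scans the forward strand only and omits the strand item; it makes no claim about how motif_iter() is implemented or how efficient it is, and the function name motif_hits is hypothetical.

```python
# Symbol -> possible values, following the IUPAC nomenclature table.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "M": "AC", "R": "AG", "W": "AT", "S": "CG", "Y": "CT", "K": "GT",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
    "-": "-", "?": "ACGT-",
}

def motif_hits(subject, query, mismatches=0):
    """Forward-strand hits as (start, stop, num) tuples, best hits first."""
    hits = []
    for start in range(len(subject) - len(query) + 1):
        window = subject[start:start + len(query)]
        num = 0
        for q, s in zip(query.upper(), window.upper()):
            # A subject character matches if it is identical to, or less
            # degenerate than, the query character. Non-IUPAC subject
            # characters never match.
            if not set(IUPAC.get(s, "!")) <= set(IUPAC[q]):
                num += 1
        if num <= mismatches:
            hits.append((start, start + len(query), num))
    # Stable sort: exact hits first, then one mismatch, and so on.
    return sorted(hits, key=lambda hit: hit[2])

for hit in motif_hits("AACGTTACGAACT", "ACG", mismatches=1):
    print(hit)
```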

Note

If a given query has a hit on both strands at the same position (this only happens if the sequence is a motif equal to its reverse-complement, like AATCGATT), it is guaranteed that only one hit will be returned for that position (not one for each strand). However, the value of the strand flag is not defined.

Warning

This function is designed to be efficient only if the number of mismatches is small (only a few). A combination of a large number of mismatches (more than 2 or 3) and a large query sequence (as short as 10 or 20 bases) will take significant time to complete. More complex problems require the use of genuine pairwise local alignment tools.

Tools based on the genetic code

Available genetic codes

All genetic codes defined by the National Center for Biotechnology Information are supported and can be accessed using codes compatible with the GenBank /trans_table qualifier. The codes are integers.

Identifier Code
1 The Standard Code
2 The Vertebrate Mitochondrial Code
3 The Yeast Mitochondrial Code
4 The Mold, Protozoan, and Coelenterate Mitochondrial Code and the Mycoplasma/Spiroplasma Code
5 The Invertebrate Mitochondrial Code
6 The Ciliate, Dasycladacean and Hexamita Nuclear Code
9 The Echinoderm and Flatworm Mitochondrial Code
10 The Euplotid Nuclear Code
11 The Bacterial, Archaeal and Plant Plastid Code
12 The Alternative Yeast Nuclear Code
13 The Ascidian Mitochondrial Code
14 The Alternative Flatworm Mitochondrial Code
16 Chlorophycean Mitochondrial Code
21 Trematode Mitochondrial Code
22 Scenedesmus obliquus Mitochondrial Code
23 Thraustochytrium Mitochondrial Code
24 Pterobranchia Mitochondrial Code
25 Candidate Division SR1 and Gracilibacteria Code
26 Pachysolen tannophilus Nuclear Code
27 Karyorelict Nuclear
28 Condylostoma Nuclear
29 Mesodinium Nuclear
30 Peritrich Nuclear
31 Blastocrithidia Nuclear

Note that the following code identifiers do not exist: 7, 8, 15, 17, 18, 19, and 20, as well as 0 and values above 31.

Reference: National Center for Biotechnology Information [http://www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi]

Translator

class egglib.tools.Translator(code=1, smart=False, delete_truncated=False)

Class providing methods to translate nucleotide (DNA) sequences to proteins.

Parameters:
  • code – genetic code identifier (see Available genetic codes). Required to be an integer among the valid values. The default value is the standard genetic code.
  • smart – “smart” translation.
  • delete_truncated – if True, codons that are truncated (either because the reading frame is 5’, internally, or 3’ partial, or if the number of bases is not a multiple of three) are skipped when translated. By default, truncated codons are translated as X.

New in version 3.0.0.

delete_truncated

Value of the delete_truncated option. The value can be modified.

translate_align(align, frame=None, allow_alt=False, in_place=False)

Translate an Align instance.

Parameters:
  • align – an Align containing DNA sequences.
  • frame – a ReadingFrame instance providing the exon positions in the correct frame. By default, a non-segmented frame covering all sequences is assumed (in case the provided alignment is the coding region).
  • allow_alt – a boolean telling whether alternative start (initiation) codons should be considered. If True, codons are translated as a methionine (M) if, and only if, they are among the alternative start codons for the considered genetic code and they appear at the first position of the considered sequence (excluding all triplets of gap symbols appearing at the 5’ end of the sequence). With this option, it is required that all sequences start with a valid initiation codon unless the first codon is partial or contains invalid data (in such cases, it is ignored).
  • in_place – place translated sequences into the original Align instance (this discards original data). By default, returns a new instance.
Returns:

By default, a new Align instance containing translated (protein) sequences. If in_place was True, returns None.

translate_codon(first, second, third)

Translate a single codon based on the genetic code defined at construction time.

Parameters:
  • first – first base of the codon as a one-character string.
  • second – second base of the codon as a one-character string.
  • third – third base of the codon as a one-character string.
Returns:

The one-letter amino acid code if the codon can be translated, '-' if the codon is '---', X otherwise (including all cases with invalid nucleotides).
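The return values listed above can be reproduced for the standard genetic code (identifier 1) with a compact lookup table. This sketch does not cover other genetic codes or the smart option, the function name translate_codon_sketch is hypothetical, and accepting lower-case bases is an assumption of this example.

```python
# Standard genetic code in NCBI order: bases are T, C, A, G at each position.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

def translate_codon_sketch(first, second, third):
    """Translate one codon; '-' for a full gap, 'X' if not translatable."""
    codon = (first + second + third).upper()
    if codon == "---":
        return "-"
    try:
        i, j, k = (BASES.index(b) for b in codon)
    except ValueError:
        # Raised for any character that is not T, C, A or G.
        return "X"
    return AMINO[16 * i + 4 * j + k]

for codon in ("ATG", "TAA", "---", "ANG", "A-G"):
    print(codon, translate_codon_sketch(*codon))
```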

translate_container(container, allow_alt=False, in_place=False)

Translate a Container instance.

Parameters:
  • container – a Container containing DNA sequences.
  • allow_alt – a boolean telling whether alternative start (initiation) codons should be considered. If True, codons are translated as a methionine (M) if, and only if, they are among the alternative start codons for the considered genetic code and they appear at the first position of the considered sequence. With this option, it is required that all sequences start with a valid initiation codon unless the first codon is missing, partial, or contains invalid data (in such cases, it is ignored).
  • in_place – place translated sequences into the original Container instance (this discards original data). By default, returns a new instance.
Returns:

By default, a new Container instance containing translated (protein) sequences. If in_place was True, returns None.

translate_sequence(sequence, frame=None, allow_alt=False)

Translate a sequence.

Parameters:
  • sequence – a str, SequenceView or compatible instance containing DNA sequences. All sequence types yielding one-character strings are acceptable.
  • frame – a ReadingFrame instance providing the exon positions in the correct frame. By default, a non-segmented frame covering the whole sequence is assumed (in case the provided sequence is the coding region).
  • allow_alt – a boolean telling whether alternative start (initiation) codons should be considered. If True, codons are translated as a methionine (M) if, and only if, they are among the alternative start codons for the considered genetic code and they appear at the first position of the sequence. With this option, it is required that the sequence starts with a valid initiation codon unless the first codon is missing, partial, or contains invalid data (in such cases, it is ignored).
Returns:

A new str instance containing the translated (protein) sequence.

Note

Character mapping is ignored for this method (sequences must be provided as is, with A, C, G and T and ambiguity characters if relevant; invalid characters cause translation to X in resulting amino acid sequences).
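
The translation rules described above can be illustrated with a short pure-Python sketch. It is not EggLib's implementation: the name translate_sketch is hypothetical, only the standard genetic code is covered, ambiguity characters are not resolved, a trailing incomplete codon is simply dropped, and the requirement that the sequence start with a valid initiation codon is not enforced. The alternative start codons (CTG and TTG) are those listed for the standard genetic code in the description of orf_iter().

```python
from itertools import product

# Standard genetic code, codons enumerated with bases in the order T, C, A, G.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

# Alternative start codons of the standard code.
ALT_STARTS = {"CTG", "TTG"}


def translate_sketch(seq, allow_alt=False):
    """Translate a DNA string codon by codon (hypothetical helper)."""
    seq = seq.upper()
    protein = []
    # Only complete codons are visited; a trailing partial codon is dropped.
    for i in range(0, len(seq) - len(seq) % 3, 3):
        codon = seq[i:i + 3]
        if codon == "---":
            aa = "-"                      # fully missing codon
        elif allow_alt and i == 0 and codon in ALT_STARTS:
            aa = "M"                      # alternative start at first position
        else:
            aa = CODON_TABLE.get(codon, "X")  # X for anything invalid
        protein.append(aa)
    return "".join(protein)


print(translate_sketch("ATGGCC---TAA"))
print(translate_sketch("CTGAAA", allow_alt=True))
```

With allow_alt left to False, the leading CTG of the second sequence is translated as a regular leucine.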

translate()

egglib.tools.translate(seq, frame=None, code=1, smart=False, delete_truncated=False, allow_alt=False, in_place=False)

Translates DNA nucleotide sequences to proteins. See the class Translator for more details. This is a convenience function for translating nucleotide sequences in a single call. For repetitive calls, direct use of Translator can be more efficient.

Parameters:
  • seq – input DNA nucleotide sequences. Accepted types are Align, Container, SequenceView or str (or also types compatible with str).
  • frame – reading frame as a ReadingFrame instance (see translate_align() for details). Not allowed if seq is a Container instance.
  • code – genetic code identifier (see Available genetic codes). Required to be an integer among the valid values. The default value is the standard genetic code.
  • smart – “smart” translation.
  • delete_truncated – skip codons that are truncated (by default, they are retained and translated as X). See Translator for details.
  • allow_alt – a boolean telling whether alternative start (initiation) codons should be considered. If True, codons are translated as a methionine (M) if, and only if, they are among the alternative start codons for the considered genetic code and they appear at the first position of the sequence. With this option, the sequence is required to start with a valid initiation codon unless the first codon is missing, partial, or contains invalid data (in such cases, it is ignored). If seq is an Align, leading gaps are ignored as long as their length is a multiple of 3 (fully missing triplets).
  • in_place – place translated sequences in the provided Align or Container instance, overwriting initial data. Not allowed if seq is not of one of these two types. See translate_align() for details.
Returns:

Protein sequences as an Align or a Container if either of these types has been provided as seq, or a str otherwise. If in_place is True, returns None.

Changed in version 3.0.0: Added options code, frame and the options to define nucleotide values (default values are backward-compatible). Option strip is removed. Functionality is moved to class Translator.

orf_iter()

egglib.tools.orf_iter(sequence, code=1, smart=False, min_length=1, forward_only=False, force_start=True, start_ATG=False, force_stop=True)

Return an iterator over non-segmented open reading frames (ORFs) found in a provided DNA sequence in any of the six possible frames.

Parameters:
  • sequence – a str representing a DNA sequence.
  • code – genetic code identifier (see Available genetic codes). Required to be an integer among the valid values. The default value is the standard genetic code.
  • smart – “smart” translation.
  • min_length – minimum length of returned ORFs. This value must be at least 1. It is understood as the length of the encoded peptide (therefore, selected ORFs will have a length of at least three times this value).
  • forward_only – consider only the three forward frames (do not consider the reverse strand).
  • force_start – if True, all returned ORFs are required to start with a start codon. Otherwise, all non-stop codons are included in ORFs. Note that alternative start codons (CTG and TTG for the standard genetic code) are also supported.
  • start_ATG – consider ATG as the only start codon (ignored if force_start is False).
  • force_stop – if True, all returned ORFs are required to end with a stop codon (this only excludes 3’-partial ORFs, that is, ORFs that end with the end of the provided sequence).
Returns:

An iterator over all detected ORFs. Each ORF is represented by a (start, stop, length, frame) tuple where start is the start position of the ORF and stop the stop position (such that sequence[start:stop] returns the ORF sequence or its reverse complement), length is the ORF length and frame is the reading frame on which it was found: +1, +2, +3 are the frames on the forward strand (starting respectively at the first, second, and third base), and -1, -2, -3 are the frames on the reverse strand (starting respectively at the last, last but one, and last but two base).

New in version 3.0.0: Take over the basic functionality of longest_orf(); turned into an iterator. This new version comes with a better implementation, a changed signature (with a new forward_only option) and returns ORF positions instead of sequences.
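
As an illustration of the coordinate convention described for the return value, the sketch below recovers the coding sequence from a (start, stop, length, frame) tuple. The helpers reverse_complement and orf_sequence are hypothetical (they are not part of EggLib), and the two tuples are written by hand for the example rather than produced by orf_iter(); the length field is not used by the helper.

```python
# Complement table for unambiguous bases (upper and lower case).
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def orf_sequence(sequence, orf):
    """Extract the ORF in coding orientation from a (start, stop, length, frame) tuple."""
    start, stop, _length, frame = orf
    fragment = sequence[start:stop]
    # On the reverse strand the slice holds the reverse complement of the ORF.
    return fragment if frame > 0 else reverse_complement(fragment)


forward = "CCATGAAATAGGG"      # ORF ATGAAATAG starts at the third base
reverse = "GGCTATTTCATC"       # same ORF, carried by the reverse strand

print(orf_sequence(forward, (2, 11, 9, +3)))
print(orf_sequence(reverse, (2, 11, 9, -2)))
```

Both calls print the same coding sequence, which is the point of the convention: the slice bounds always refer to the provided sequence, and only the sign of frame tells whether the slice must be reverse-complemented.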

longest_orf()

egglib.tools.longest_orf(*args, **kwargs)

Detect the longest open reading frame.

Arguments are identical to orf_iter().

Returns:

A (start, stop, length, frame) tuple (see the return value of orf_iter() for details), or None if no open reading frame fits the requirements (typically, the minimum length).

An exceptions.ValueError is raised if two or more open reading frames have the largest length.

Changed in version 2.0.1: Added options; return the trailing stop codon when appropriate.

Changed in version 2.1.0: Added option mini. The behaviour of previous versions is reproduced by setting mini to 0.

Changed in version 3.0.0: Most of the functionality is moved to orf_iter() with an updated interface.
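
The selection rule described above (return the longest ORF, None if there is none, and raise a ValueError in case of a tie) can be sketched over any iterable of (start, stop, length, frame) tuples. The function longest_orf_sketch is a hypothetical illustration, not EggLib's code, and the tuples in the usage lines are made up.

```python
def longest_orf_sketch(orfs):
    """Pick the longest ORF from (start, stop, length, frame) tuples."""
    best = None
    tie = False
    for orf in orfs:
        length = orf[2]
        if best is None or length > best[2]:
            best, tie = orf, False     # new strict maximum clears any tie
        elif length == best[2]:
            tie = True                 # another ORF shares the current maximum
    if tie:
        raise ValueError("two or more ORFs share the largest length")
    return best


print(longest_orf_sketch([(0, 9, 9, 1), (3, 18, 15, 2)]))
print(longest_orf_sketch([]))
```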

Stop codon detection functions

egglib.tools.trailing_stops(align, frame=None, action=0, code=None, include_outgroup=False, gap='-', replacement='???')

Detect (and optionally fix) stop codons at the end of the sequences. The last three non-gap data of a sequence must form a single codon in the specified frame, meaning that if the final codon is interrupted by a gap or shifted out of frame, it will not be detected as a stop codon. If the last codon is truncated, it will also be considered as non gap. If the last codon falls entirely within a gap, the previous one is considered (and so on).

Parameters:
  • align – an Align containing aligned coding DNA sequences.
  • frame – a ReadingFrame instance providing the exon positions in the correct frame. By default, a non-segmented frame covering all sequences is assumed (in case the provided alignment is the coding region).
  • action

    an integer specifying what should be done if a stop codon is found at the end of a given sequence. Possible actions are listed in the following table:

    Code Action
    0 Nothing (just count them).
    1 Replace them by gaps, and delete the final three positions if they are made by gaps only.
    2 Replace them by the value given as replacement.

    Note that using action=1 is not strictly equivalent to using action=2, replacement='---' because the former deletes the last three positions of the alignment if needed while the latter does not.

  • code – genetic code identifier (see Available genetic codes). Required to be an integer among the valid values. The default value is the standard genetic code.
  • smart – “smart” translation.
  • include_outgroup – if True, process both ingroup and outgroup samples; if False, process only ingroup.
  • gap – the character representing gaps. It is allowed to pass the allele value as a single-character str, a single-character unicode, or an int.
  • replacement – if action is set to 2, provide the three values that should be used to replace stop codons. This value must be a three-character str or a three-item sequence of integers. By default, replace final stop codons by uncharacterized bases.
Returns:

The number of sequences that had a trailing stop codon among the considered sequences (including outgroup if include_outgroup is True).
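
The three actions can be illustrated on a plain list of aligned strings. This is only a sketch under stated assumptions, not EggLib's implementation: trailing_stops_sketch is a hypothetical name, the stop codons are those of the standard genetic code, the frame is non-segmented and starts at the first position, and the alignment length must be a multiple of 3 (truncated final codons are not handled).

```python
STOPS = {"TAA", "TAG", "TGA"}  # stop codons of the standard genetic code


def trailing_stops_sketch(seqs, action=0, gap="-", replacement="???"):
    """Count (and optionally fix) trailing stop codons in aligned strings."""
    seqs = list(seqs)
    if any(len(s) % 3 for s in seqs):
        raise ValueError("this sketch requires a length that is a multiple of 3")
    count = 0
    for idx, seq in enumerate(seqs):
        # Walk codons backwards, skipping those made of gaps only.
        for pos in range(len(seq) - 3, -1, -3):
            codon = seq[pos:pos + 3]
            if codon == gap * 3:
                continue
            if codon.upper() in STOPS:
                count += 1
                if action == 1:
                    seqs[idx] = seq[:pos] + gap * 3 + seq[pos + 3:]
                elif action == 2:
                    seqs[idx] = seq[:pos] + replacement + seq[pos + 3:]
            break  # only the last codon holding data is examined
    # action=1 also drops the final three columns if they hold gaps only.
    if action == 1 and seqs and all(s.endswith(gap * 3) for s in seqs):
        seqs = [s[:-3] for s in seqs]
    return count, seqs


aln = ["ATGAAATAA", "ATGAAA---", "ATGTAG---"]
print(trailing_stops_sketch(aln, action=0))
print(trailing_stops_sketch(aln, action=1))
print(trailing_stops_sketch(aln, action=2))
```

The last two calls show the difference noted above: action=1 shortens the alignment by three positions once the final columns contain only gaps, whereas action=2 keeps the alignment length unchanged.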

egglib.tools.iter_stops(align, frame=None, code=1, smart=False, include_outgroup=False)

Return an iterator providing the coordinates of all stop codons in the alignment (over all sequences). Only stop codons in the specified frame are detected, excluding all those that are segmented by a gap or are in a shifted frame. Each iteration returns a (sample, position, flag) tuple where sample is the sample index, position is the position of the first base of the stop codon, and flag is True if the sample belongs to the ingroup and False if the sample belongs to the outgroup.

Parameters:
  • align – an Align containing aligned coding DNA sequences. Values should consist of A, C, G and T (and possibly IUPAC ambiguity codes if smart is toggled). Other values are treated as missing data.
  • frame – a ReadingFrame instance providing the exon positions in the correct frame. By default, a non-segmented frame covering all sequences is assumed (in case the provided alignment is the coding region).
  • code – genetic code identifier (see Available genetic codes). Required to be an integer among the valid values. The default value is the standard genetic code.
  • smart – “smart” translation.
  • include_outgroup – if True, process both ingroup and outgroup samples; if False, process only ingroup.
Returns:

An iterator over the (sample, position, flag) tuples corresponding to each stop codon found in the alignment (see above).
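
The shape of the yielded tuples can be illustrated with a pure-Python generator over lists of strings. This is a sketch, not EggLib's implementation: iter_stops_sketch is a hypothetical name, ingroup and outgroup are passed as two separate lists, sample indexes are assumed to restart at 0 within the outgroup, the frame is non-segmented and starts at the first position, and only the stop codons of the standard genetic code are recognized.

```python
STOPS = {"TAA", "TAG", "TGA"}  # stop codons of the standard genetic code


def iter_stops_sketch(ingroup, outgroup=(), include_outgroup=False):
    """Yield (sample, position, flag) for each in-frame stop codon."""
    groups = [(ingroup, True)]
    if include_outgroup:
        groups.append((outgroup, False))
    for seqs, flag in groups:
        for sample, seq in enumerate(seqs):
            # Visit complete codons only; codons containing a gap never match.
            for pos in range(0, len(seq) - 2, 3):
                if seq[pos:pos + 3].upper() in STOPS:
                    yield sample, pos, flag


ingroup = ["ATGTAAGGG", "ATGAAATGA", "ATGT-AAGG"]
outgroup = ["TAGAAAAAA"]

print(list(iter_stops_sketch(ingroup, outgroup, include_outgroup=True)))
# A check in the spirit of has_stop(): is there at least one stop codon?
print(any(True for _ in iter_stops_sketch(ingroup)))
```

The third ingroup sequence contains T-AA, a stop codon segmented by a gap, which is therefore not reported.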

egglib.tools.has_stop(align, frame=None, code=1, smart=False, include_outgroup=False)

Return True if the alignment contains at least one stop codon at any position in any sequence, and False otherwise. Only stop codons in the specified frame are detected, excluding all those that are segmented by a gap or are in a shifted frame.

Parameters:
  • align – an Align containing aligned coding DNA sequences.
  • frame – a ReadingFrame instance providing the exon positions in the correct frame. By default, a non-segmented frame covering all sequences is assumed (in case the provided alignment is the coding region).
  • code – genetic code identifier (see Available genetic codes). Required to be an integer among the valid values. The default value is the standard genetic code.
  • include_outgroup – if True, process both ingroup and outgroup samples; if False, process only ingroup.
Returns:

A boolean.

int2codon()

egglib.tools.int2codon(codon)

Return the three-character string of a codon encoded with a single integer (such as those obtained from CodingDiversity). Return "???" if the encoded codon is out of expected range.