External application tools¶
Application paths¶
Application paths can be set using the following syntax. A
ValueError
is raised if the automatic test fails.
The change is valid for the current session only unless save()
is
used:
egglib.wrappers.paths[app] = path
And application paths are accessed as followed:
egglib.wrappers.paths[app]

egglib.wrappers.paths.
autodetect
(verbose=False)¶ Autoconfigure application paths based on default command names.
Parameters: verbose – if True
, print running information and test results.

egglib.wrappers.path.
load
()¶ Load values of application paths from the configuration file located within the package. All values currently set are discarded.

egglib.wrappers.path.
save
()¶ Save current values of application paths in the configuration file located within the package. This action may require administrator rights. All values currently set will be reloaded at next import of the package.
Phylogeny¶

egglib.wrappers.
phyml
(align, model, include_outgroup=True, export_labels=True, rates=1, boot=0, start_tree='nj', fixed_topology=False, fixed_brlens=False, search='NNI', num_rand_start=5, freq=None, TiTv=4.0, pinv=0.0, alpha=None, use_median=False, free_rates=False, seed=None, verbose=False)¶ Reconstruct maximumlikelihood phylogeny using PhyML.
PhyML is a program performing maximumlikelihood phylogeny estimation using nucleotide or amino acid sequence alignments. Ref.: Guindon S., J.F. Dufayard, V. Lefort, M. Animisova, W. Hordijk, and O. Gascuel. 2010. New algorithms and methods to estimate maximumlikelihood phylogenies: assessing the performance of PhyML 3.0. Syst. Biol. 59: 307321.
Parameters:  align – input sequence alignment as an
Align
instance.  model – substitution model to use (see list below).
 include_ougroup – boolean indicating whether sequences from the outgroup should be used for the phylogeny estimation.
 export_labels – boolean indicating whether the group labels should be included in the names of sequences (such as they will appear in the resulting tree).
 rates – number of discrete categories of evolutionary rate. If different of 1, fits a gamma distribution of rates.
 boot – number of bootstrap repetitions. Values of 1, 2 and 4 activate one the testbased branch support evaluation methods that provide faster alternatives to bootstrap repetitions (1: aLRT statistics, 2: Chi2based parametric tests, 4: Shimodaira and Hasegawalike statistics). A value of 0 provides no branch support at all.
 start_tree – starting topology used by the program. Possible
values are the string
nj
(neighbourjoining tree), the stringpars
(maximumparsimony tree), the stringrand
(random tree, only if search isSPR
), and aTree
instance containing a userprovided topology. In the latter case, the names of leaves of the tree must match the names of the input alignment (without group labels), implying that names cannot be repeated.  fixed_topology – boolean indicating whether the topology
provided in the
Tree
instance passed as start_tree argument should be NOT be improved.  fixed_brlens – boolean indicating whether the branch lengths
provided in the
Tree
instance passed as start_tree argument should be NOT be improved. All branch lengths must be specified. Automatically sets fixed_topology toTrue
.  search – search method. Available modes are:
NNI
(nearest neighbour interchange, fastest),SPR
(subtree pruning and regrafting), andBEST
(best of both methods).  num_rand_start – number of random start trees (only considered
if search is
SPR
and start_tree isrand
).  freq – nucleotide or amino acid frequencies. Possible values
are the string
o
(observed, frequencies measured from the data), the stringm
(estimated by maximum likelihood for nucleotides, or retrieved from the substitution model for amino acids), or a fouritem tuples provided the relative frequencies of A, C, G and T respectively (only for nucleotides). By default, useo
for nucleotides andm
for amino acids.  TiTv – transition/transversion ratio. If
None
, estimated by maximum likelihood. Otherwise, must be a stricly positive value. Ignored if data are not nucleotides or if the model does not support  pinv – proportion of invariable sites. If
None
estimated by maximum likelihood. Otherwise, must be in the range [0, 1].  alpha – gamma shape parameter. If
None
, estimated by maximum likelihood. Otherwise, must be a strictly positive value. Ignored if rates is 1 or if free_rates isTrue
.  use_median – boolean indicating whether the median (instead of
the mean) should be use to report values for rate from the
discretized gamma distribution. Ignored if rates is 1 or if
free_rates is
True
.  free_rates – boolean indicating whether a mixture model should be used for substitution rate categories instead of the discretized gamma. In this case all reates and their frequencies will be estimated. Requires that rates is larger than 1.
 seed – pseudorandom number generator seed. Must be a stricly positive integer, preferably large.
 verbose – boolean indicating whether standard output of PhyML should be displayed.
Returns: A
(tree, stats)
tuple where tree is aTree
instance and stats is adict
containing the following statistics or estimated parameters:lk
– loglikelihood of the returned tree and model.pars
– parsimony score of the returned tree.size
– length of the returned tree.rates
– relative substitution rates, as a list providing values in the following order: A↔C, A↔G, A↔T, C↔G, C↔T, and G↔T (only available if model isGTR
or custom, only for nucleotide sequences).alpha
– gamma shape parameter (only if the number of rate categories is larger than 1 and iffree_rates
wasFalse
).cats
– list of(rate, proportion)
tuples for each discrete rate category (only iffree_rates
wasTrue
, implying that the number of rates was larger than 1).freq
– list of the relative base frequencies, in the following order: A, C, G, and T (only for nucleotide sequences).ti/tv
– transition/transversion ratio (available for the modelsK80
,HKY85
,F84
, andTN93
). For theTN93
model, the resulting value is a pair of transition/transversion ratios, one for purines and one for pyrimidines (in that order).pinv
– proportion of invariable sites (only if the corresponding option was not set to 0).
The choice of the model defines the type of data that are expected. The available models are:
Nucleotides:
Code Full name Rates Base frequencies JC69
Jukes and Cantor 1969 one equal K80
Kimura 1980 two equal F81
Felsenstein 1981 one unequal HKY85
Hasegawa, Kishino & Yano 1985 two unequal F84
Felsenstein 1984 two unequal TN93
Tamura and Nei 1993 three unequal GTR
general time reversible six unequal In addition, custom nucleotide substitution models can be specified. In that case, model must be a sixcharacter strings of numeric characters specifying which of the six (reversable) substitution rates are allowed to vary. The onerate model is specified by the string
000000
, the tworate model (separate transition and transversion rates) is specified by010010
, and the GTR model is specified by012345
. The substitution rates are specified in the following order: A↔C, A↔G, A↔T, C↔G, C↔T, and G↔T.Amino acids:
Code Authors LG
Le & Gascuel (Mol. Biol. Evol. 2008) WAG
Whelan & Goldman (Mol. Biol. Evol. 2001) JTT
Jones, Taylor & Thornton (CABIOS 1992) MtREV
Adachi & Hasegawa (in Computer Science Monographs 1996) Dayhoff
Dayhoff et al. (in Atlas of Protein Sequence and Structure 1978) DCMut
Kosiol & Goldman (Mol. Biol. Evol. 2004) RtREV
Dimmic et al. (J. Mol. Evol. 2002) CpREV
Adachi et al. (J. Mol. Evol. 2000) VT
Muller & Vingron (J. Comput. Biol. 2000) Blosum62
Henikoff & Henikoff (PNAS 1992) MtMam
Cao et al. (J. Mol. Evol. 1998) MtArt
Abascal, Posada & Zardoya (Mol. Biol. Evol. 2007) HIVw
Nickle et al. (PLoS One 2007) HIVb
ibid.
Changed in version 3.0.0: No more default value for model option. Added custom model for nucleotides. Changed SH pseudobootstrap option flag from 3 to 4. quiet function replaced by verbose. Several additional options are added. The syntax for input a user tree is modified. The second item in the returned tuple is a dictionary of statistics.
 align – input sequence alignment as an

egglib.wrappers.
codeml
(align, tree, model, code=1, ncat=None, codon_freq=2, include_outgroup=True, verbose=False, get_files=False, kappa=2.0, fix_kappa=False, omega=0.4)¶ Fit nucleotide substitution models using the CodeML program from the PAML package.
Parameters:  align – an
Align
containing a coding sequence alignment. The number of sequences must be at least 3, the length of the alignment is required to be a multiple of 3, there must be no stop codons (even final stop codons) and there must not be any duplicated sequence name.  tree – a
Tree
providing the phylogenetic relationships between samples. The name of the sequences in theAlign
and in theTree
are required to match. If tree isNone
, a star topology is used. If the tree contains branch length or node labels, they are discounted, except for PAML node tags (#x
and$x
wherex
is an integer) that are allowed both as nodel labels. If one wants to label a terminal branch of the tree, they can add the label at the end of the sample name (with an optional separating white space). The tree must not be rooted (if there is a birfurcation at the base, an error will be caused).  model –
model. The list of model names appears below:
M0
– oneratio model (1 parameter).free
– all branches have a different ratio (1 parameter per branch).nW
– several sets of branches. Requires labelling of branches of the tree (1 parameter per set of branches).M1a
– nearlyneutral model (2 parameters).M2a
– positive selection model (4 parameters).M3
– discrete model. Requires setting ncat (2 * ncat  1 parameters).M4
– frequencies model. Requires setting ncat (ncat  1 parameters).M7
– betadistribution model. Requires setting ncat (2 parameters).M8a
– beta + single ratio, additional ratio fixed to 1. Requires setting ncat (3 parameters).M8
– beta + single ratio. Requires setting ncat (4 parameters).A0
– null branchsite model. Requires labelling of branches of the tree with two different labels (3 parameters).A
– branchsite model with positive selection. Requires labelling of branches of the tree with two different labels (4 parameters).C0
– null model for model C (M2a_rel). Does not require branch labelling (4 parameters).C
– branchsite model. Requires labelling of branches (5 parameters).D
– discrete branchsite model. Requires labelling of branches and requires setting ncat to either 2 or 3 (4 or 6 parameters, respectively).
The number of parameters given for each model concern the dN/dS ratios only. Refer to PAML documentation or the following references for more details and recommendations: Bielawski, J.P. & Z. Yang. 2004. A maximum likelihood method for detecting functional divergence at individual codon sites, with application to gene family evolution. J. Mol. Evol. 59:121132; Yang Z., R. Nielsen, N. Goldman & A.M.K. Pedersen. 2000; Codonsubstitution models for heterogeneous selection pressure at amino acid sites. Genetics 155:431449. Yang, Z., and R. Nielsen. 2002. Codonsubstitution models for detecting molecular adaptation at individual sites along specific lineages. Mol. Biol. Evol. 19:908917; Zhang, J., R. Nielsen & Z. Yang. 2005. Evaluation of an improved branchsite lieklihood method for detecting positive selection at the molecular level. Mol. Biol. Evol. 22:4722479.
 code – genetic code identifier (see Available genetic codes). Required to be an integer among the valid values. The default value is the standard genetic code. Only codes 111 are available.
 ncat – number of dN/dS categories. Only a subset of models require that the number of categories to be specified. See models.
 codon_freq –
an integer specifying the model for codon frequencies. Must be one of:
 0 – equal frequencies.
 1 – codon frequencies from base frequencies (3 degrees of liberty).
 2 – codon frequencies from base frequencies at three codon positions (9 degrees of liberty).
 3 – independent codon frequencies (60 degrees of liberty).
 include_outgroup – boolean indicating whether sequences from the outgroup should be used for the phylogeny estimation.
 verbose – boolean indicating whether standard output of CodeML should be displayed.
 get_files – boolean indicating whether the raw content of CodeML output files should be included in the returned data.
 kappa – starting value for the transition/transversion rate ratio.
 fix_kappa – tell if the transition/transversion rate ratio should be fixed to its starting value (otherwise, it is estimated as a free parameter).
 omega – starting value for the dN/dS ratio (strictly positive value).
Returns: A
dict
holding results. The keys defined in the returned dictionary are:model
– model name.lk
– loglikelihood.np
– number of parameters of the model.kappa
– fixed or estimated value of the transition/transversion rate ratio.beta
– if model isM7
,M8a
, orM8
, atuple
with the p and q parameters of the beta distribution of neutral dN/dS ratios; otherwise,None
.K
– number of dN/dS ratio categories. Equals to 0 for thefree
model, to the number of branch categories for thenW
model, and to the number of site categories otherwise. This value is not necessarily equal to the ncat argument becauseM8a
andM8
models add a category, and because it has a different meaning for modelnW
.num_tags
– number of branch categories detected from the imported tree (irrespective to the model that has been fitted). If the star topology has been used (tree=None
), this value is 1.omega
– estimated dN/dS ratio or ratios. The structure of the value depends on the model:M0
model – a single value.free
model –None
(ratios are available as node labels in the tree available astree_ratios
).nW
model – alist
of dN/dS ratios for all branch categories (they are listed in the order corresponding to branch labels). Discrete models (
M1a
,M2a
,M3
,M4
,C0
,M7
,M8a
, andM8
) – alist
ofK
dN/dS ratios. The frequency of each dN/dS category is available isfreq
. A0
andA
models – atuple
of twolist
of 4 items each, containing respectively the background and foreground dN/dS ratios. The frequency of each dN/dS category is available isfreq
.C
andD
models – atuple
ofnum_tags
list
(onelist
for each set of branches, as defined by branch labels found in the provided tree), each of them containingK
dN/dS ratios. The frequency of each dN/dS category is available isfreq
.
freq
– the frequency of dN/dS ratio categories. If defined, it is alist
ofK
values. This entry isNone
for modelsM0
,free
, andnW
.length
– total length of tree after estimating branch lengths with the specified model.tree
– the tree with fitted branch lengths, as aTree
instance. Branch lengths are expressed in terms of the model of codon evolution.length_dS
– total length of tree in terms of synonymous substitutions. Only available withM0
,free
, andnW
models.length_dN
– total length of tree in terms of nonsynonymous substitutions. Only available withM0
,free
, andnW
models.tree_dS
– aTree
instance with branch lengths expressed in terms of synonymous substitutions. Only available withfree
andnW
models.tree_dN
– aTree
instance with branch lengths expressed in terms of nonsynonymous substitutions. Only available withfree
andnW
models.tree_ratios
– aTree
instance with the dN/dS ratios included as branch labels. Only available withfree
andnW
models.site_w
– adict
containing posterior predictions of site dN/dS ratios. Not available for modelsM0
,free
, andnW
(in that cases, the value isNone
). Thedict
contains the following keys:method
– on the stringsNEB
andBEB
.aminoacid
– the list of reference amino acids for all amino acid sites of the alignment (they are taken from the first sequence in the original alignment).proba
– the list of posterior probabilites of the dN/dS categories for all amino acid sites of the alignment. For each site, atuple
ofK
(the number of dN/dS categories) is provided.best
– the index of the best category for each site.postw
– list of the posterior dN/dS estimate for all sites (None
if not available).postwsd
– list of the standard deviation of the dN/dS estimate for all sites (always available ifpostw
is available and the method isBEB
,None
otherwise).P(w>1)
– probability that the dN/dS ratio is greater than 1 for all sites (None
if not available).
main_output
– raw content of the main CodeML output file. This key is not present if the option get_files is not set toTrue
.rst_output
– raw content of therst
detailed CodeML output file. This key is not present if the option get_files is not set toTrue
.
Changed in version 3.0.0: Turned into a singe function, interface changes (more models, more options, more results).
 align – an
Multiple alignment¶

egglib.wrappers.
clustal
(source, ref=None, data_type=None, include_outgroup=True, full=False, full_iter=False, cluster_size=100, use_kimura=True, num_iter=1, threads=1, keep_order=False, verbose=False)¶ Multiple sequence alignment using Clustal Omega.
Parameters:  source –
a
Container
orAlign
containing the sequences to align. If aContainer
is provided, sequences are assumed to be unaligned, and, if aAlign
is provided, sequences are assumed to be aligned. The list below explains what is done based on the type of source and whether a value is provided for ref: If source is a
Container
and ref isNone
, the sequences in source are aligned.  If source is an
Align
and ref isNone
, a hidden Markov model is built from the alignment, then the alignment is reset and sequences are realigned.  If source is an
Align
and an alignment is provided as ref, the two alignments are preversed (their columns are left unchanged), and they aligned with respect to each other.  If source is a
Container
and an alignment is provided as ref, a hidden Markov model is built from ref, then source is aligned using it, and finally the resulting alignment is aligned with respect to ref as described for the previous case.
source must contain at least two sequences unless it is an
Align
and a value is provided for ref (in that case, it must contain at least one sequence).  If source is a
 ref – an
Align
instance providing an external alignment. See avobe for more details. ref must contain at least one sequence. Sequences must be aligned.  data_type – one of the strings
Protein
,DNA
orRNA
. IfNone
, let Clustal Omega guess.  include_ougroup – boolean indicating whether sequences from the outgroup should be included in the alignment.
 full – use full distance matrix to determine guide tree (the default is using the faster and less memoryintensive mBed approximation).
 full_iter – use full distance matrix to determine guide tree during iterations.
 cluster_size – size of clusters (as a number of sequences) used in the mBed algorithm.
 use_kimura – use Kimura correction for estimating
wholealignment distance (only available if a protein alignment
has been provided as source and the option data_type has
been set to
Protein
).  num_iter – number of iterations allowing to improve the quality of the alignment. Must be a number ≥1 or a pair of numbers ≥1. If the value is a pair of numbers, they specify the number of guide tree iterations and hidden Markov model iterations, respectively. If a single value is provided, iterations couple guide tree and hidden Markov model.
 threads – number of threads for parallelization (available for parts of the program).
 keep_order – return the sequences in the same order as they were loaded.
 verbose – display Clustal Omega’s console output.
Returns: An
Align
instance containing aligned sequences.Changed in version 3.0.0: Ported to Clustal Omega and added support for more options.
 source –

egglib.wrappers.
muscle
(source, ref=None, include_outgroup=True, verbose=False, **kwargs)¶ Performs multiple alignment using Muscle.
Muscle’s default options tend to produce highquality alignments but may be long to run on large data set. Muscle’s author recommends using the option
maxiters=2
for large data sets, and, for fast alignment (in particular of closely related sequences):maxiters=1 diags=True aa_profile='sv' distance1='kbit20_3'
(for amino acid sequences) andmaxiters=1 diags=True
(for nucleotide sequences).Parameters:  source – a
Container
orAlign
containing sequences to align. If anAlign
is provided, sequences are assumed to be already aligned and alignment will be refined (using therefine
option of Muscle), unless an alignment is also provided as ref. In the latter case, the two alignments are preversed (their columns are left unchanged), and they aligned with respect to each other.  ref – an
Align
instance providing an alignement that should be aligned with respect to the alignment provided as source. If ref is provided, it is required both source and ref areAlign
instances.  include_outgroup – boolean indicating whether sequences from the outgroup should be included in the alignment.
 verbose – display Muscle’s console output.
 kwargs –
other keyword arguments are passed to Muscle. The available options are listed below:
option value anchors
a boolean brenner
a boolean cluster
a boolean diags
a boolean diags1
a boolean diags2
a boolean dimer
a boolean teamgaps4
a boolean SUEFF
a float aa_profile
one of: le
,sp
,sv
anchorspacing
an integer center
a float cluster1
one of: upgma
,upgmb
,neighborjoining
diagbreak
an integer diaglength
an integer diagmargin
an integer distance1
one of: kmer6_6
,kmer20_3
,kmer20_4
,kbit20_3
,kmer4_6
distance2
one of: pctidkimura
,pctidlog
gapopen
a float hydro
an integer hydrofactor
a float maxiters
an integer maxtrees
an integer minbestcolscore
a float minsmoothscore
a float nt_profile
one of: spn
objscore
one of: sp
,ps
,dp
,xp
,spf
,spm
refinewindow
an integer root1
one of: pseudo
,midlongestspan
,minavgleafdist
seqtype
one of: protein
,dna
,rna
,auto
smoothscoreceil
a float weight1
one of: none
,henikoff
,henikoffpb
,gsc
,clustalw
,threeway
weight2
one of: none
,henikoff
,henikoffpb
,gsc
,clustalw
,threeway
For a description of options, see the Muscle manual. Most of Muscle’s options are available. Note that function takes no flag option, and Muscle’s flag options are passed as boolean keyword arguments (except options relative to the amino acid or nucleotide profile score options, that are passed as string as
aa_profile
andnt_profile
, respectively. The order of options is preserved.
Returns: An
Align
containing aligned sequences.Changed in version 3.0.0: Added support for most options.
 source – a