External application tools

Application paths

Application paths can be set using the following syntax. A ValueError is raised if the automatic test fails. The change is valid for the current session only unless save() is used:

egglib.wrappers.paths[app] = path

And application paths are accessed as followed:

egglib.wrappers.paths[app]
egglib.wrappers.paths.autodetect(verbose=False)

Auto-configure application paths based on default command names.

Parameters:verbose – if True, print running information and test results.
egglib.wrappers.path.load()

Load values of application paths from the configuration file located within the package. All values currently set are discarded.

egglib.wrappers.path.save()

Save current values of application paths in the configuration file located within the package. This action may require administrator rights. All values currently set will be reloaded at next import of the package.

Phylogeny

egglib.wrappers.phyml(align, model, include_outgroup=True, export_labels=True, rates=1, boot=0, start_tree='nj', fixed_topology=False, fixed_brlens=False, search='NNI', num_rand_start=5, freq=None, TiTv=4.0, pinv=0.0, alpha=None, use_median=False, free_rates=False, seed=None, verbose=False)

Reconstruct maximum-likelihood phylogeny using PhyML.

PhyML is a program performing maximum-likelihood phylogeny estimation using nucleotide or amino acid sequence alignments. Ref.: Guindon S., J.-F. Dufayard, V. Lefort, M. Animisova, W. Hordijk, and O. Gascuel. 2010. New algorithms and methods to estimate maximum-likelihood phylogenies: assessing the performance of PhyML 3.0. Syst. Biol. 59: 307-321.

Parameters:
  • align – input sequence alignment as an Align instance.
  • model – substitution model to use (see list below).
  • include_ougroup – boolean indicating whether sequences from the outgroup should be used for the phylogeny estimation.
  • export_labels – boolean indicating whether the group labels should be included in the names of sequences (such as they will appear in the resulting tree).
  • rates – number of discrete categories of evolutionary rate. If different of 1, fits a gamma distribution of rates.
  • boot – number of bootstrap repetitions. Values of -1, -2 and -4 activate one the test-based branch support evaluation methods that provide faster alternatives to bootstrap repetitions (-1: aLRT statistics, -2: Chi2-based parametric tests, -4: Shimodaira and Hasegawa-like statistics). A value of 0 provides no branch support at all.
  • start_tree – starting topology used by the program. Possible values are the string nj (neighbour-joining tree), the string pars (maximum-parsimony tree), the string rand (random tree, only if search is SPR), and a Tree instance containing a user-provided topology. In the latter case, the names of leaves of the tree must match the names of the input alignment (without group labels), implying that names cannot be repeated.
  • fixed_topology – boolean indicating whether the topology provided in the Tree instance passed as start_tree argument should be NOT be improved.
  • fixed_brlens – boolean indicating whether the branch lengths provided in the Tree instance passed as start_tree argument should be NOT be improved. All branch lengths must be specified. Automatically sets fixed_topology to True.
  • search – search method. Available modes are: NNI (nearest- neighbour interchange, fastest), SPR (subtree pruning and regrafting), and BEST (best of both methods).
  • num_rand_start – number of random start trees (only considered if search is SPR and start_tree is rand).
  • freq – nucleotide or amino acid frequencies. Possible values are the string o (observed, frequencies measured from the data), the string m (estimated by maximum likelihood for nucleotides, or retrieved from the substitution model for amino acids), or a four-item tuples provided the relative frequencies of A, C, G and T respectively (only for nucleotides). By default, use o for nucleotides and m for amino acids.
  • TiTv – transition/transversion ratio. If None, estimated by maximum likelihood. Otherwise, must be a stricly positive value. Ignored if data are not nucleotides or if the model does not support
  • pinv – proportion of invariable sites. If None estimated by maximum likelihood. Otherwise, must be in the range [0, 1].
  • alpha – gamma shape parameter. If None, estimated by maximum likelihood. Otherwise, must be a strictly positive value. Ignored if rates is 1 or if free_rates is True.
  • use_median – boolean indicating whether the median (instead of the mean) should be use to report values for rate from the discretized gamma distribution. Ignored if rates is 1 or if free_rates is True.
  • free_rates – boolean indicating whether a mixture model should be used for substitution rate categories instead of the discretized gamma. In this case all reates and their frequencies will be estimated. Requires that rates is larger than 1.
  • seed – pseudo-random number generator seed. Must be a stricly positive integer, preferably large.
  • verbose – boolean indicating whether standard output of PhyML should be displayed.
Returns:

A (tree, stats) tuple where tree is a Tree instance and stats is a dict containing the following statistics or estimated parameters:

  • lk – log-likelihood of the returned tree and model.
  • pars – parsimony score of the returned tree.
  • size – length of the returned tree.
  • rates – relative substitution rates, as a list providing values in the following order: A↔C, A↔G, A↔T, C↔G, C↔T, and G↔T (only available if model is GTR or custom, only for nucleotide sequences).
  • alpha – gamma shape parameter (only if the number of rate categories is larger than 1 and if free_rates was False).
  • cats – list of (rate, proportion) tuples for each discrete rate category (only if free_rates was True, implying that the number of rates was larger than 1).
  • freq – list of the relative base frequencies, in the following order: A, C, G, and T (only for nucleotide sequences).
  • ti/tv – transition/transversion ratio (available for the models K80, HKY85, F84, and TN93). For the TN93 model, the resulting value is a pair of transition/transversion ratios, one for purines and one for pyrimidines (in that order).
  • pinv – proportion of invariable sites (only if the corresponding option was not set to 0).

The choice of the model defines the type of data that are expected. The available models are:

  • Nucleotides:

    Code Full name Rates Base frequencies
    JC69 Jukes and Cantor 1969 one equal
    K80 Kimura 1980 two equal
    F81 Felsenstein 1981 one unequal
    HKY85 Hasegawa, Kishino & Yano 1985 two unequal
    F84 Felsenstein 1984 two unequal
    TN93 Tamura and Nei 1993 three unequal
    GTR general time reversible six unequal

    In addition, custom nucleotide substitution models can be specified. In that case, model must be a six-character strings of numeric characters specifying which of the six (reversable) substitution rates are allowed to vary. The one-rate model is specified by the string 000000, the two-rate model (separate transition and transversion rates) is specified by 010010, and the GTR model is specified by 012345. The substitution rates are specified in the following order: A↔C, A↔G, A↔T, C↔G, C↔T, and G↔T.

  • Amino acids:

    Code Authors
    LG Le & Gascuel (Mol. Biol. Evol. 2008)
    WAG Whelan & Goldman (Mol. Biol. Evol. 2001)
    JTT Jones, Taylor & Thornton (CABIOS 1992)
    MtREV Adachi & Hasegawa (in Computer Science Monographs 1996)
    Dayhoff Dayhoff et al. (in Atlas of Protein Sequence and Structure 1978)
    DCMut Kosiol & Goldman (Mol. Biol. Evol. 2004)
    RtREV Dimmic et al. (J. Mol. Evol. 2002)
    CpREV Adachi et al. (J. Mol. Evol. 2000)
    VT Muller & Vingron (J. Comput. Biol. 2000)
    Blosum62 Henikoff & Henikoff (PNAS 1992)
    MtMam Cao et al. (J. Mol. Evol. 1998)
    MtArt Abascal, Posada & Zardoya (Mol. Biol. Evol. 2007)
    HIVw Nickle et al. (PLoS One 2007)
    HIVb ibid.

Changed in version 3.0.0: No more default value for model option. Added custom model for nucleotides. Changed SH pseudo-bootstrap option flag from -3 to -4. quiet function replaced by verbose. Several additional options are added. The syntax for input a user tree is modified. The second item in the returned tuple is a dictionary of statistics.

egglib.wrappers.codeml(align, tree, model, code=1, ncat=None, codon_freq=2, include_outgroup=True, verbose=False, get_files=False, kappa=2.0, fix_kappa=False, omega=0.4)

Fit nucleotide substitution models using the CodeML program from the PAML package.

Parameters:
  • align – an Align containing a coding sequence alignment. The number of sequences must be at least 3, the length of the alignment is required to be a multiple of 3, there must be no stop codons (even final stop codons) and there must not be any duplicated sequence name.
  • tree – a Tree providing the phylogenetic relationships between samples. The name of the sequences in the Align and in the Tree are required to match. If tree is None, a star topology is used. If the tree contains branch length or node labels, they are discounted, except for PAML node tags (#x and $x where x is an integer) that are allowed both as nodel labels. If one wants to label a terminal branch of the tree, they can add the label at the end of the sample name (with an optional separating white space). The tree must not be rooted (if there is a birfurcation at the base, an error will be caused).
  • model

    model. The list of model names appears below:

    • M0 – one-ratio model (1 parameter).
    • free – all branches have a different ratio (1 parameter per branch).
    • nW – several sets of branches. Requires labelling of branches of the tree (1 parameter per set of branches).
    • M1a – nearly-neutral model (2 parameters).
    • M2a – positive selection model (4 parameters).
    • M3 – discrete model. Requires setting ncat (2 * ncat - 1 parameters).
    • M4 – frequencies model. Requires setting ncat (ncat - 1 parameters).
    • M7 – beta-distribution model. Requires setting ncat (2 parameters).
    • M8a – beta + single ratio, additional ratio fixed to 1. Requires setting ncat (3 parameters).
    • M8 – beta + single ratio. Requires setting ncat (4 parameters).
    • A0 – null branch-site model. Requires labelling of branches of the tree with two different labels (3 parameters).
    • A – branch-site model with positive selection. Requires labelling of branches of the tree with two different labels (4 parameters).
    • C0 – null model for model C (M2a_rel). Does not require branch labelling (4 parameters).
    • C – branch-site model. Requires labelling of branches (5 parameters).
    • D – discrete branch-site model. Requires labelling of branches and requires setting ncat to either 2 or 3 (4 or 6 parameters, respectively).

    The number of parameters given for each model concern the dN/dS ratios only. Refer to PAML documentation or the following references for more details and recommendations: Bielawski, J.P. & Z. Yang. 2004. A maximum likelihood method for detecting functional divergence at individual codon sites, with application to gene family evolution. J. Mol. Evol. 59:121-132; Yang Z., R. Nielsen, N. Goldman & A.M.K. Pedersen. 2000; Codon-substitution models for heterogeneous selection pressure at amino acid sites. Genetics 155:431-449. Yang, Z., and R. Nielsen. 2002. Codon-substitution models for detecting molecular adaptation at individual sites along specific lineages. Mol. Biol. Evol. 19:908-917; Zhang, J., R. Nielsen & Z. Yang. 2005. Evaluation of an improved branch-site lieklihood method for detecting positive selection at the molecular level. Mol. Biol. Evol. 22:472-2479.

  • code – genetic code identifier (see Available genetic codes). Required to be an integer among the valid values. The default value is the standard genetic code. Only codes 1-11 are available.
  • ncat – number of dN/dS categories. Only a subset of models require that the number of categories to be specified. See models.
  • codon_freq

    an integer specifying the model for codon frequencies. Must be one of:

    • 0 – equal frequencies.
    • 1 – codon frequencies from base frequencies (3 degrees of liberty).
    • 2 – codon frequencies from base frequencies at three codon positions (9 degrees of liberty).
    • 3 – independent codon frequencies (60 degrees of liberty).
  • include_outgroup – boolean indicating whether sequences from the outgroup should be used for the phylogeny estimation.
  • verbose – boolean indicating whether standard output of CodeML should be displayed.
  • get_files – boolean indicating whether the raw content of CodeML output files should be included in the returned data.
  • kappa – starting value for the transition/transversion rate ratio.
  • fix_kappa – tell if the transition/transversion rate ratio should be fixed to its starting value (otherwise, it is estimated as a free parameter).
  • omega – starting value for the dN/dS ratio (strictly positive value).
Returns:

A dict holding results. The keys defined in the returned dictionary are:

  • model – model name.
  • lk – log-likelihood.
  • np – number of parameters of the model.
  • kappa – fixed or estimated value of the transition/transversion rate ratio.
  • beta – if model is M7, M8a, or M8, a tuple with the p and q parameters of the beta distribution of neutral dN/dS ratios; otherwise, None.
  • K – number of dN/dS ratio categories. Equals to 0 for the free model, to the number of branch categories for the nW model, and to the number of site categories otherwise. This value is not necessarily equal to the ncat argument because M8a and M8 models add a category, and because it has a different meaning for model nW.
  • num_tags – number of branch categories detected from the imported tree (irrespective to the model that has been fitted). If the star topology has been used (tree=None), this value is 1.
  • omega – estimated dN/dS ratio or ratios. The structure of the value depends on the model:
    • M0 model – a single value.
    • free model – None (ratios are available as node labels in the tree available as tree_ratios).
    • nW model – a list of dN/dS ratios for all branch categories (they are listed in the order corresponding to branch labels).
    • Discrete models (M1a, M2a, M3, M4, C0, M7, M8a, and M8) – a list of K dN/dS ratios. The frequency of each dN/dS category is available is freq.
    • A0 and A models – a tuple of two list of 4 items each, containing respectively the background and foreground dN/dS ratios. The frequency of each dN/dS category is available is freq.
    • C and D models – a tuple of num_tags list (one list for each set of branches, as defined by branch labels found in the provided tree), each of them containing K dN/dS ratios. The frequency of each dN/dS category is available is freq.
  • freq – the frequency of dN/dS ratio categories. If defined, it is a list of K values. This entry is None for models M0, free, and nW.
  • length – total length of tree after estimating branch lengths with the specified model.
  • tree – the tree with fitted branch lengths, as a Tree instance. Branch lengths are expressed in terms of the model of codon evolution.
  • length_dS – total length of tree in terms of synonymous substitutions. Only available with M0, free, and nW models.
  • length_dN – total length of tree in terms of non-synonymous substitutions. Only available with M0, free, and nW models.
  • tree_dS – a Tree instance with branch lengths expressed in terms of synonymous substitutions. Only available with free and nW models.
  • tree_dN – a Tree instance with branch lengths expressed in terms of non-synonymous substitutions. Only available with free and nW models.
  • tree_ratios – a Tree instance with the dN/dS ratios included as branch labels. Only available with free and nW models.
  • site_w – a dict containing posterior predictions of site dN/dS ratios. Not available for models M0, free, and nW (in that cases, the value is None). The dict contains the following keys:
    • method – on the strings NEB and BEB.
    • aminoacid – the list of reference amino acids for all amino acid sites of the alignment (they are taken from the first sequence in the original alignment).
    • proba – the list of posterior probabilites of the dN/dS categories for all amino acid sites of the alignment. For each site, a tuple of K (the number of dN/dS categories) is provided.
    • best – the index of the best category for each site.
    • postw – list of the posterior dN/dS estimate for all sites (None if not available).
    • postwsd – list of the standard deviation of the dN/dS estimate for all sites (always available if postw is available and the method is BEB, None otherwise).
    • P(w>1) – probability that the dN/dS ratio is greater than 1 for all sites (None if not available).
  • main_output – raw content of the main CodeML output file. This key is not present if the option get_files is not set to True.
  • rst_output – raw content of the rst detailed CodeML output file. This key is not present if the option get_files is not set to True.

Changed in version 3.0.0: Turned into a singe function, interface changes (more models, more options, more results).

Multiple alignment

egglib.wrappers.clustal(source, ref=None, data_type=None, include_outgroup=True, full=False, full_iter=False, cluster_size=100, use_kimura=True, num_iter=1, threads=1, keep_order=False, verbose=False)

Multiple sequence alignment using Clustal Omega.

Parameters:
  • source

    a Container or Align containing the sequences to align. If a Container is provided, sequences are assumed to be unaligned, and, if a Align is provided, sequences are assumed to be aligned. The list below explains what is done based on the type of source and whether a value is provided for ref:

    • If source is a Container and ref is None, the sequences in source are aligned.
    • If source is an Align and ref is None, a hidden Markov model is built from the alignment, then the alignment is reset and sequences are realigned.
    • If source is an Align and an alignment is provided as ref, the two alignments are preversed (their columns are left unchanged), and they aligned with respect to each other.
    • If source is a Container and an alignment is provided as ref, a hidden Markov model is built from ref, then source is aligned using it, and finally the resulting alignment is aligned with respect to ref as described for the previous case.

    source must contain at least two sequences unless it is an Align and a value is provided for ref (in that case, it must contain at least one sequence).

  • ref – an Align instance providing an external alignment. See avobe for more details. ref must contain at least one sequence. Sequences must be aligned.
  • data_type – one of the strings Protein, DNA or RNA. If None, let Clustal Omega guess.
  • include_ougroup – boolean indicating whether sequences from the outgroup should be included in the alignment.
  • full – use full distance matrix to determine guide tree (the default is using the faster and less memory-intensive mBed approximation).
  • full_iter – use full distance matrix to determine guide tree during iterations.
  • cluster_size – size of clusters (as a number of sequences) used in the mBed algorithm.
  • use_kimura – use Kimura correction for estimating whole-alignment distance (only available if a protein alignment has been provided as source and the option data_type has been set to Protein).
  • num_iter – number of iterations allowing to improve the quality of the alignment. Must be a number ≥1 or a pair of numbers ≥1. If the value is a pair of numbers, they specify the number of guide tree iterations and hidden Markov model iterations, respectively. If a single value is provided, iterations couple guide tree and hidden Markov model.
  • threads – number of threads for parallelization (available for parts of the program).
  • keep_order – return the sequences in the same order as they were loaded.
  • verbose – display Clustal Omega’s console output.
Returns:

An Align instance containing aligned sequences.

Changed in version 3.0.0: Ported to Clustal Omega and added support for more options.

egglib.wrappers.muscle(source, ref=None, include_outgroup=True, verbose=False, **kwargs)

Performs multiple alignment using Muscle.

Muscle’s default options tend to produce high-quality alignments but may be long to run on large data set. Muscle’s author recommends using the option maxiters=2 for large data sets, and, for fast alignment (in particular of closely related sequences): maxiters=1 diags=True aa_profile='sv' distance1='kbit20_3' (for amino acid sequences) and maxiters=1 diags=True (for nucleotide sequences).

Parameters:
  • source – a Container or Align containing sequences to align. If an Align is provided, sequences are assumed to be already aligned and alignment will be refined (using the -refine option of Muscle), unless an alignment is also provided as ref. In the latter case, the two alignments are preversed (their columns are left unchanged), and they aligned with respect to each other.
  • ref – an Align instance providing an alignement that should be aligned with respect to the alignment provided as source. If ref is provided, it is required both source and ref are Align instances.
  • include_outgroup – boolean indicating whether sequences from the outgroup should be included in the alignment.
  • verbose – display Muscle’s console output.
  • kwargs

    other keyword arguments are passed to Muscle. The available options are listed below:

    option value  
    anchors a boolean
    brenner a boolean
    cluster a boolean
    diags a boolean
    diags1 a boolean
    diags2 a boolean
    dimer a boolean
    teamgaps4 a boolean
    SUEFF a float
    aa_profile one of: le, sp, sv
    anchorspacing an integer
    center a float
    cluster1 one of: upgma, upgmb, neighborjoining
    diagbreak an integer
    diaglength an integer
    diagmargin an integer
    distance1 one of: kmer6_6, kmer20_3, kmer20_4, kbit20_3, kmer4_6
    distance2 one of: pctidkimura, pctidlog
    gapopen a float
    hydro an integer
    hydrofactor a float
    maxiters an integer
    maxtrees an integer
    minbestcolscore a float
    minsmoothscore a float
    nt_profile one of: spn
    objscore one of: sp, ps, dp, xp, spf, spm
    refinewindow an integer
    root1 one of: pseudo, midlongestspan, minavgleafdist
    seqtype one of: protein, dna, rna, auto
    smoothscoreceil a float
    weight1 one of: none, henikoff, henikoffpb, gsc, clustalw, threeway
    weight2 one of: none, henikoff, henikoffpb, gsc, clustalw, threeway

    For a description of options, see the Muscle manual. Most of Muscle’s options are available. Note that function takes no flag option, and Muscle’s flag options are passed as boolean keyword arguments (except options relative to the amino acid or nucleotide profile score options, that are passed as string as aa_profile and nt_profile, respectively. The order of options is preserved.

Returns:

An Align containing aligned sequences.

Changed in version 3.0.0: Added support for most options.