Export data in the format used by the ms program (Hudson 2002 Bioinformatics 18: 337-338). The original format is designed for binary (0/1) allelic values. This implementation exports any allelic values present in the provided alignment, always as integers (if sequence data is included, it is represented by the corresponding integer values). In addition, spaces can be inserted between all loci to accommodate allelic values outside the range [0,9]; outside this range, loci can no longer be told apart in the standard format. See the spacer option.
The format is as follows:
One line with two slashes.
One line with the number of sites
sites is zero.
of sites is larger than zero.
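The layout can be illustrated with a minimal sketch in plain Python. This is not the EggLib exporter: format_ms is a hypothetical helper, the "segsites:" and "positions:" labels follow Hudson's ms output as commonly documented, and printing nothing after the site count when there are no sites is an assumption. The spacer argument mimics the effect described above for allelic values outside [0,9].

```python
def format_ms(genotypes, positions, spacer=False):
    """Format one block in ms style (hypothetical helper, not EggLib's code).

    genotypes: list of samples, each a list of integer allelic values.
    positions: list of site positions, one per site.
    spacer: if True, separate allelic values with a space.
    """
    num_sites = len(positions)
    for sample in genotypes:
        if len(sample) != num_sites:
            raise ValueError("each sample needs one value per site")
    lines = ["//", "segsites: {}".format(num_sites)]
    if num_sites > 0:
        # Assumption: positions and genotype matrix are only written
        # when there is at least one site.
        lines.append("positions: " + " ".join("{:.4f}".format(p) for p in positions))
        sep = " " if spacer else ""
        for sample in genotypes:
            lines.append(sep.join(str(value) for value in sample))
    return "\n".join(lines) + "\n"


# Binary data fits the standard format.
print(format_ms([[0, 1, 1], [1, 0, 1]], [0.1, 0.45, 0.8]))
# The value 12 would be ambiguous without a spacer.
print(format_ms([[0, 12], [3, 1]], [0.2, 0.7], spacer=True))
```

Without the spacer, the first sample of the second call would be written as 012, which cannot be distinguished from three single-digit loci.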
Here is the description of the fasta format used in EggLib:
Iterative sequence-by-sequence fasta parser. Return an object that can be iterated over:
    for item in egglib.io.fasta_iter(fname):
        ...  # do things
fasta_iter objects support the with statement:
    with egglib.io.fasta_iter(fname) as f:
        for item in f:
            ...  # do things
Each iteration yields a SampleView instance (which is valid only during the iteration round, see the warning below). It is also possible to iterate manually using next(). The number of groups is defined by the current sample (if the number of defined groups varies among samples, it is reset at each iteration).
The aim of this iterator is to iterate over large fasta files without storing all data in memory at the same time. The SampleView provided at each iteration is a proxy to a local Container instance that is recycled at each iteration step. It should be used immediately and never stored as such. To store data accessible through a SampleView, copy this data to another data structure (typically using Container.add_sample()).
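The reason for this warning can be shown with a small stand-alone sketch that does not use EggLib. The generator recycled_fasta_iter below is hypothetical and only mimics the recycling behaviour: it yields the same dictionary at every round, so references kept from earlier rounds all end up showing the last record.

```python
import io


def recycled_fasta_iter(handle):
    """Yield one record per sequence, reusing the same dict at each round."""
    record = {"name": None, "sequence": ""}  # recycled at each iteration step
    name, chunks = None, []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                record["name"], record["sequence"] = name, "".join(chunks)
                yield record
            name, chunks = line[1:], []
        elif line:
            chunks.append(line)
    if name is not None:
        record["name"], record["sequence"] = name, "".join(chunks)
        yield record


data = ">s1\nACGT\nAC\n>s2\nGGTT\n"

# Wrong: every stored item is the same recycled object.
stored = [rec for rec in recycled_fasta_iter(io.StringIO(data))]

# Right: copy the data out during the iteration round.
copied = [dict(rec) for rec in recycled_fasta_iter(io.StringIO(data))]

print(stored[0] is stored[1])        # both entries are one object
print([rec["name"] for rec in copied])
```

Copying inside the loop plays the role that Container.add_sample() plays in the description above.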
New in version 3.0.0.
Read General Feature Format (GFF)-formatted (version 3) genome annotation data from a file specified by name or from a provided string.
See the description of the GFF3 format at http://www.sequenceontology.org/gff3.shtml.
This class supports segmented features but only if they are consecutive in the file. All features are loaded into memory and can be processed interactively.
The liberal argument makes the parser tolerate a few violations of the canonical GFF3 format. The current list of violations is:
The list of supported violations may change in the future between minor versions of EggLib. It is recommended to use liberal=True only for exploratory analyses. Otherwise, it is better to fix the original file so that it complies with the format.
New in version 3.0.0.
Return an iterator over features, returned as GFF3Feature instances. Only top features complying with arguments are considered.
If seqid is not valid for this data file, raise a ValueError.
Metadata of the imported file (information present in the file header). Metadata are available as a list of (key, value) tuples. It is not possible to replace the list by another object, but it is allowed to modify the returned object.
Number of features that are directly accessible (that is, those that do not have a parent).
Total number of features of the input file, including lower-level features and features that have not been indexed, if any.
Provide information related to a given feature. Currently, instances cannot be created by the user and are read-only. Data are available as read-only properties; some of these are lists and can be modified but it does not make any sense to do so.
Value of the ID attribute, or None if this attribute was not defined.
List of Alias attributes.
List of non-predefined attributes, as (key, items) tuples, where items is itself a list.
List of Dbxref attributes.
Value of the Derives_from attribute, or None if this attribute was not defined.
Value of the Gap attribute, or None if this attribute was not defined.
Get a part (descending feature), as a GFF3Feature instance. The index should be within range.
Value of the Is_circular attribute, as a boolean.
Value of the Name attribute, or None if this attribute was not defined.
List of Note attributes.
Number of fragments.
Number of parents.
Number of parts (descending features).
List of Ontology_term attributes.
Value of the Phase attribute. Possible values are: 0, 1, 2 and None (if undefined).
List of start/end positions of all fragments.
Value of the Score attribute, or None if this attribute was not defined.
Value of seqid for this feature.
Source of this feature.
Value of the Target attribute, or None if this attribute was not defined. The target value is not processed and is provided as a single string.
Type of this feature.
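To show where the values listed above come from, here is a minimal sketch that splits one GFF3 feature line into its nine tab-separated columns and decodes the attribute column. It is written in plain Python and is not the EggLib parser; parse_gff3_line is a hypothetical helper, and unlike GFF3Feature it returns every attribute as a list, including single-valued ones such as ID.

```python
from urllib.parse import unquote


def parse_gff3_line(line):
    """Split one GFF3 feature line into its nine columns (hypothetical helper)."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        raise ValueError("a GFF3 feature line has nine tab-separated columns")
    seqid, source, type_, start, end, score, strand, phase, attrs = fields
    attributes = {}
    if attrs != ".":
        for item in attrs.split(";"):
            if not item:
                continue
            key, _, value = item.partition("=")
            # Multiple values are comma-separated; percent-escapes are
            # decoded after splitting.
            attributes[unquote(key)] = [unquote(v) for v in value.split(",")]
    return {
        "seqid": unquote(seqid),
        "source": unquote(source),
        "type": unquote(type_),
        "start": int(start),
        "end": int(end),
        "score": None if score == "." else float(score),
        "strand": strand,
        "phase": None if phase == "." else int(phase),
        "attributes": attributes,
    }


feat = parse_gff3_line(
    "ctg123\t.\tCDS\t1201\t1500\t.\t+\t0\tID=cds1;Parent=mRNA1;Alias=a,b;Note=first%20exon"
)
print(feat["type"], feat["start"], feat["end"], feat["phase"])
print(feat["attributes"])
```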
Read Variant Call Format (VCF)-formatted data for genomic polymorphism information from a file specified by name or from a string.
See the description of the VCF format.
There are two ways to process VCF data. The first uses a static file that is parsed iteratively: create the parser with the standard constructor VcfParser(fname), then iterate over lines in a for loop (alternatively, call VcfParser.next() directly). The second uses the class factory method VcfParser.from_header(string), after which each line is fed manually as a string using VcfParser.read_line(string).
VcfParser instances are iterable (with support for the for statement and the next() method) only if they are created with a file to process. Otherwise they must be fed line by line with the read_line() method. Every round of a for loop and every call to next() or read_line() yields a (chromosome, position, num_all) tuple that allows the user to determine whether the variant is of interest. If so, the VcfParser object provides methods to extract all data for this variant (which can be time-consuming and should be restricted to pre-filtered lines to improve efficiency).
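The pre-filtering idea can be sketched without EggLib. The function peek_variant below is a hypothetical stand-in that reads only the columns needed to build a (chromosome, position, num_all) tuple from a VCF data line. The position is kept exactly as written in the file; whether VcfParser applies an offset is not stated here, so none is assumed.

```python
def peek_variant(line):
    """Return (chromosome, position, num_all) for one VCF data line."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 8:
        raise ValueError("a VCF data line has at least eight columns")
    chrom, pos, alt = fields[0], int(fields[1]), fields[4]
    num_alt = 0 if alt == "." else len(alt.split(","))
    return chrom, pos, 1 + num_alt  # the reference allele always counts


lines = [
    "20\t14370\trs6054257\tG\tA\t29\tPASS\tNS=3;DP=14",
    "20\t17330\t.\tT\t.\t3\tq10\tNS=3",
    "20\t1110696\trs6040355\tA\tG,T\t67\tPASS\tNS=2",
]

# Keep only lines with at least two alleles before any expensive extraction.
selected = []
for line in lines:
    chrom, pos, num_all = peek_variant(line)
    if num_all >= 2:
        selected.append((chrom, pos, num_all))

print(selected)
```

Only the lines kept in selected would then be worth the cost of a full data extraction.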
New in version 3.0.0.
File format present in the read header.
Create and return a new VcfParser instance reading the header passed as the string argument.
There are two ways to use this method:
This method requires that all parsers have been used to process valid data. If the AA field is present in the processed parsers, its value is imported as outgroup. If get_genotypes is True, the ancestral genotype is assumed to be homozygous for the given ancestral allele. If get_genotypes is False, the ancestral allele is loaded only once in the outgroup.
A Site instance by default, or None if dest was specified.
Get data for a given META field defined in the VCF header. The passed index must be smaller than num_meta. Return a tuple containing the key and the value of the META field.
Get the name of a sample read from the header. The passed index must be smaller than num_samples.
Return a Variant instance containing all data available for the last variant processed by this instance. A variant must actually have been processed beforehand.
Read one variant. Raise a StopIteration exception if no data is available.
|Returns:||The same as an iteration loop (see class description).|
Number of defined ALT fields.
Number of defined FILTER fields.
Number of defined FORMAT fields.
Number of defined INFO fields.
Number of defined META fields.
Number of samples read from header.
Read one variant from a user-provided single line. The string should contain a single line of VCF-formatted data (no header). All field specifications and sample information should be consistent with the information contained in the header that was provided to this instance at creation time (whether it was read from a file or provided as a string).
|Returns:||The same as an iteration loop (see class description).|
Represent a single variant (one line from a VCF-formatted data file). Users cannot create instances of this class directly (instances are generated by VcfParser) and instances are not modifiable in principle (however, some attributes provide mutable objects, as mentioned).
The AA (ancestral allele), AN (allele number), AC (allele count), and AF (allele frequency) INFO fields as well as the GT (deduced genotype) FORMAT are automatically extracted if they are present in the file and if their definition matches the format specification (meaning that they were not re-defined with a different number/type) in the header. If present, they are available through the dedicated attributes AN, AA, AC, AF, GT, GT_ploidy and GT_phased. However, they are still available in the respective info and samples (sub)-dictionaries.
Value of the AA info field (None if missing).
Value of the AC info field, as a tuple (None if missing).
Value of the AF info field, as a tuple (None if missing).
Value of the AN info field (None if missing).
Genotypes from GT fields (only if this format field is available), provided as a tuple of sub-tuples. The number of sub-tuples is equal to the number of samples (num_samples). The number of items within each sub-tuple is equal to the ploidy (GT_ploidy). These items are allele expressions (as found in alleles), or None (for missing values). This attribute is None if GT is not available.
Boolean indicating whether the genotype for each sample is phased (None if GT is not available).
Ploidy among genotypes (None if GT is not available).
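As an illustration of how GT strings map to the structures described above, here is a minimal sketch in plain Python. decode_gt is a hypothetical helper, not part of EggLib; it follows the VCF convention that allele indexes are separated by "/" (unphased) or "|" (phased) and that "." marks a missing allele. Treating a genotype without any "/" as phased is an assumption of this sketch.

```python
import re


def decode_gt(gt, alleles):
    """Decode one GT string into (tuple of allele expressions, phased flag)."""
    if not gt:
        raise ValueError("empty GT string")
    indexes = re.split(r"[/|]", gt)
    phased = "/" not in gt  # assumption: no unphased separator means phased
    genotype = tuple(None if i == "." else alleles[int(i)] for i in indexes)
    return genotype, phased


alleles = ("G", "A", "T")       # reference first, then alternate alleles
samples = ["0|1", "1/2", "./."]

decoded = [decode_gt(gt, alleles) for gt in samples]
genotypes = tuple(g for g, _ in decoded)      # one sub-tuple per sample
phased_flags = tuple(p for _, p in decoded)
ploidy = len(genotypes[0])

print(genotypes)
print(phased_flags, ploidy)
```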
Tuple containing all IDs (even if just one or none).
Variant alleles (the first is the reference and is not guaranteed to be present in samples), as a tuple.
Alternate allele symbolizing a breakend (see VCF description for more details).
Explicit alternate allele (the string represents the nucleotide sequence of the allele).
Alternate allele referring to a pre-defined allele (the string provides the ID of the allele).
Alternate allele types, as a tuple. One value is provided for each alternate allele. The provided values are integers that should always be compared to the class attributes alt_type_default, alt_type_referred and alt_type_breakend, as in (for the type of the first alternate allele):
    type_ = variant.alternate_types[0]
    if type_ == variant.alt_type_default:
        allele = variant.allele(0)
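The three categories can be told apart from the ALT string itself. The sketch below is a plain-Python illustration, not EggLib code: classify_alt and the three constants are hypothetical, and their integer values are arbitrary (which is why the class attributes, not literal integers, should be used for comparison). It relies on the VCF conventions that pre-defined alleles are written between angle brackets and that paired breakends contain square brackets; single breakends are not handled.

```python
# Illustrative values only; they are not the values used by EggLib.
ALT_DEFAULT, ALT_REFERRED, ALT_BREAKEND = range(3)


def classify_alt(allele):
    """Classify one ALT string (hypothetical helper)."""
    if "[" in allele or "]" in allele:
        return ALT_BREAKEND          # e.g. G]17:198982]
    if allele.startswith("<") and allele.endswith(">"):
        return ALT_REFERRED          # e.g. <DEL>, refers to an ID
    return ALT_DEFAULT               # explicit nucleotide sequence


for alt in ("A", "<DEL>", "G]17:198982]"):
    print(alt, classify_alt(alt))
```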
Chromosome name (None if missing).
Names of the filters that this variant failed, as a tuple (None if no filters applied).
IDs of the FORMAT fields available for each sample, as a frozenset (empty if no sample data is available).
Dictionary of INFO fields for this variant. Keys are the IDs of the INFO fields available for this variant, and values are always tuples of items. For flag INFO types, the value is always an empty tuple.
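The shape of this dictionary can be sketched as follows. parse_info is a hypothetical helper written in plain Python; it leaves all items as strings, because converting them would require the types declared in the header, which this sketch does not read.

```python
def parse_info(field):
    """Turn a VCF INFO column into a dict of tuples (hypothetical helper)."""
    info = {}
    if field == ".":
        return info  # no INFO data for this variant
    for item in field.split(";"):
        if not item:
            continue
        key, sep, value = item.partition("=")
        # A key without "=" is a flag and maps to an empty tuple.
        info[key] = tuple(value.split(",")) if sep else ()
    return info


info = parse_info("NS=3;DP=14;AF=0.333,0.667;DB")
print(info)
```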
Number of alleles (including the reference in all cases).
Number of samples (equivalent to len(Variant.samples)).
Position (None if missing).
Variant quality (None if missing).
Imports a clustal-formatted alignment. The input format is the one generated and used by CLUSTALW (see http://web.mit.edu/meme_v4.9.0/doc/clustalw-format.html).
|Parameters:||string – input clustal-formatted sequence alignment.|
|Returns:||A new Align instance.|
Changed in version 3.0.0: Renamed (previous name was aln2fas()). Input argument is a string rather than a file.
Import the output file of the GAP4 program of the Staden package.
The input file should have been generated from a contig alignment by the GAP4 contig editor, using the command “dump contig to file”. The sequence named CONSENSUS, if present, is automatically removed unless the option delete_consensus is False.
Staden’s default convention is followed:
New in version 2.0.1: Add argument delete_consensus.
Changed in version 2.1.0: Read from string or fname.
Changed in version 3.0.0: Renamed from_staden(). Only string input is supported now.
Converts Genalys-formatted sequence alignment files to fasta. This function imports files generated through the option Save SNPs of Genalys 2.8.
|Parameters:||string – input data as a Genalys-formatted string.|
|Returns:||An Align instance.|
Changed in version 3.0.0: Renamed from_genalys(). Only string input is supported now.
Imports fgenesh output.
|Parameters:||fname – a string containing fgenesh output.|
|Parameters:||locus – locus name.|
|Returns:||A list of gene and CDS features represented by dictionaries. Note that 5’ partial features might not be in the appropriate frame and that it can be necessary to add a codon_start qualifier.|
Changed in version 3.0.0: Input as string. Added locus argument.
This class represents a GenBank-formatted DNA sequence record.
Only one of the two arguments fname and string can be non-None. If both are None, the constructor generates an empty instance with sequence of length 0. If fname is non-None, a GenBank record is read from the file with this name. If string is non-None, a GenBank record is read directly from this string. The following variables are read from the parsed input if present: accession, definition, title, version, GI, keywords, source, references (which is a list), locus and others. Their default value is None except for references and others for which default is an empty list. source is a (description, species, taxonomy) tuple. Each of references is a (header, raw reference) tuple and each of others is a (key, raw) tuple.
In addition to methods documented below, the following operations are supported for gb if it is a GenBank instance:
|len(gb)||Length of the sequence attached to this record|
|str(gb)||GenBank representation of the record|
|for feat in gb||Iterate over GenBankFeature instances of this record|
Add a feature to the instance. The argument feature must be a well-formed GenBankFeature instance.
Return a new GenBank instance representing a subset of the current instance, from position from_pos to to_pos. All (and only) features that are completely included in the specified range are exported.
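The selection rule (a feature is kept only if it lies completely within the range) can be written as a short plain-Python sketch. features_within is a hypothetical helper working on (name, start, stop) tuples; treating both bounds as inclusive is an assumption, and positions are left as they are since the text does not say whether they are renumbered.

```python
def features_within(features, from_pos, to_pos):
    """Keep features entirely inside [from_pos, to_pos] (bounds assumed inclusive)."""
    return [
        (name, start, stop)
        for name, start, stop in features
        if start >= from_pos and stop <= to_pos
    ]


features = [("gene1", 10, 50), ("gene2", 40, 120), ("gene3", 60, 100)]
kept = features_within(features, 30, 100)
print(kept)  # gene1 starts before 30 and gene2 ends after 100
```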
Give the number of features contained in the instance.
Reverse-complement the instance (in place). All feature positions and the sequence are reversed and applied to the complementary strand. The features are then sorted by increasing start position (after reversal). This method should be applied only to genuine nucleotide sequences.
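The coordinate arithmetic behind this operation can be sketched in plain Python. This is not the EggLib implementation: reverse_complement is a hypothetical helper, features are plain (name, start, stop, strand) tuples, and positions are assumed to be 1-based and inclusive as in the GenBank file format.

```python
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(sequence, features):
    """Reverse-complement a sequence and its (name, start, stop, strand) features.

    Assumption: positions are 1-based and inclusive; strand is +1 or -1.
    """
    length = len(sequence)
    rc_seq = sequence.translate(COMPLEMENT)[::-1]
    rc_feats = [
        (name, length - stop + 1, length - start + 1, -strand)
        for name, start, stop, strand in features
    ]
    rc_feats.sort(key=lambda feat: feat[1])  # increasing start position
    return rc_seq, rc_feats


seq = "AACCGGGT"
feats = [("a", 1, 3, 1), ("b", 5, 8, -1)]
rc_seq, rc_feats = reverse_complement(seq, feats)
print(rc_seq)
print(rc_feats)
```

Feature a covers AAC in the original sequence and GTT (its reverse complement) at positions 6 to 8 afterwards, which is why the two features swap order after sorting.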
Sequence string (can be modified). Note that changing the sequence string of the record might invalidate the features (meaning that setting an invalid sequence might cause the features to point to incorrect or out-of-bounds regions of the sequence).
Create a file named fname and write the formatted record to it.
Write the content of the instance as a GenBank-formatted string to the passed file (or file-compatible) stream.
Instances of this class represent features associated with a GenBank instance. They should not be instantiated or used separately from a GenBank instance. The constructor creates an empty instance (although a GenBank instance must be passed as parent) and either set() or parse() must be used subsequently.
|Parameters:||parent – a GenBank instance to which the feature should be attached.|
In addition to methods documented below, the following operations are supported for feat if it is a GenBankFeature instance:
|str(feat)||GenBank representation of the feature|
Add a qualifier to the instance’s qualifiers.
First position of the first (or unique) segment, in such a way that start() is always smaller than stop().
Last position of the last (or unique) segment, in such a way that start() is always smaller than stop().
Return the type string of the instance.
Update feature information from information read in a GenBank-formatted string.
Return a dictionary with all qualifier values. This method cannot be used to change data within the instance. Note that changes of the returned dictionary don’t affect data contained in the instance.
Changed in version 2.1.0: Meaning changed.
Reverse-complement the feature: apply it to the complement strand and reverse positions counting from the end. The length argument specifies the length of the complete sequence and is usually not required.
Shift all positions according to the (positive or negative) argument.
Update feature information.
Hold the location of a GenBank feature. Supports various forms of location as defined in the GenBank format specification. The constructor contains a parser working from a GenBank-formatted string. By default, features are on the forward strand and segmented features are ranges (not orders).
In addition to methods documented below, the following operations are supported for loc if it is a GenBankFeatureLocation instance:
|len(loc)||Number of segments|
|loc[index]||Return the (first, last) tuple for the corresponding segment|
|for (first, last) in loc||Iterator over segments|
|str(loc)||Generate a GenBank representation|
GenBankFeatureLocation supports iteration over (first, last) segments regardless of their types (for a single-base segment at position position, the tuple (position, position) is returned; similar 2-item tuples are returned for other types of segment as well).
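To make the (first, last) representation concrete, the sketch below turns a simple GenBank location string into a list of such tuples. It is plain Python and not the parser used by GenBankFeatureLocation: parse_location is a hypothetical helper, it drops partial markers (< and >), does not support complement() nested inside join(), and assumes that join corresponds to a range and order to an order.

```python
import re

# One segment: single base, base range (..), base choice (.) or between-base (^).
SEGMENT = re.compile(r"^<?(\d+)(?:(\.\.|\.|\^)>?(\d+))?$")


def parse_location(text):
    """Return (segments, complement, is_range) for a simple GenBank location."""
    text = text.strip()
    complement = text.startswith("complement(") and text.endswith(")")
    if complement:
        text = text[len("complement("):-1]
    is_range = True
    for keyword in ("join", "order"):
        if text.startswith(keyword + "(") and text.endswith(")"):
            is_range = keyword == "join"
            text = text[len(keyword) + 1:-1]
            break
    segments = []
    for part in text.split(","):
        match = SEGMENT.match(part.strip())
        if match is None:
            raise ValueError("unsupported segment: {!r}".format(part))
        first = int(match.group(1))
        last = int(match.group(3)) if match.group(3) else first
        segments.append((first, last))
    return segments, complement, is_range


for loc in ("467", "join(12..78,134..202)", "complement(<345..500)",
            "order(102.110,123^124)"):
    print(loc, parse_location(loc))
```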
Add a segment corresponding to a single base chosen within a base range. If no segments were entered previously, set the unique segment location. first and last must be integers. The feature will be set between first and last positions, including both limits. If the feature is intended to be placed on the complement strand between positions, say, 1127 and 1482, one must use add_base_choice(1127,1482) in combination with set_complement(). All entered positions must be larger than any positions entered previously and last must be strictly larger than first. left_partial and/or right_partial must be set to True if, respectively, the real start of the segment lies 5’ of first and/or the real end of the segment lies beyond last (relative to the forward strand and consistent with the numbering system).
Add a base range to the feature. If no segments were entered previously, set the unique segment location. first and last must be integers. The feature will be set between first and last positions, including both limits. If the feature is intended to be placed on the complement strand between positions, say, 1127 and 1482, one must use add_base_range(1127,1482) in combination with set_complement(). All entered positions must be larger than any positions entered previously and last must be larger than or equal to first. left_partial and/or right_partial must be set to True if, respectively, the real start of the segment lies 5’ of first and/or the real end of the segment lies beyond last (relative to the forward strand and consistent with the numbering system).
Add a segment lying between two consecutive bases. If no segments were entered previously, set the unique segment location. position must be an integer. The feature will be set between position and position + 1. If the feature is intended to be placed on the complement strand between positions, say, 1127 and 1128, one must use add_between_base(1127) in combination with set_complement(). All entered positions must be larger than any positions entered previously.
Add a single-base segment to the feature. If no segments were entered previously, set the unique segment location. position must be an integer. All entered positions must be larger than any positions entered previously.
Define the feature as an order instead of a range.
Define the feature as a range, which is the default.
Return a deep copy of the current instance.
True if the feature is on the complement strand.
True if the feature is a range (the default), False if it is an order.
Reverse the feature positions: positions are modified to be counted from the end. The length of the complete sequence must be passed.
Place the feature on the complement strand.
Place the feature on the forward (not complement) strand, which is the default.
Shift all positions according to the (positive or negative) argument.