Export data in the format used by the ms program (Hudson 2002 Bioinformatics 18: 337-338). The original format is designed for binary (0/1) allelic values. This implementation exports any allelic values present in the provided alignment, always as integers (if sequence data are included, they are represented by their corresponding integer values). In addition, it is possible to insert spaces between all loci to accommodate allelic values outside the range [0,9]; with such values, loci can no longer be told apart in the standard format. See the spacer option.
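The ambiguity that the spacer resolves can be illustrated with a minimal sketch in plain Python. It does not use EggLib: the function name format_genotype_row and its spacer argument are hypothetical and only mimic the idea behind the option.

```python
def format_genotype_row(alleles, spacer=False):
    """Format the allelic values of one sample as a single line of text."""
    # Hypothetical helper: without a spacer, values are simply concatenated.
    separator = " " if spacer else ""
    return separator.join(str(allele) for allele in alleles)


# Two different samples, each with two loci and values outside [0, 9].
sample_a = [1, 12]
sample_b = [11, 2]

# Without a spacer both samples produce the same text, so loci are lost.
compact_a = format_genotype_row(sample_a)
compact_b = format_genotype_row(sample_b)

# With a spacer the loci remain distinguishable.
spaced_a = format_genotype_row(sample_a, spacer=True)
spaced_b = format_genotype_row(sample_b, spacer=True)

print(compact_a, compact_b)
print(spaced_a, "|", spaced_b)
```

With strictly binary data the compact form is unambiguous, which is why the original format needs no separator.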
The format is as follows:
One line with two slashes.
One line with the number of sites.
sites is zero.
of sites is larger than zero.
Here is the description of the fasta format used in EggLib:
Iterative sequence-by-sequence fasta parser. Return an object that can be iterated over:
    for item in egglib.io.fasta_iter(fname):
        ...  # do things
fasta_iter objects support the with statement:
    with egglib.io.fasta_iter(fname) as f:
        for item in f:
            ...  # do things
Each iteration yields a SampleView instance (which is valid only during the iteration round, see the warning below). It is also possible to iterate manually using next(). The number of groups is defined by the current sample (if the number of defined groups varies among samples, it is reset at each iteration).
The aim of this iterator is to iterate over large fasta files without storing all data in memory at the same time. The SampleView instances provided at each iteration are proxies to a local Container instance that is recycled at each iteration step. They should be used immediately and never stored as such. If one wants to store data accessible through any SampleView, one should copy these data to another data structure (typically using Container.add_sample()).
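The reason why a recycled proxy must not be stored can be shown with a small stand-alone sketch. It does not use EggLib: RecyclingReader is a hypothetical class in which a plain list plays the role of the recycled Container, and copying to a tuple stands in for copying the data to another structure.

```python
class RecyclingReader:
    """Yield the same list object at every step, overwriting its content."""

    def __init__(self, records):
        self._records = records
        self._buffer = []  # single object, recycled at each iteration

    def __iter__(self):
        for name, sequence in self._records:
            self._buffer[:] = [name, sequence]  # overwrite in place
            yield self._buffer


records = [("seq1", "ACGT"), ("seq2", "GGCC")]

# Storing the yielded objects keeps several references to one buffer,
# so every stored item ends up showing the last record only.
stored_views = [view for view in RecyclingReader(records)]

# Copying the content during the iteration preserves each record.
stored_copies = [tuple(view) for view in RecyclingReader(records)]

print(stored_views)
print(stored_copies)
```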
New in version 3.0.0.
Read General Feature Format (GFF)-formatted (version 3) genome annotation data from a file specified by name or from a provided string.
See the description of the GFF3 format at http://www.sequenceontology.org/gff3.shtml.
This class supports segmented features but only if they are consecutive in the file. All features are loaded into memory and can be processed interactively.
The liberal argument makes the parser tolerate a few violations of the canonical GFF3 format. The current list of violations is:
The list of supported violations may change in the future between minor versions of EggLib. It is recommended to use liberal=True only for exploratory analyses. Otherwise, it is better to fix the original file so that it complies with the format.
New in version 3.0.0.
Return an iterator over features, returned as GFF3Feature instances. Only top features complying with arguments are considered.
If seqid is not valid for this data file, raise a ValueError.
Metadata of the imported file (information present in the file header). Metadata are available as a list of (key, value) tuples. It is not possible to replace the list by another object, but it is allowed to modify the returned object.
Number of features that are directly accessible (that is, those that do not have a parent).
Total number of features of the input file, including lower-level features and features that have not been indexed, if any.
Provide information related to a given feature. Currently, instances cannot be created by the user and are read-only. Data are available as read-only properties; some of these are lists and can be modified but it does not make any sense to do so.
Value of the ID attribute, or None if this attribute was not defined.
List of Alias attributes.
List of non-predefined attributes, as (key, items) tuples, where items is itself a list.
List of Dbxref attributes.
Value of the Derives_from attribute, or None if this attribute was not defined.
Value of the Gap attribute, or None if this attribute was not defined.
Get a part (descending feature), as a GFF3Feature instance. The index should be within range.
Value of the Is_circular attribute, as a boolean.
Value of the Name attribute, or None if this attribute was not defined.
List of Note attributes.
Number of fragments.
Number of parents.
Number of parts (descending features).
List of Ontology_term attributes.
Value of the Phase attribute. Possible values are: 0, 1, 2 and None (if undefined).
List of start/end positions of all fragments.
Value of the Score attribute, or None if this attribute was not defined.
Value of seqid for this feature.
Source of this feature.
Value of the Target attribute, or None if this attribute was not defined. The target value is not processed and is provided as a single string.
Type of this feature.
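The attributes listed above come from the ninth column of a GFF3 line, where entries are separated by semicolons, written as key=value, and may carry several comma-separated values. The following sketch shows how such a column can be decoded in plain Python. It is not EggLib's parser: the function name parse_gff3_attributes is hypothetical, and every attribute is returned as a list for simplicity.

```python
from urllib.parse import unquote


def parse_gff3_attributes(column9):
    """Decode a GFF3 attribute column into a {key: [values]} dictionary."""
    attributes = {}
    for field in column9.strip().split(";"):
        if not field:
            continue  # tolerate a trailing semicolon
        key, _, raw_value = field.partition("=")
        # Split on commas first, then decode the URL-style escapes.
        attributes[unquote(key)] = [unquote(v) for v in raw_value.split(",")]
    return attributes


parsed = parse_gff3_attributes("ID=gene1;Name=abc;Alias=a1,a2;Note=hello%20world")
print(parsed)
```

Splitting before decoding matters because an escaped comma (%2C) inside a value must not be treated as a separator.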
Read Variant Call Format (VCF)-formatted data for genomic polymorphism information from a file specified by name or from a string.
See the description of the VCF format.
There are two ways to process VCF data. The first is to parse a static file iteratively: create the instance with the standard constructor VcfParser(fname), then iterate over lines in a for loop (alternatively, call VcfParser.next() directly). The second is to create the instance with the class factory method VcfParser.from_header(string), then feed each line manually as a string using VcfParser.read_line(string).
VcfParser instances are iterable (with support for the for statement and the next() method) only if they are created with a file to process. Otherwise they must be fed line by line with the read_line() method. Every round of a for loop and every call to next() or read_line() yields a (chromosome, position, num_all) tuple that allows the user to determine whether the variant is of interest. Note that the position is considered as an index and has therefore been decremented by one compared with the value found in the file.
If the variant is of interest, the VcfParser object provides methods to extract all data for this variant (which can be time-consuming and should be restricted to pre-filtered lines to improve efficiency).
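The content of the (chromosome, position, num_all) tuple can be illustrated by a minimal sketch that reads only the first columns of a VCF data line. It does not use EggLib: summarize_vcf_line is a hypothetical helper, and it assumes a tab-separated line with the standard column order (CHROM, POS, ID, REF, ALT, ...).

```python
def summarize_vcf_line(line):
    """Return (chromosome, 0-based position, number of alleles) for one line."""
    fields = line.rstrip("\n").split("\t")
    chromosome = fields[0]
    position = int(fields[1]) - 1  # the file is 1-based, the index is 0-based
    alt = fields[4]
    num_alternates = 0 if alt == "." else len(alt.split(","))
    return chromosome, position, 1 + num_alternates  # reference always counted


summary = summarize_vcf_line("chr1\t100\t.\tA\tC,T\t50\tPASS\t.")
print(summary)

# Cheap pre-filtering: only variants passing this test deserve full extraction.
is_of_interest = summary[0] == "chr1" and summary[2] == 3
```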
New in version 3.0.0.
To use the method bed_slider, a BED file must first be loaded with the method get_bed_file.
File format present in the read header.
Create and return a new VcfParser instance reading the header passed as the string argument.
Load a BED file (.bed) into the VcfParser object. The BED file is used to retrieve variants from the VCF file according to the chromosomal coordinates stored in the loaded BED file.
Get the index of the first variant of a chromosome in the index file.
|Parameters:||chromosome – name of the desired chromosome (string).|
|Beware:||An index file must be loaded in the VcfIndex instance with the function set_index_file before using the method get_contigu_index.|
There are two ways to use this method:
This method requires that all parsers have been used to process valid data. If the AA field is present in the processed parsers, its value is imported as outgroup. If get_genotypes is True, the ancestral genotype is assumed to be homozygous for the given ancestral allele. If get_genotypes is False, the ancestral allele is loaded only once in the outgroup.
A Site instance by default, or None if dest was specified.
Get the start stream position of the first variant linked to the last chromosome.
|Beware:||An index file must be loaded in the VcfIndex object with the function set_index_file before using the method get_contigu_index.|
Get data for a given META field defined in the VCF header. The passed index must be smaller than num_meta. Return a tuple containing the key and the value of the META field.
Get the index of a variant according to a chromosome and a chromosomal position.
|Parameters:||chromosome – name of the desired chromosome (string). position – chromosomal position on that chromosome, according to the index file (int).|
|Beware:||An index file must be loaded in the VcfIndex object with the function set_index_file before using the method get_contigu_index.|
Get the name of a sample read from the header. The passed index must be smaller than num_samples.
Get the index of a variant according to its rank (the position of the variant in the VCF file).
|Parameters:||line – a line number smaller than or equal to the number of indexes in the loaded index file (int).|
Tell if the file is good for reading (available valid stream, and not end of file).
Move the VcfParser instance to a specific position in the VCF file according to a chromosome and a chromosomal position. If only a chromosome is passed, or if the position argument is None, the VcfParser instance is moved to the first variant of this chromosome in the VCF file.
|Parameters:||chromosome – name of the desired chromosome (string). position – chromosomal position on that chromosome (int).|
Load an index file into the VcfParser instance. An index file speeds up the method VcfWindow.goto().
Check whether the VcfParser instance has a linked index file loaded.
Return a Variant instance containing all data available for the last variant processed by this instance. It is required that a variant has been effectively processed.
Create an index file.
|Parameters:||output – name of the index file to create. If output is None, the index file is given a default name. The extension of an index file is “vcfi”.|
|Parameters:||load – if True, the VcfParser loads the index file that was just created; otherwise, no data are loaded after its creation. call VcfWindow.next()|
Read one variant. Raise a StopIteration exception if no data is available.
|Returns:||The same as an iteration loop (see class description).|
Number of defined ALT fields.
Number of defined FILTER fields.
Number of defined FORMAT fields.
Number of indexes loaded in the VcfParser instance.
Number of defined INFO fields.
Number of defined META fields.
Number of samples read from header.
An index file speeds up the progression of the VcfParser instance through the variants of a VCF file. This binary file contains the start position of the line of each variant, associated with its chromosome and chromosomal position. This makes it possible to move the parser to a specific position given a chromosome and/or a chromosomal position.
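The principle of such an index can be sketched in plain Python: record the stream offset at which each variant line starts, then jump directly to it. This is only an illustration of the idea. The layout of EggLib's binary index file is not reproduced here, and build_index is a hypothetical helper working on an in-memory stream.

```python
import io


def build_index(stream):
    """Map (chromosome, 0-based position) to the offset of each variant line."""
    index = {}
    while True:
        offset = stream.tell()  # where the next line starts
        line = stream.readline()
        if not line:
            break  # end of stream
        if line.startswith(b"#"):
            continue  # header lines are not indexed
        chromosome, position = line.split(b"\t", 2)[:2]
        index[(chromosome.decode(), int(position.decode()) - 1)] = offset
    return index


vcf_stream = io.BytesIO(
    b"##fileformat=VCFv4.2\n"
    b"#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n"
    b"chr1\t10\t.\tA\tC\t.\t.\t.\n"
    b"chr2\t5\t.\tG\tT\t.\t.\t.\n"
)
index = build_index(vcf_stream)

# Jump straight to the variant at chr2, index 4, without re-reading the file.
vcf_stream.seek(index[("chr2", 4)])
line_at_offset = vcf_stream.readline()
print(line_at_offset)
```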
Find, read, and load as a VcfIndex the index file linked to the current VcfParser instance. The index file is searched for under a default name generated from the name of the VCF file loaded in the current VcfParser.
|Parameters:||fname – name of a properly formatted index file. The index file must be linked to the VCF data of the VcfParser instance: the header of an index file records the “EOF” (end of file) index of the VCF file it was built from, and the “EOF” indexes of the index file and of the VCF file passed to the VcfParser must match.|
Read one variant from a user-provided single line. The string should contain a single line of VCF-formatted data (no header). All field specifications and sample information should be consistent with the information contained in the header that was provided to this instance at creation time (whether it was read from a file or also provided as a string).
|Returns:||The same as an iteration loop (see class description).|
Move the VcfParser instance back to the first variant of the loaded VCF file.
To use a VcfWindow object in a loop statement, keep the parameter fill set to False.
Unread the last variant read by the VcfParser with the method next.
Represent a single variant (one line from a VCF-formatted data file). The user cannot create instances of this class directly (instances are generated by VcfParser) and instances are not modifiable in principle (however, some attributes provide mutable objects, as mentioned).
The AA (ancestral allele), AN (allele number), AC (allele count), and AF (allele frequency) INFO fields as well as the GT (deduced genotype) FORMAT are automatically extracted if they are present in the file and if their definition matches the format specification (meaning that they were not re-defined with a different number/type) in the header. If present, they are available through the dedicated attributes AN, AA, AC, AF, GT, GT_ploidy and GT_phased. However, they are still available in the respective info and samples (sub)-dictionaries.
Value of the AA info field (None if missing).
Value of the AC info field, as a tuple (None if missing).
Value of the AF info field, as a tuple (None if missing).
Value of the AN info field (None if missing).
Genotypes from GT fields (only if this format field is available), provided as a tuple of sub-tuples. The number of sub-tuples is equal to the number of samples (num_samples). The number of items within each sub-tuple is equal to the ploidy (GT_ploidy). These items are allele expressions (as found in alleles), or None (for missing values). This attribute is None if GT is not available.
Boolean indicating whether the genotype for each sample is phased (None if GT is not available).
Ploidy among genotypes (None if GT is not available).
GT field as written in a VCF file.
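The relation between the raw GT string and the decoded genotype can be sketched as follows. This is a stand-alone illustration based on the VCF conventions (allele indexes separated by "/" for unphased or "|" for phased genotypes, "." for missing values). The function decode_gt is hypothetical and is not part of EggLib.

```python
import re


def decode_gt(gt_string, alleles):
    """Return (genotype, phased) for one sample's GT string."""
    # A genotype is reported as phased when the "|" separator is used.
    phased = "|" in gt_string
    indexes = re.split(r"[/|]", gt_string)
    # Replace each index by the allele expression, or None if missing.
    genotype = tuple(None if i == "." else alleles[int(i)] for i in indexes)
    return genotype, phased


alleles = ("A", "T")  # reference first, then alternate alleles
phased_call = decode_gt("0|1", alleles)
missing_call = decode_gt("./1", alleles)
print(phased_call, missing_call)
```

The length of each decoded genotype corresponds to the ploidy of the sample.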
Tuple containing all IDs (even if just one or none).
Variant alleles (the first is the reference and is not guaranteed to be present in samples), as a tuple.
Alternate allele symbolizing a breakend (see VCF description for more details).
Explicit alternate allele (the string represents the nucleotide sequence of the allele).
Alternate allele referring to a pre-defined allele (the string provides the ID of the allele).
Alternate allele types, as a tuple. One value is provided for each alternate allele. The provided values are integers that should always be compared to the class attributes alt_type_default, alt_type_referred and alt_type_breakend, as in (for the type of the first alternate allele):
    type_ = variant.alternate_types[0]
    if type_ == variant.alt_type_default:
        allele = variant.allele(0)
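How the three kinds of alternate allele differ in the file itself can be shown with a small classifier written in plain Python. It follows the VCF conventions (pre-defined alleles are written as <ID>, breakends contain square brackets) and ignores rarer notations such as single breakends. The constants and the function classify_alt are hypothetical and their values are not those used by EggLib.

```python
# Hypothetical type codes, for illustration only.
ALT_DEFAULT, ALT_REFERRED, ALT_BREAKEND = 0, 1, 2


def classify_alt(alt):
    """Classify one ALT string of a VCF line."""
    if alt.startswith("<") and alt.endswith(">"):
        return ALT_REFERRED  # ID of a pre-defined allele, e.g. <DEL>
    if "[" in alt or "]" in alt:
        return ALT_BREAKEND  # mate position enclosed in square brackets
    return ALT_DEFAULT  # explicit nucleotide sequence


alt_types = tuple(classify_alt(alt) for alt in ("T", "<DEL>", "G]17:198982]"))
print(alt_types)
```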
Chromosome name (None if missing).
Names of the filters that this variant failed, as a tuple (None if no filters applied).
IDs of the FORMAT fields available for each sample, as a frozenset (empty if no sample data is available).
Dictionary of INFO fields for this variant. Keys are the IDs of the INFO fields available for this variant, and values are always a tuple of items. For flag INFO types, the value is always an empty tuple.
Number of alleles (including the reference in all cases).
Number of samples (equivalent to len(Variant.samples)).
Position (as an index; first value is 0) (None if missing).
Variant quality (None if missing).
This class provides a sliding window over a VcfParser instance. It cannot be instantiated directly: to create a sliding window, call the method VcfParser.slider().
VcfWindow instances are iterable through the special method __iter__() (with support for the for statement). The iteration is performed over the sites of the current sliding window, and each iteration returns a stats._site.Site instance. VcfWindow instances also support access by index (self[key]) through the method __getitem__(), which returns the site at the given index. Negative values are not allowed and cause a ValueError.
Print all variables used to configure the sliding window.
Configure a sliding window
an object of the VcfWindow.
Move the sliding window forward, from the initial position of the VcfParser to the chromosomal position passed as the stop parameter of the configuration method, or until the last variant of the VcfParser instance or of the chromosome read by the sliding window. The progression depends on the window size and on the increment. To start the sliding window at a specific position, call the method VcfParser.goto() before calling the method VcfParser.slider().
Get the size of the current sliding window in base pairs. This size is calculated as the difference between the chromosomal positions of the first and the last variant of the VcfWindow instance.
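The interplay between window size, increment and stop position can be illustrated with a generic generator over sorted variant positions. This is a sketch of the general technique only: the function sliding_windows and its arguments are hypothetical, and the exact boundary rules used by VcfWindow may differ.

```python
def sliding_windows(positions, size, step, start=0, stop=None):
    """Yield (window start, window end, positions inside) for each window."""
    if stop is None:
        stop = positions[-1] + 1  # go on until the last variant is covered
    window_start = start
    while window_start < stop:
        window_end = min(window_start + size, stop)  # end is exclusive
        inside = [p for p in positions if window_start <= p < window_end]
        yield window_start, window_end, inside
        window_start += step  # the increment controls the overlap


variant_positions = [1, 4, 9, 12]  # assumed sorted, on one chromosome
windows = list(sliding_windows(variant_positions, size=5, step=5))
print(windows)

# Span between the first and the last variant of the first window.
first_window_span = windows[0][2][-1] - windows[0][2][0]
```

Using a step smaller than the size gives overlapping windows, while a step equal to the size gives adjacent ones.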
Imports a clustal-formatted alignment. The input format is the one generated and used by CLUSTALW (see http://web.mit.edu/meme_v4.9.0/doc/clustalw-format.html).
|Parameters:||string – input clustal-formatted sequence alignment.|
|Returns:||A new Align instance.|
Changed in version 3.0.0: Renamed (previous name was aln2fas()). Input argument is a string rather than a file.
Import the output file of the GAP4 program of the Staden package.
The input file should have been generated from a contig alignment by the GAP4 contig editor, using the command “dump contig to file”. The sequence named CONSENSUS, if present, is automatically removed unless the option delete_consensus is False.
Staden’s default convention is followed:
New in version 2.0.1: Add argument delete_consensus.
Changed in version 2.1.0: Read from string or fname.
Changed in version 3.0.0: Renamed from_staden(). Only string input is supported now.
Converts Genalys-formatted sequence alignment files to fasta. This function imports files generated through the option Save SNPs of Genalys 2.8.
|Parameters:||string – input data as a Genalys-formatted string.|
|Returns:||An Align instance.|
Changed in version 3.0.0: Renamed from_genalys(). Only string input is supported now.
Imports fgenesh output.
|Parameters:||fname – a string containing fgenesh output.|
|Parameters:||locus – locus name.|
|Returns:||A list of gene and CDS features represented by dictionaries. Note that 5’ partial features might not be in the appropriate frame and that it can be necessary to add a codon_start qualifier.|
Changed in version 3.0.0: Input as string. Added locus argument.
This class represents a GenBank-formatted DNA sequence record.
Only one of the two arguments fname and string can be non-None. If both are None, the constructor generates an empty instance with sequence of length 0. If fname is non-None, a GenBank record is read from the file with this name. If string is non-None, a GenBank record is read directly from this string. The following variables are read from the parsed input if present: accession, definition, title, version, GI, keywords, source, references (which is a list), locus and others. Their default value is None except for references and others for which default is an empty list. source is a (description, species, taxonomy) tuple. Each of references is a (header, raw reference) tuple and each of others is a (key, raw) tuple.
In addition to methods documented below, the following operations are supported for gb if it is a GenBank instance:
|len(gb)||Length of the sequence attached to this record|
|str(gb)||GenBank representation of the record|
|for feat in gb||Iterate over GenBankFeature instances of this record|
Add a feature to the instance. The argument feature must be a well-formed GenBankFeature instance.
Return a new GenBank instance representing a subset of the current instance, from position from_pos to to_pos. All (and only) features that are completely included in the specified range are exported.
Give the number of features contained in the instance.
Reverse-complement the instance (in place). All feature positions and the sequence are reversed and applied to the complementary strand. The features are then sorted by increasing start position (after reversing). This method should be applied only to genuine nucleotide sequences.
Sequence string (can be modified). Note that changing the record’s string might make the features obsolete (meaning that setting an invalid sequence might cause the features to point to incorrect or out-of-bounds regions of the sequence).
Create a file named fname and write the formatted record to it.
Write the content of the instance as a GenBank-formatted string to the passed file (or file-compatible) stream.
Instances of this class represent features associated with a GenBank instance. They should not be instantiated or used separately from a GenBank instance. The constructor creates an empty instance (although a GenBank instance must be passed as parent) and either set() or parse() must be used subsequently.
|Parameters:||parent – a GenBank instance to which the feature should be attached.|
In addition to methods documented below, the following operations are supported for feat if it is a GenBankFeature instance:
|str(feat)||GenBank representation of the feature|
Add a qualifier to the instance’s qualifiers.
First position of the first (or unique) segment, in such a way that start() is always smaller than stop().
Last position of the last (or unique) segment, in such a way that start() is always smaller than stop().
Return the type string of the instance.
Update feature information from information read in a GenBank-formatted string.
Return a dictionary with all qualifier values. Changes to the returned dictionary do not affect the data contained in the instance, so this method cannot be used to modify the instance.
Changed in version 2.1.0: Meaning changed.
Reverse-complement the feature: apply it to the complement strand and reverse positions counting from the end. The length argument specifies the length of the complete sequence and is usually not required.
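Counting positions from the end amounts to a simple coordinate transformation, sketched below in plain Python. The sketch assumes 1-based positions with both limits included, as in the GenBank format. Whether EggLib uses the same convention internally is an assumption, and reverse_segments is a hypothetical helper.

```python
def reverse_segments(segments, length):
    """Map (first, last) segments to the coordinates of the reversed sequence."""
    # Position p becomes length - p + 1, so first and last swap roles.
    reversed_segments = [
        (length - last + 1, length - first + 1) for first, last in segments
    ]
    # Reverse the order so that segments are again sorted by start position.
    reversed_segments.reverse()
    return reversed_segments


segments = [(2, 4), (7, 10)]
flipped = reverse_segments(segments, length=10)
print(flipped)

# Applying the transformation twice restores the original coordinates.
restored = reverse_segments(flipped, length=10)
```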
Shift all positions according to the (positive or negative) argument.
Update feature information.
Hold the location of a GenBank feature. Supports various forms of location as defined in the GenBank format specification. The constructor contains a parser working from a GenBank-formatted string. By default, features are on the forward strand and segmented features are ranges (not orders).
In addition to methods documented below, the following operations are supported for loc if it is a GenBankFeatureLocation instance:
|len(loc)||Number of segments|
|loc[index]||Return the (first, last) tuple for the corresponding segment|
|for (first, last) in loc||Iterator over segments|
|str(loc)||Generate a GenBank representation|
GenBankFeatureLocation supports iteration over (first, last) segments regardless of their types (for a single-base segment at position position, the tuple (position, position) is returned; similar 2-item tuples are returned for other types of segment as well).
Add a segment corresponding to a single base chosen within a base range. If no segments were entered previously, set the unique segment location. first and last must be integers. The feature will be set between first and last positions, including both limits. If the feature is intended to be placed on the complement strand between positions, say, 1127 and 1482, one must use add_base_choice(1127,1482) in combination with set_complement(). All entered positions must be larger than any positions entered previously and last must be strictly larger than first. left_partial and/or right_partial must be set to True if, respectively, the real start of the segment lies 5’ of first and/or the real end of the segment lies beyond last (relative to the forward strand and consistent with the numbering system).
Add a base range to the feature. If no segments were entered previously, set the unique segment location. first and last must be integers. The feature will be set between first and last positions, including both limits. If the feature is intended to be placed on the complement strand between positions, say, 1127 and 1482, one must use add_base_range(1127,1482) in combination with set_complement(). All entered positions must be larger than any positions entered previously and last must be larger than or equal to first. left_partial and/or right_partial must be set to True if, respectively, the real start of the segment lies 5’ of first and/or the real end of the segment lies beyond last (relative to the forward strand and consistent with the numbering system).
Add a segment lying between two consecutive bases. If no segments were entered previously, set the unique segment location. position must be an integer. The feature will be set between position and position + 1. If the feature is intended to be placed on the complement strand between positions, say, 1127 and 1128, one must use add_between_base(1127) in combination with set_complement(). All entered positions must be larger than any positions entered previously.
Add a single-base segment to the feature. If no segments were entered previously, set the unique segment location. position must be an integer. All entered positions must be larger than any positions entered previously.
Define the feature as an order instead of a range.
Define the feature as a range, which is the default.
Return a deep copy of the current instance.
True if the feature is on the complement strand.
True if the feature is a range (the default), False if it is an order.
Reverse the feature positions: positions are modified to be counted from the end. The length of the complete sequence must be passed.
Place the feature on the complement strand.
Place the feature on the forward (not complement) strand, which is the default.
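The effect of the range/order and strand settings on the GenBank representation can be illustrated with a short formatter. It follows the location syntax of the GenBank feature table (first..last, join(...), order(...), complement(...)) but leaves out partial, between-base and base-choice segments. The function format_location is hypothetical and is not the code used by GenBankFeatureLocation.

```python
def format_location(segments, complement=False, as_range=True):
    """Build a GenBank location string from (first, last) segments."""
    parts = [
        str(first) if first == last else "{}..{}".format(first, last)
        for first, last in segments
    ]
    if len(parts) == 1:
        text = parts[0]  # a unique segment needs no operator
    else:
        operator = "join" if as_range else "order"
        text = "{}({})".format(operator, ",".join(parts))
    # Positions are always given on the forward strand numbering.
    return "complement({})".format(text) if complement else text


single = format_location([(1127, 1482)], complement=True)
joined = format_location([(1, 10), (20, 20)])
ordered = format_location([(1, 10), (20, 30)], as_range=False)
print(single, joined, ordered)
```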
Shift all positions according to the (positive or negative) argument.