EggLib

Import/export utilities

ms format

egglib.io.to_ms(data, fname=None, positions=None, spacer=None, include_outgroup=False, recode=False)

Export data in the format used by the ms program (Hudson 2002 Bioinformatics 18: 337-338). The original format is designed for binary (0/1) allelic values. This implementation exports any allelic values present in the provided alignment, always as integers: sequence data, if included, are represented by their corresponding integer values. In addition, spaces can be inserted between loci to accommodate allelic values outside the range [0,9]; without a separator, loci holding such values can no longer be told apart in the standard format. See the spacer option.

Parameters:
  • data – Alignments to export, either as an Align instance or as an iterable of Align instances. In the latter case, all instances are exported consecutively.
  • fname – Name of the file to export data to. By default, the file is created (or overwritten if it already exists). If the option append is True, data is appended at the end of the file (which must exist). If fname is None (default), no file is created and the formatted data is returned as a str. Otherwise, nothing is returned.
  • positions – The list of site positions, with length matching the alignment length. Positions are required to be in the [0,1] range but their order is not checked. By default (if the argument value is None), sites are assumed to be evenly spread over the [0,1] interval. The structure of this argument should match exactly that of the argument data: if data is a single Align, positions should be a single list of positions, and if data is a list of Align instances (even if the length of this list is one), positions should be a list (of lists of positions) of the same length. If a list is provided, any of its items (each supposed to represent a list of positions) can be replaced by None.
  • spacer – Define whether a space must be inserted between allelic values. If None, the space is inserted only if at least one allele at any locus is outside the range [0,9]. If True, the space is always inserted. If False, the space is never inserted. The automatic detection of out-of-range allelic values comes at the cost of increased running time.
  • include_outgroup – A boolean: if True, the outgroup is exported after the ingroup. Otherwise, the outgroup is skipped.
  • recode – If True, all allelic values are recoded so that the first encountered value is 0, the second is 1, and so on. The original alignments are left unmodified.

The format is as follows:

  • One line with two slashes.

  • One line with the number of sites.

  • One line with the positions, or an empty line if the number of sites is zero.

  • The matrix of genotypes (one line per sample), only if the number of sites is larger than zero.
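
The layout above, together with the automatic behaviour of the spacer option, can be sketched in plain Python. This is only an illustration that does not call EggLib: format_ms is a hypothetical helper, the "segsites:" and "positions:" prefixes follow the output of the ms program (the description above does not say whether EggLib writes them), and the way default positions are spread over [0,1] is an assumption.

```python
def format_ms(genotypes, positions=None, spacer=None):
    """Format one alignment as an ms-style block (illustrative sketch).

    genotypes: list of samples, each a list of integer allelic values.
    """
    num_sites = len(genotypes[0]) if genotypes else 0
    if positions is None:
        # Assumption: default positions are spread evenly over [0, 1].
        positions = [i / (num_sites - 1) if num_sites > 1 else 0.5
                     for i in range(num_sites)]
    if len(positions) != num_sites:
        raise ValueError("positions must match the number of sites")
    if spacer is None:
        # Automatic detection: a separator is needed as soon as one
        # allelic value does not fit in a single digit.
        spacer = any(not 0 <= a <= 9 for row in genotypes for a in row)
    sep = " " if spacer else ""
    # Prefixes borrowed from the ms program (assumption for EggLib).
    lines = ["//", "segsites: %d" % num_sites]
    if num_sites == 0:
        lines.append("")  # empty line in place of the positions
    else:
        lines.append("positions: " + " ".join("%.5f" % p for p in positions))
        lines.extend(sep.join(str(a) for a in row) for row in genotypes)
    return "\n".join(lines) + "\n"
```

With single-digit alleles the genotypes are written without separators (010, 110); as soon as a value such as 12 appears, the sketch switches to space-separated values so that loci remain distinguishable.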

Fasta format

Here is the description of the fasta format used in EggLib:

  • Each sequence is preceded by a header limited to a single line and starting with a > character.
  • The header length is not limited and all characters are allowed, but white spaces and special characters are discouraged. The header is terminated by a newline character.
  • Group labels are specified by a special markup system placed at the end of the header line. Labels are introduced by an at sign (@) followed by any integer value (@0, @1, @2 and so on). Several group labels may be defined for any sequence. In that case, the integer values must be entered consecutively after the at sign, separated by commas, as in @1,3,2 for a sequence belonging to groups 1, 3 and 2 in three different grouping levels. Multiple grouping levels can specify a hierarchical structure, but independent grouping structures can also be freely specified. The markup @# (at sign followed by hash sign) specifies an outgroup sequence. The hash sign may be followed by a single integer to specify a unique group label. Multiple grouping levels are not allowed for the outgroup. The group labels of the ingroup and the outgroup are independent, so the same labels may be used. The at sign can be preceded by a single space: the parser automatically discards one space before the at sign (both >name@1 and >name @1 are read as name), but if there is more than one space, the additional spaces are considered part of the name. By default, no grouping structure is assumed and all sequences are assumed to be part of the ingroup.
  • The sequence itself continues on the following lines until the next > character or the end of the file.
  • White spaces, tabs and carriage returns are allowed at any position. They are ignored, except for the newline that terminates the header line. There is no limitation in length and different sequences can have different lengths.
  • Character case is preserved and significant (although polymorphism analysis can be configured to take case-differing characters as synonyms).
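
The group-label markup described above can be pictured with a small stand-alone parser. This is a sketch written for illustration, not EggLib's parser: parse_header is a hypothetical name, labels are assumed to be non-negative integers, and a header whose trailing markup is malformed is assumed to be kept whole as the name.

```python
def parse_header(header):
    """Split a fasta header into (name, labels, is_outgroup)."""
    text = header[1:] if header.startswith(">") else header
    name, at, tag = text.rpartition("@")
    if not at:
        return text, [], False          # no markup: ingroup, no labels
    if tag.startswith("#"):
        rest = tag[1:]
        if rest and not rest.isdigit():
            return text, [], False      # assumption: malformed tag stays in name
        labels = [int(rest)] if rest else []
        outgroup = True
    else:
        parts = tag.split(",")
        if not all(p.isdigit() for p in parts):
            return text, [], False      # assumption: malformed tag stays in name
        labels = [int(p) for p in parts]
        outgroup = False
    if name.endswith(" "):
        name = name[:-1]                # discard a single space before '@'
    return name, labels, outgroup
```

For instance, both ">name@1" and ">name @1" give the name "name" with label 1, whereas ">name  @1" (two spaces) keeps one trailing space in the name.
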
egglib.io.from_fasta(source, groups=False, string=False, cls=None)

Create a new instance of either Align or Container from data read from the file whose name is provided as argument (or, if string is set to True, from the string passed as first argument).

Parameters:
  • source – name of a fasta-formatted sequence file. If the string argument is True, source is read as a fasta-formatted string. If the returned type is Align, the sequences are required to be aligned.
  • groups – boolean indicating whether group labels should be imported. If so, they are not actually required to be present for each (or any) sequence. If not, the labels are ignored and considered to be part of the sequence name.
  • string – boolean indicating whether the first argument is an explicit fasta-formatted string (by default, it is taken as the name of a fasta-formatted file).
  • cls – type that should be generated. Possible values are: Align (then, data must be aligned), Container or None. In the latter case, an Align is returned if data are found to be aligned or if the data set is empty, and otherwise a Container is returned.
Returns:

A new Container or Align instance depending on the value of the cls option.
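
The rule applied when cls is None can be pictured with a short stand-alone sketch. It does not call EggLib: pick_type is a hypothetical helper that returns the name of the type that would be chosen, assuming that "aligned" simply means that all sequences share one length.

```python
def pick_type(sequences):
    """Return "Align" or "Container" following the cls=None rule."""
    lengths = {len(seq) for seq in sequences}
    # An empty data set or a single shared length counts as aligned.
    return "Align" if len(lengths) <= 1 else "Container"
```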

class egglib.io.fasta_iter(fname, groups=False)

Iterative sequence-by-sequence fasta parser. Return an object that can be iterated over:

import egglib

for item in egglib.io.fasta_iter(fname):  # fname: name of a fasta file
    pass  # process item here

fasta_iter objects support the with statement:

with egglib.io.fasta_iter(fname) as f:
    for item in f:
        pass  # process item here

Each iteration yields a SampleView instance (which is valid only during the iteration round, see the warning below). It is also possible to iterate manually using next(). The number of groups is defined by the current sample (if the number of defined groups varies among samples, it is reset at each iteration).

Warning

The aim of this iterator is to iterate over large fasta files without storing all data in memory at the same time. The SampleView instance provided at each iteration is a proxy to a local Container instance that is recycled at each iteration step. It should be used immediately and never stored as is. To keep data accessible through a SampleView, copy them to another data structure (typically using Container.add_sample()).
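
The pitfall described in this warning can be reproduced without EggLib. In the sketch below, Record and iter_fasta are hypothetical stand-ins for the recycled proxy and the iterator: the generator refills and yields one and the same object, so storing the yielded objects keeps only the last sequence, while copying the values at each step keeps everything.

```python
class Record:
    """Mutable record reused across iterations (stand-in for a proxy)."""
    def __init__(self):
        self.name = ""
        self.sequence = ""

def iter_fasta(lines):
    """Yield the same Record object, refilled for each sequence."""
    rec = Record()
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                rec.name, rec.sequence = name, "".join(chunks)
                yield rec
            name, chunks = line[1:], []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        rec.name, rec.sequence = name, "".join(chunks)
        yield rec

data = [">a", "ACGT", ">b", "GG", "TT"]
# Wrong: both items refer to the single recycled object.
stored = list(iter_fasta(data))
# Right: values are copied while the record is still valid.
copied = [(r.name, r.sequence) for r in iter_fasta(data)]
```

After this runs, every element of stored describes sequence "b", whereas copied holds both sequences.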

Parameters:
  • fname – name of a fasta-formatted file.
  • groups – if True, import group labels from sequence names (by default, they are considered as part of the name).

New in version 3.0.0.

next()

Perform an iteration round. Raise a StopIteration exception if the file is exhausted. The normal usage of this type of objects is with the for statement.

GFF3 format

class egglib.io.GFF3(source, from_string=False, liberal=False, only_genes=False)

Read General Feature Format (GFF)-formatted (version 3) genome annotation data from a file specified by name or from a provided string.

See the description of the GFF3 format at http://www.sequenceontology.org/gff3.shtml.

This class supports segmented features but only if they are consecutive in the file. All features are loaded into memory and can be processed interactively.

Parameters:
  • source – name of a GFF3-formatted data file, or a GFF3-formatted string if from_string is True.
  • from_string – if True, the first argument is a GFF3-formatted string; if False, the first argument is a file name.
  • liberal – if True, support some violations of the GFF3 format.
  • only_genes – if True, only index top-level features that have gene as type. This does not reduce memory usage, but it reduces access time.

The liberal argument enables support for a few violations of the canonical GFF3 format. The current list of violations is:

  • CDS features may lack a phase.

The list of supported violations may change in the future between minor versions of EggLib. It is recommended to use liberal=True only for exploratory analyses. Otherwise, it is better to fix the original file so that it complies with the format.

New in version 3.0.0.

feature_iter(seqid, start, end, feat_type=None)

Return an iterator over features, returned as GFF3Feature instances. Only top features complying with arguments are considered.

Parameters:
  • seqid – seqid identifier (only features associated with this seqid are yielded).
  • start – start position on the considered seqid (only features whose start position is >= this value are yielded).
  • end – end position on the considered seqid (only features whose end position is <= this value are yielded). One can use None to process features until the end of the seqid.
  • feat_type – process only features whose type is equal to this value. By default, process all features within range.

If seqid is not valid for this data file, raise a ValueError.
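
The selection rules above can be written out as a plain-Python filter. This sketch does not use EggLib: Feature and select_features are hypothetical names, and the features are simple tuples rather than GFF3Feature instances.

```python
from collections import namedtuple

# Minimal stand-in for a top-level feature (hypothetical type).
Feature = namedtuple("Feature", "seqid start end type")

def select_features(features, seqid, start, end, feat_type=None):
    """Yield features of `seqid` lying within [start, end]."""
    if seqid not in {f.seqid for f in features}:
        raise ValueError("unknown seqid: %s" % seqid)
    for f in features:
        if f.seqid != seqid or f.start < start:
            continue
        if end is not None and f.end > end:
            continue                     # None means: up to the end of seqid
        if feat_type is not None and f.type != feat_type:
            continue
        yield f
```

Because the function is a generator, the ValueError for an unknown seqid is raised when iteration starts, not when the function is called.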

metadata

Metadata of the imported file (information present in the file header). Metadata are available as a list of (key, value) tuples. The list cannot be replaced by another object, but the returned object may be modified.

num_top_features

Number of features that are directly accessible (that is, those that do not have a parent).

num_tot_features

Total number of features of the input file, including lower-level features and features that have not been indexed, if any.

seqid

All seqid values present in the imported file, as a frozenset instance.

types

Number of top features of each of the encountered types. This does not include lower-level features (that is, those that have a parent). The value is a dict (there is no good reason to modify it).

class egglib.io.GFF3Feature

Provide information related to a given feature. Currently, instances cannot be created by the user and are read-only. Data are available as read-only properties; some of these are lists and can be modified but it does not make any sense to do so.

ID

Value of the ID attribute, or None if this attribute was not defined.

aliases

List of Alias attributes.

attributes

List of non-predefined attributes, as (key, items) tuples, where items is itself a list.

dbxref

List of Dbxref attributes.

derives_from

Value of the Derives_from attribute, or None if this attribute was not defined.

end

End position.

gap

Value of the Gap attribute, or None if this attribute was not defined.

get_parent(idx)

Get a parent, as a GFF3Feature instance. The index should be within range.

get_part(idx)

Get a part (descending feature), as a GFF3Feature instance. The index should be within range.

is_circular

Value of the Is_circular attribute, as a boolean.

name

Value of the Name attribute, or None if this attribute was not defined.

notes

List of Note attributes.

num_fragments

Number of fragments.

num_parents

Number of parents.

num_parts

Number of parts (descending features).

ontology_terms

List of Ontology_term attributes.

phase

Value of the Phase attribute. Possible values are: 0, 1, 2 and None (if undefined).

positions

List of start/end positions of all fragments.

score

Value of the Score attribute, or None if this attribute was not defined.

seqid

Value of seqid for this feature.

source

Source of this feature.

start

Start position.

target

Value of the Target attribute, or None if this attribute was not defined. The target value is not processed and is provided as a single string.

type

Type of this feature.

VCF format

class egglib.io.VcfParser(fname, allow_X=False, allow_gap=False, find_index=True)

Read Variant Call Format (VCF)-formatted data for genomic polymorphism information from a file specified by name or from strings.

Parameters:
  • fname – name of a properly formatted VCF file. The header section is processed upon instance creation, and lines are read later, when the user iterates over the instance (or calls next()).
  • allow_X – if True, the characters X and x can be used instead of a base in alternate alleles. This is not allowed in the VCF specification but some software has actually used it. If X is allowed and one is found, the alternate type will be set to an ad hoc type and the corresponding allele string will be X (regardless of the original case).
  • allow_gap – if True, the gap symbol - is accepted as a valid base for the specification of both reference and alternate alleles. This is not allowed in the VCF specification, which follows a different convention to represent insertions and deletions.
  • find_index – if True, the VcfParser instance looks for an index file linked to the VCF data (under a default name) and stores it in the attribute _index as a VcfIndex instance. If no file is found, or if the file found does not match the VCF data, the attribute _index is None.

See the description of the VCF format.

There are two ways to process VCF data. The first parses a static file iteratively: create the instance with the standard constructor VcfParser(fname), then iterate over lines in a for loop (or call VcfParser.next() directly). The second uses the class factory method VcfParser.from_header(string), after which each line is fed manually as a string using VcfParser.read_line(string).

VcfParser instances are iterable (with support for the for statement and the next() method) only if they are created with a file to process. Otherwise they must be fed line by line with the read_line() method. Each step of a for loop and each call to next() or read_line() yields a (chromosome, position, num_all) tuple that allows the user to determine whether the variant is of interest. Note that the position is treated as an index and has therefore been decremented compared with the value found in the file.

If the variant is of interest, the VcfParser object provides methods to extract all data for this variant (which can be time-consuming and should be restricted to pre-filtered lines to improve efficiency).
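
The content of the (chromosome, position, num_all) tuple can be illustrated with a plain-Python sketch that reads one VCF data line. It does not call EggLib: summarize_line is a hypothetical helper, and num_all is assumed here to count the reference allele plus the alternate alleles.

```python
def summarize_line(line):
    """Return (chromosome, position, num_all) for one VCF data line."""
    fields = line.rstrip("\n").split("\t")
    chrom, pos, alt = fields[0], int(fields[1]), fields[4]
    # "." in the ALT column means that no alternate allele is listed.
    num_alt = 0 if alt == "." else len(alt.split(","))
    # VCF positions are 1-based; subtract 1 to obtain a 0-based index.
    return chrom, pos - 1, num_alt + 1
```

A line located at position 100 in the file is therefore reported with position 99.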

New in version 3.0.0.

bed_slider(miss_pv, fill=False, flat_pv=False, start_pv=0, end_pv=None)

Parameters:
  • flat_pv – if True, ignore the individual level.
  • start_pv – if a subset of samples must be considered, index of the first sample to consider (by default, all samples are considered).
  • end_pv – if a subset of samples must be considered, index of the last sample to consider (by default, all samples are considered).
  • miss_pv – maximum number of missing alleles. If this threshold is exceeded, processing is stopped and get_missing() returns max_missing + 1. Only missing data in this data set are considered. The starting point of the sliding window depends on the current stream position of the VcfParser.
  • ploidy – the ploidy must be a strictly positive number.
  • fill – if True, the sliding window reads the first window immediately; otherwise no data are loaded into the VcfWindow instance until VcfWindow.next() is called.
Beware:

To use bed_slider(), a bed file must first be loaded with get_bed_file().

file_format

File format present in the read header.

classmethod from_header(string, allow_X=False, allow_gap=False)

Create and return a new VcfParser instance reading the header passed as the string argument.

Parameters:
  • string – single string including system-consistent line endings, the first line being the file format specification and the last line being the header line (starting with #CHROM). This function allows leading and trailing white spaces (spaces, tabs, empty lines).
  • allow_X – see class description.
  • allow_gap – see class description.
get_alt(idx)

Get data for a given ALT field defined in the VCF header. The passed index must be smaller than num_alt. Return a dict containing the following data:

  • id: ID string.
  • description: description string.
  • extra: all extra qualifiers, presented as a list of (key, value) tuples.
get_bed_file(fname)

Load a .bed file into this VcfParser instance. The bed file is used to select variants of the .vcf file according to the chromosomal coordinates it contains.
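
Selecting sites by BED coordinates can be sketched in plain Python. This is an illustration only and does not reflect how EggLib applies the bed file: read_bed and in_regions are hypothetical helpers, and the intervals are interpreted with the usual BED convention (0-based start, end excluded).

```python
def read_bed(lines):
    """Parse BED lines into (chrom, start, end) tuples."""
    regions = []
    for line in lines:
        # Skip blank lines and the optional comment/track/browser lines.
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue
        chrom, start, end = line.split()[:3]
        regions.append((chrom, int(start), int(end)))
    return regions

def in_regions(regions, chrom, index):
    """Tell if a 0-based site index falls in any region of `chrom`."""
    return any(c == chrom and s <= index < e for c, s, e in regions)
```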

get_contig_index(chromosome)

Get the index of the first variant of a chromosome in the index file.

Parameters:
  • chromosome – name of the desired chromosome (string).
Beware:

An index file must be loaded in the VcfIndex instance with the function set_index_file before calling get_contig_index().
get_filter(idx)

Get data for a given FILTER field defined in the VCF header. The passed index must be smaller than num_filter. Return a dict containing the following data:

  • id: ID string.
  • description: description string.
  • extra: all extra qualifiers, presented as a list of (key, value) tuples.
get_format(idx)

Get data for a given FORMAT field defined in the VCF header. The passed index must be smaller than num_format. Return a dict containing the following data:

  • id: ID string.
  • type: one of "Integer", "Float", "Character", and "String".
  • description: description string.
  • number: expected number of items. Special values are None (if undefined), "NUM_GENOTYPES" (number matching the number of genotypes for any particular variant), "NUM_ALTERNATE" (number matching the number of alternate alleles for any particular variant), and "NUM_ALLELES" (number matching the number of alleles–including the reference–for any particular variant).
  • extra: all extra qualifiers, presented as a list of (key, value) tuples.
get_genotypes([parser1, [parser2, ]]..., get_genotypes=False, dest=None)

Process genotype data loaded into one or more VcfParser instances and return them as a single Site instance.

There are two ways to use this method:

  1. As an instance method, to process a single parser, as in: parser.get_genotypes() (if parser is a VcfParser instance).
  2. As a class method, to process several parsers, as in: VcfParser.get_genotypes(parser1, parser2, parser3) (where parser1, parser2 and parser3 are three VcfParser instances).

This method requires that all parsers have been used to process valid data. If the AA field is present in the processed parsers, its value is imported as outgroup. If get_genotypes is True, the ancestral genotype is assumed to be homozygous for the given ancestral allele. If get_genotypes is False, the ancestral allele is loaded only once in the outgroup.

Parameters:
  • parser – a VcfParser instance (only required if used as a class method). This argument can be repeated several times, but can not be passed as a keyword argument.
  • get_genotypes – if True, use genotypic data rather than allelic data.
  • dest – if specified, it must be a Site instance that will be recycled and used to place results.
Returns:

A Site instance by default, or None if dest was specified.

get_index()

Return the index of the current position of the VcfParser instance.

get_info(idx)

Get data for a given INFO field defined in the VCF header. The passed index must be smaller than num_info. Return a dict containing the following data:

  • id: ID string.
  • type: one of "Integer", "Float", "Flag", "Character", and "String".
  • description: description string.
  • number: expected number of items. Special values are None (if undefined), "NUM_GENOTYPES" (number matching the number of genotypes for any particular variant), "NUM_ALTERNATE" (number matching the number of alternate alleles for any particular variant), and "NUM_ALLELES" (number matching the number of alleles–including the reference–for any particular variant).
  • extra: all extra qualifiers, presented as a list of (key, value) tuples.
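
The dictionary layout returned for header fields can be illustrated by parsing one ##INFO line by hand. This sketch does not call EggLib: parse_info_line is a hypothetical helper, and the mapping of the VCF Number codes ".", "A", "R" and "G" to None, "NUM_ALTERNATE", "NUM_ALLELES" and "NUM_GENOTYPES" is an assumption based on the VCF specification and the descriptions above.

```python
import re

# Assumed correspondence between VCF Number codes and special values.
SPECIAL_NUMBERS = {".": None, "A": "NUM_ALTERNATE",
                   "R": "NUM_ALLELES", "G": "NUM_GENOTYPES"}

def parse_info_line(line):
    """Turn one ##INFO=<...> header line into a dict."""
    body = line.strip()[len("##INFO=<"):-1]
    # Values are either quoted strings (may contain commas) or bare tokens.
    pairs = re.findall(r'(\w+)=("[^"]*"|[^,]*)', body)
    raw = [(key, value.strip('"')) for key, value in pairs]
    known = dict(raw)
    number = known["Number"]
    return {
        "id": known["ID"],
        "type": known["Type"],
        "description": known["Description"],
        "number": (SPECIAL_NUMBERS[number] if number in SPECIAL_NUMBERS
                   else int(number)),
        "extra": [(k, v) for k, v in raw
                  if k not in ("ID", "Number", "Type", "Description")],
    }
```
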
get_last_contig_index(chromosome)

Get the stream position of the first variant linked to the last chromosome.

Beware:

An index file must be loaded in the VcfIndex object with the function set_index_file before calling get_last_contig_index().
get_meta(idx)

Get data for a given META field defined in the VCF header. The passed index must be smaller than num_meta. Return a tuple containing the key and the value of the META field.

get_position_index(chromosome, position)

Get the index of a variant according to a chromosome and a chromosomal position.

Parameters:
  • chromosome – name of the desired chromosome (string).
  • position – chromosomal position on that chromosome, according to the index file (int).
Beware:

An index file must be loaded in the VcfIndex object with the function set_index_file before calling get_position_index().
get_sample(idx)

Get the name of a sample read from the header. The passed index must be smaller than num_samples.

get_variant_index(line)

Get the index of a variant according to its rank (the position of the variant in the VCF file).

Parameters:
  • line – line number, less than or equal to the number of entries in the loaded index file (int).
Beware:

An index file must be loaded in the VcfParser instance with read_index() before calling get_variant_index().
good

Tell if the file is good for reading (available valid stream, and not end of file).

goto(chromosome, position=None)

Move the VcfParser instance to a specific position in the VCF file according to a chromosome and a chromosomal position.

Parameters:
  • chromosome – name of the desired chromosome (string).
  • position – chromosomal position on that chromosome (int).

If only a chromosome is passed, or if position is None, the VcfParser instance is moved to the first variant of this chromosome in the VCF file.

Loading an index file into the VcfParser instance speeds up goto().
has_index

Check whether the VcfParser instance has an index file loaded.

last_variant()

Return a Variant instance containing all data available for the last variant processed by this instance. It is required that a variant has been effectively processed.

make_index(fname=None, load=False)

Create an index file.

Parameters:
  • fname – name of the index file to create. If None, the index file is given a default name. The extension of an index file is “vcfi”.
  • load – if True, the VcfParser loads the created index file as a VcfIndex instance; otherwise no data are loaded after its creation.
n_index

Number of indexes loaded in the VcfParser instance.

next()

Read one variant. Raise a StopIteration exception if no data is available.

Returns: The same as an iteration loop (see class description).
num_alt

Number of defined ALT fields.

num_filter

Number of defined FILTER fields.

num_format

Number of defined FORMAT fields.

num_info

Number of defined INFO fields.

num_meta

Number of defined META fields.

num_samples

Number of samples read from header.

read_index(fname)

An index file speeds up navigation of the VcfParser instance across the variants of a VCF file. This binary file contains the start position of the line of each variant, linked to its chromosome and chromosomal position, which makes it possible to move the parser to a specific location given a chromosome and, optionally, a chromosomal position.

This method finds, reads and loads, as a VcfIndex, an index file linked to the current VcfParser instance. The index file is searched for under a default name generated from the name of the VCF file loaded in the current VcfParser.

Parameters:
  • fname – name of a properly formatted index file. The index file must be linked to the VCF data of the VcfParser instance: the header of an index file records the EOF (end of file) index of the VCF file it was built from, and this value must match the EOF index of the VCF file passed to the VcfParser.
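
The idea behind such an index (recording where each variant line starts so that the parser can jump to it) can be sketched with the standard library alone. This is not EggLib's binary index format: build_index is a hypothetical helper that keeps byte offsets in a dictionary, using 0-based positions as elsewhere in this module.

```python
import io

def build_index(stream):
    """Map (chromosome, 0-based position) to the byte offset of its line."""
    index = {}
    stream.seek(0)
    while True:
        offset = stream.tell()           # start of the line about to be read
        line = stream.readline()
        if not line:
            break
        if line.startswith(b"#"):
            continue                     # header lines are not indexed
        chrom, pos = line.split(b"\t", 2)[:2]
        index[(chrom.decode(), int(pos.decode()) - 1)] = offset
    return index

vcf = io.BytesIO(b"##fileformat=VCFv4.1\n"
                 b"#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n"
                 b"chr1\t5\t.\tA\tC\t.\t.\t.\n"
                 b"chr2\t9\t.\tG\tT\t.\t.\t.\n")
index = build_index(vcf)
vcf.seek(index[("chr2", 8)])             # jump straight to the variant
line = vcf.readline()
```

Seeking to a stored offset avoids re-reading every line located before the requested variant.
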
read_line(string)

Read one variant from a user-provided single line. The string should contain a single line of VCF-formatted data (no header). All field specifications and sample information should be consistent with the information contained in the header that was provided to this instance at creation time (whether it was read from a file or provided as a string).

Returns: The same as an iteration loop (see class description).
rewind()

Move the VcfParser instance back to the first variant of the loaded VCF file.

slider(size, step, miss_pv, fill=False, step_pb=False, start_pv=0, end_pv=None, flat_pv=False, size_pb=None, stop=None)

Create a sliding window over this VcfParser instance.

Returns:

A VcfWindow instance.

Parameters:
  • size – size of the sliding window, in number of sites.
  • step – increment of the sliding window, in number of sites, or in base pairs if the argument step_pb is True.
  • flat_pv – if True, ignore the individual level.
  • start_pv – if a subset of samples must be considered, index of the first sample to consider (by default, all samples are considered).
  • end_pv – if a subset of samples must be considered, index of the last sample to consider (by default, all samples are considered).
  • miss_pv – maximum number of missing alleles. If this threshold is exceeded, processing is stopped and get_missing() returns max_missing + 1. Only missing data in this data set are considered. The starting point of the sliding window depends on the current stream position of the VcfParser.
  • ploidy – the ploidy must be a strictly positive number.
  • fill – if True, the sliding window reads the first window immediately; otherwise no data are loaded into the VcfWindow instance until VcfWindow.next() is called.
  • step_pb – if True, the increment of the sliding window is expressed in base pairs; if False, in number of sites.
  • size_pb – size of the sliding window in base pairs.
  • stop – chromosomal position marking the end of the progression of the sliding window. Use stop=None to let the sliding window progress until the last variant.
Beware:

To use a VcfWindow object in a loop statement, keep the fill parameter set to False.
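
How size and step interact, when both are counted in sites, can be pictured with a generic generator. This is a sketch of the general sliding-window technique, not of VcfWindow itself: slide is a hypothetical helper, and yielding a shorter window at the end is an assumption, since the behaviour of EggLib at the end of the data is not described here.

```python
def slide(positions, size, step):
    """Yield successive windows of `size` sites, advancing by `step` sites."""
    if size < 1 or step < 1:
        raise ValueError("size and step must be strictly positive")
    start = 0
    while start < len(positions):
        # The last windows may hold fewer than `size` sites (assumption).
        yield positions[start:start + size]
        start += step
```

With step smaller than size the windows overlap; with step equal to size they tile the sites without overlap.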

unread()

Unread the last variant read by the VcfParser with the method next().

class egglib.io.Variant

Represent a single variant (one line from a VCF-formatted data file). Users cannot create instances of this class directly (instances are generated by VcfParser) and instances are not modifiable in principle (however, some attributes provide mutable objects, as mentioned).

Note

The AA (ancestral allele), AN (allele number), AC (allele count), and AF (allele frequency) INFO fields as well as the GT (deduced genotype) FORMAT field are automatically extracted if they are present in the file and if their definition matches the format specification (meaning that they were not re-defined with a different number/type) in the header. If present, they are available through the dedicated attributes AN, AA, AC, AF, GT, GT_ploidy and GT_phased. However, they are still available in the respective info and samples (sub-)dictionaries.

AA

Value of the AA info field (None if missing).

AC

Value of the AC info field, as a tuple (None if missing).

AF

Value of the AF info field, as a tuple (None if missing).

AN

Value of the AN info field (None if missing).

GT

Genotypes from GT fields (only if this format field is available), provided as a tuple of sub-tuples. The number of sub-tuples is equal to the number of samples (num_samples). The number of items within each sub-tuple is equal to the ploidy (GT_ploidy). These items are allele expressions (as found in alleles), or None (for missing values). This attribute is None if GT is not available.

GT_phased

Boolean indicating whether the genotype for each sample is phased (None if GT is not available).

GT_ploidy

Ploidy among genotypes (None if GT is not available).
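
The relation between a raw GT string and the values exposed by GT, GT_phased and GT_ploidy can be sketched for a single sample. This does not call EggLib: parse_gt is a hypothetical helper, and treating a genotype as phased when it contains no "/" separator is an assumption.

```python
import re

def parse_gt(gt, alleles):
    """Return (genotype, phased) for one GT string such as "0|1"."""
    codes = re.split(r"[|/]", gt)
    # Each code is an index into the alleles tuple; "." is a missing value.
    genotype = tuple(None if c == "." else alleles[int(c)] for c in codes)
    phased = "/" not in gt
    return genotype, phased
```

The length of the returned genotype tuple corresponds to the ploidy of the sample.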

GT_vcf

GT field as written in the VCF file.

ID

Tuple containing all IDs (even if just one or none).

alleles

Variant alleles (the first is the reference and is not guaranteed to be present in samples), as a tuple.

alt_type_breakend = 3

Alternate allele symbolizing a breakend (see VCF description for more details).

alt_type_default = 0

Explicit alternate allele (the string represents the nucleotide sequence of the allele).

alt_type_referred = 2

Alternate allele referring to a pre-defined allele (the string provides the ID of the allele).

alternate_types

Types of the alternate alleles, as a tuple. One value is provided for each alternate allele. The provided values are integers that should always be compared to the class attributes alt_type_default, alt_type_referred and alt_type_breakend, as in (for the type of the first alternate allele):

type_ = variant.alternate_types[0]
if type_ == variant.alt_type_default:
    allele = variant.allele(0)
chromosome

Chromosome name (None if missing).

failed_tests

Names of the filters that this variant failed, as a tuple (None if no filters were applied).

format_fields

IDs of the FORMAT fields available for each sample, as a frozenset (empty if no sample data is available).

info

Dictionary of INFO fields for this variant. Keys are the IDs of the INFO fields available for this variant, and values are always tuples of items. For flag INFO types, the value is always an empty tuple.

Note

This dict is mutable, which enables the user to modify the data contained in the instance. Note that this will modify the data contained in this Variant instance, although not in the related VcfParser instance.
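
The shape of this dictionary (tuples for every key, empty tuples for flags) can be illustrated by splitting an INFO column by hand. This sketch does not call EggLib: parse_info is a hypothetical helper, and it leaves values as strings, whereas conversion to the types declared in the header is not shown.

```python
def parse_info(field):
    """Parse the INFO column of one VCF line into a dict of tuples."""
    info = {}
    if field == ".":
        return info                      # "." means no INFO data
    for item in field.split(";"):
        key, eq, value = item.partition("=")
        # A key without "=" is a flag and maps to an empty tuple.
        info[key] = tuple(value.split(",")) if eq else ()
    return info
```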

num_alleles

Number of alleles (including the reference in all cases).

num_alternate

Number of alternate alleles. Equal to num_alleles minus 1.

num_samples

Number of samples (equivalent to len(Variant.samples)).

position

Position (as an index; first value is 0) (None if missing).

quality

Variant quality (None if missing).

samples

List of information available for each sample (empty list if no samples are defined). The list contains one dict for each sample: the keys of these dictionaries are FORMAT field IDs (the keys are always the same as the content of format_fields), and their values are tuples in all cases.

Note

This list and the dict instances it contains are all mutable, which enables the user to modify the data contained in the instance. Note that this will modify the data contained in this Variant instance, although not in the related VcfParser instance.

class egglib.io._vcf.VcfWindow

This class provides a sliding window over a VcfParser instance. It cannot be instantiated directly: to create a sliding window, call the method VcfParser.slider().

VcfWindow instances are iterable through the special method __iter__() (with support for the for statement). The iteration runs over the sites of the current sliding window, and each iteration returns a stats._site.Site instance. VcfWindow instances also support access by index (window[key]) through the method __getitem__(), which returns the site at the given index. Negative values are not allowed and raise a ValueError.

chromosome

Chromosome of the current VcfWindow instance.

configuration()

Print all variables used to configure the sliding window.

configure(size, step, miss_pv, fill=False, step_pb=False, start_pv=0, end_pv=None, flat_pv=False, size_pb=None, stop=None)

Configure a sliding window

Returns:

A VcfWindow instance.

Parameters:
  • size – size of the sliding window, in number of sites.
  • step – increment of the sliding window, in number of sites, or in base pairs if the argument step_pb is True.
  • flat_pv – if True, the individual level is ignored.
  • start_pv – if a subset of samples must be considered, index of the first sample to consider (by default, all samples are considered).
  • end_pv – if a subset of samples must be considered, index of the last sample to consider (by default, all samples are considered).
  • miss_pv – maximum number of missing alleles. If this limit is exceeded, processing is stopped and get_missing() returns max_missing + 1. Only missing data in this data set are considered. The starting point of the sliding window depends on the current position of the stream of the VcfParser passed as parameter.
  • ploidy – the ploidy must be a strictly positive number.
  • fill – if True, the first window is read immediately; otherwise, no data is loaded in the VcfWindow instance before VcfWindow.next() is called.
  • step_pb – if True, the increment of the sliding window is expressed in base pairs; if False, in number of sites.
  • size_pb – size of the sliding window in base pairs.
  • stop – chromosomal position marking the end of the progression of the sliding window. To let the sliding window progress until the last variant, use stop=None.
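
The interplay of size, step and stop can be sketched in plain Python. The generator below only illustrates a site-based window (the step_pb=False case) over a sorted list of variant positions. The function name, the exclusion of variants at or beyond stop, and the handling of the final, shorter window are all assumptions made for the example, not the documented behaviour of VcfWindow.

```python
def site_windows(positions, size, step, stop=None):
    """Yield lists of positions, `size` sites per window, moving by `step` sites.

    positions: sorted variant positions (hypothetical input).
    stop: position at which the windows stop progressing (None: no limit).
    """
    if size < 1 or step < 1:
        raise ValueError("size and step must be strictly positive")
    if stop is not None:
        # Assumption: variants at or beyond `stop` are not considered.
        positions = [p for p in positions if p < stop]
    start = 0
    while start < len(positions):
        yield positions[start:start + size]
        start += step


positions = [10, 25, 31, 48, 60, 72, 95]
windows = list(site_windows(positions, size=3, step=2))
# Span of each window in base pairs: last position minus first position.
spans = [w[-1] - w[0] for w in windows]
stopped = list(site_windows(positions, size=3, step=2, stop=50))
```

With step smaller than size, consecutive windows overlap; with step equal to size, they tile the sites without overlap.
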
end_position

End position of the current VcfWindow instance.

good

Checks if VcfWindow instance can continue to grow on the VcfParser instance

next()

Move the sliding window forward, from the initial position of the VcfParser up to the chromosomal position passed as the stop parameter of the configure() method, or until the last variant of the VcfParser instance or of the chromosome read by the sliding window. The progression depends on the window size and the increment. To start the sliding window at a specific position, call the VcfParser.goto() method of the VcfParser before calling its VcfParser.slider() method.

num_sites

Number of sites in the current VcfWindow instance.

size()

Get the size of the current sliding window in base pairs. This size is calculated as the difference between the chromosomal positions of the first and last variants of the VcfWindow instance.

start_position

Start position of the current VcfWindow instance.

Legacy parsers

egglib.io.from_clustal(string)

Imports a clustal-formatted alignment. The input format is the one generated and used by CLUSTALW (see http://web.mit.edu/meme_v4.9.0/doc/clustalw-format.html).

Parameters: string – input clustal-formatted sequence alignment.
Returns: A new Align instance.

Changed in version 3.0.0: Renamed (previous name was aln2fas()). Input argument is a string rather than a file.

egglib.io.from_staden(string, delete_consensus=True)

Import the output file of the GAP4 program of the Staden package.

The input file should have been generated from a contig alignment by the GAP4 contig editor, using the command “dump contig to file”. The sequence named CONSENSUS, if present, is automatically removed unless the option delete_consensus is False.

Staden’s default convention is followed:

  • - codes for an unknown base and is replaced by N.
  • * codes for an alignment gap and is replaced by -.
  • . indicates the same base as the consensus at that position.
  • White space represents missing data and is replaced by ?.
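
The conventions listed above amount to a per-character substitution. The sketch below applies them to one aligned sequence; convert_staden is a hypothetical helper written for this illustration (it is not the EggLib implementation), and it assumes that the consensus is given as a plain string of the same length that contains only ordinary bases.

```python
def convert_staden(sequence, consensus):
    """Apply the Staden character conventions to one aligned sequence.

    Hypothetical helper: both arguments are plain strings of equal length.
    """
    if len(sequence) != len(consensus):
        raise ValueError("sequence and consensus must have the same length")
    converted = []
    for base, ref in zip(sequence, consensus):
        if base == "-":
            converted.append("N")   # unknown base
        elif base == "*":
            converted.append("-")   # alignment gap
        elif base == ".":
            converted.append(ref)   # same as the consensus
        elif base == " ":
            converted.append("?")   # missing data
        else:
            converted.append(base)  # ordinary base, kept unchanged
    return "".join(converted)


result = convert_staden(" ..-*T.", "ACGTACG")
```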

New in version 2.0.1: Add argument delete_consensus.

Changed in version 2.1.0: Read from string or fname.

Changed in version 3.0.0: Renamed to from_staden(). Only string input is supported now.

egglib.io.from_genalys(string)

Import a Genalys-formatted sequence alignment. This function imports data generated through the Save SNPs option of Genalys 2.8.

Parameters: string – input data as a Genalys-formatted string.
Returns: An Align instance.

Changed in version 3.0.0: Renamed to from_genalys(). Only string input is supported now.

egglib.io.get_fgenesh(string, locus='locus')

Imports fgenesh output.

Parameters:
  • string – a string containing fgenesh output.
  • locus – locus name.
Returns: A list of gene and CDS features represented by dictionaries. Note that 5’ partial features might not be in the appropriate frame and that it can be necessary to add a codon_start qualifier.

Changed in version 3.0.0: Input as string. Added locus argument.

class egglib.io.GenBank(fname=None, string=None)

This class represents a GenBank-formatted DNA sequence record.

Parameters:
  • fname – input file name.
  • string – GenBank-formatted string.

Only one of the two arguments fname and string can be non-None. If both are None, the constructor generates an empty instance with sequence of length 0. If fname is non-None, a GenBank record is read from the file with this name. If string is non-None, a GenBank record is read directly from this string. The following variables are read from the parsed input if present: accession, definition, title, version, GI, keywords, source, references (which is a list), locus and others. Their default value is None except for references and others for which default is an empty list. source is a (description, species, taxonomy) tuple. Each of references is a (header, raw reference) tuple and each of others is a (key, raw) tuple.

In addition to methods documented below, the following operations are supported for gb if it is a GenBank instance:

Expression        Action
len(gb)           Length of the sequence attached to this record
str(gb)           GenBank representation of the record
for feat in gb    Iterate over GenBankFeature instances of this record

add_feature(feature)

Add a feature to the instance. The argument feature must be a well-formed GenBankFeature instance.

extract(from_pos, to_pos)

Return a new GenBank instance representing a subset of the current instance, from position from_pos to to_pos. All (and only) features that are completely included in the specified range are exported.
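
The selection rule ("all and only features completely included in the range") can be sketched as a simple filter. The tuple layout, the function name and the inclusive treatment of both bounds are assumptions made for this illustration; the sketch does not reproduce how extract() builds the new instance.

```python
def features_in_range(features, from_pos, to_pos):
    """Keep only features lying entirely within [from_pos, to_pos].

    features: list of (name, start, stop) tuples (hypothetical layout).
    Assumption: both bounds are inclusive.
    """
    return [
        (name, start, stop)
        for name, start, stop in features
        if start >= from_pos and stop <= to_pos
    ]


features = [("geneA", 5, 40), ("geneB", 30, 120), ("geneC", 60, 90)]
kept = features_in_range(features, 20, 100)
```

Here geneA and geneB both overlap the range but extend beyond one of its bounds, so only geneC is kept.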

number_of_features()

Give the number of features contained in the instance.

rc()

Reverse-complement the instance (in place). All feature positions and the sequence are reversed and applied to the complementary strand. The features are sorted by increasing start position (after reversing). This method should be applied only to genuine nucleotide sequences.
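
A minimal sketch of what reverse-complementing implies for a sequence and its feature coordinates is given below. It is plain Python, not the EggLib implementation: the helper names are hypothetical, and positions are assumed to be 1-based and inclusive, as in the GenBank flat-file format.

```python
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(sequence):
    """Reverse-complement a nucleotide string (other symbols are kept as is)."""
    return sequence.translate(COMPLEMENT)[::-1]


def reverse_segment(first, last, length):
    """Map an inclusive (first, last) segment onto the reversed sequence.

    Assumption: positions are 1-based, as in the GenBank flat-file format.
    """
    return (length - last + 1, length - first + 1)


sequence = "AACCGGGT"
rc_sequence = reverse_complement(sequence)
segments = [(1, 2), (5, 7)]
# Reverse each segment, then sort by increasing start position.
rc_segments = sorted(
    reverse_segment(first, last, len(sequence)) for first, last in segments
)
```

The segment covering GGG (positions 5 to 7) ends up covering CCC (positions 2 to 4) in the reversed sequence, which is why the segments need to be sorted again afterwards.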

sequence

Sequence string (can be modified). Note that changing the record’s sequence might invalidate the features (meaning that setting an invalid sequence might cause the features to point to incorrect or out-of-bounds regions of the sequence).

write(fname)

Create a file named fname and write the formatted record to it.

write_stream(stream)

Write the content of the instance as a GenBank-formatted string to the passed file (or file-compatible) stream.

class egglib.io.GenBankFeature(parent)

Instances of this class represent features associated with a GenBank instance. They should not be instantiated or used separately from a GenBank instance. The constructor creates an empty instance (although a GenBank instance must be passed as parent) and either update() or parse() must be used subsequently.

Parameters: parent – a GenBank instance to which the feature should be attached.

In addition to methods documented below, the following operations are supported for feat if it is a GenBankFeature instance:

Expression    Action
str(feat)     GenBank representation of the feature

add_qualifier(key, value)

Add a qualifier to the instance’s qualifiers.

copy(genbank)

Return a copy of the current instance, connected to the GenBank instance genbank.

get_sequence()

Return the string corresponding to this feature. If the positions pass beyond the end of the parent’s sequence, a RuntimeError (and not an IndexError) is raised.

get_start()

First position of the first (or unique) segment, in such a way that get_start() is always smaller than get_stop().

get_stop()

Last position of the last (or unique) segment, in such a way that get_start() is always smaller than get_stop().

get_type()

Return the type string of the instance.

parse(string)

Update feature information from information read in a GenBank-formatted string.

qualifiers()

Return a dictionary with all qualifier values. This method cannot be used to change data within the instance: changes to the returned dictionary do not affect the data it contains.

Changed in version 2.1.0: Meaning changed.

rc(length=None)

Reverse-complement the feature: apply it to the complement strand and reverse positions counting from the end. The length argument specifies the length of the complete sequence and is usually not required.

shift(offset)

Shift all positions according to the (positive or negative) argument.

update(feat_type, location, **qualifiers)

Update feature information.

Parameters:
  • feat_type – a string identifying the feature type (such as "gene", "CDS", "misc_feature", etc.). All strings are accepted.
  • location – a GenBankFeatureLocation instance giving the feature’s location.
  • qualifiers – other qualifiers must be passed as keyword arguments. It is not allowed to use "type" as a qualifier keyword.
class egglib.io.GenBankFeatureLocation(string=None)

Hold the location of a GenBank feature. Supports various forms of location as defined in the GenBank format specification. The constructor contains a parser working from a GenBank-formatted string. By default, features are on the forward strand and segmented features are ranges (not orders).

In addition to methods documented below, the following operations are supported for loc if it is a GenBankFeatureLocation instance:

Expression                  Action
len(loc)                    Number of segments
loc[index]                  Return the (first, last) tuple for the corresponding segment
for (first, last) in loc    Iterate over segments
str(loc)                    Generate a GenBank representation

GenBankFeatureLocation supports iteration over (first, last) segments regardless of their type: for a single-base segment at position position, the tuple (position, position) is returned, and similar 2-item tuples are returned for the other types of segment.

add_base_choice(first, last, left_partial=False, right_partial=False)

Add a segment corresponding to a single base chosen within a base range. If no segments were entered previously, set the unique segment location. first and last must be integers. The feature will be set between the first and last positions, including both limits. If the feature is intended to be placed on the complement strand between positions, say, 1127 and 1482, one must use add_base_choice(1127,1482) in combination with set_complement(). All entered positions must be larger than any positions entered previously and last must be strictly larger than first. left_partial and/or right_partial must be set to True if, respectively, the real start of the segment lies 5’ of first and/or the real end of the segment lies beyond last (relative to the forward strand and consistently with the numbering system).

add_base_range(first, last, left_partial=False, right_partial=False)

Add a base range to the feature. If no segments were entered previously, set the unique segment location. first and last must be integers. The feature will be set between the first and last positions, including both limits. If the feature is intended to be placed on the complement strand between positions, say, 1127 and 1482, one must use add_base_range(1127,1482) in combination with set_complement(). All entered positions must be larger than any positions entered previously and last must be at least as large as first (they can be equal). left_partial and/or right_partial must be set to True if, respectively, the real start of the segment lies 5’ of first and/or the real end of the segment lies beyond last (relative to the forward strand and consistently with the numbering system).

add_between_base(position)

Add a segment lying between two consecutive bases. If no segments were entered previously, set the unique segment location. position must be an integer. The feature will be set between position and position + 1. If the feature is intended to be placed on the complement strand between positions, say, 1127 and 1128, one must use add_between_base(1127) in combination with set_complement(). All entered positions must be larger than any positions entered previously.

add_single_base(position)

Add a single-base segment to the feature. If no segments were entered previously, set the unique segment location. position must be an integer. All entered positions must be larger than any positions entered previously.
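
For reference, the four kinds of segment correspond to distinct notations in the GenBank feature table format: a single base is written 467, a base range 340..565, a base choice 102.110 and a site between two bases 123^124, with < and > marking partial ends. The sketch below formats such strings in plain Python. The function names and the segment labels are hypothetical, the mapping of ranges to join() and of orders to order() is an assumption about how these notions correspond, and the code is independent of how GenBankFeatureLocation produces its own representation.

```python
def format_segment(kind, first, last=None, left_partial=False, right_partial=False):
    """Format one segment using GenBank feature-table notation.

    kind: "single", "range", "choice" or "between" (hypothetical labels).
    """
    left = "<" if left_partial else ""
    right = ">" if right_partial else ""
    if kind == "single":
        return str(first)
    if kind == "between":
        return f"{first}^{first + 1}"
    separator = {"range": "..", "choice": "."}[kind]
    return f"{left}{first}{separator}{right}{last}"


def format_location(segments, complement=False, as_range=True):
    """Combine segments with join() or order(), optionally on the complement strand."""
    parts = [format_segment(*segment) for segment in segments]
    if len(parts) == 1:
        text = parts[0]
    else:
        text = ("join" if as_range else "order") + "(" + ",".join(parts) + ")"
    return f"complement({text})" if complement else text


simple = format_location([("range", 1127, 1482)], complement=True)
multi = format_location([("range", 10, 50, True), ("single", 77), ("between", 90)])
```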

as_order()

Define the feature as an order instead of a range.

as_range()

Define the feature as a range, which is the default.

copy()

Return a deep copy of the current instance.

is_complement()

True if the feature is on the complement strand.

is_range()

True if the feature is a range (the default), False if it is an order.

rc(length)

Reverse the feature positions: positions are modified to be counted from the end. The length of the complete sequence must be passed.

set_complement()

Place the feature on the complement strand.

set_forward()

Place the feature on the forward (not complement) strand, which is the default.

shift(offset)

Shift all positions according to the (positive or negative) argument.