EggLib

Diversity statistics

Site

class egglib.stats.Site

Store allelic and genotypic data at a single site. Allelic values are represented by integers. In case of complex, such as structural, variation, one can use the index of variants as allelic values, and some objects can lack allelic value information (a number of alleles of 0 will be reported).

Loading data is incremental: processed new data will be added to previously loaded data (requiring that ploidy matches). To avoid this, use reset().

alleles()

Generate the list of alleles

as_list(flat=False, skip_outgroup=False)

Generate one or two lists containing data from the instance.

Parameters:
  • flat – flatten return lists and skip the individual level (by default, return lists are a list of tuples which represent individuals).
  • skip_outgroup – return only data for the ingroup (by default, return two lists for respectively ingroup and outgroup data).
Returns:

Lists of allelic indexes (or None for missing data). The return value is either a single list or two lists (depending on the value of skip_outgroup), and the list or lists contain tuples representing individuals unless flat is True.
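The difference between the nested (per-individual) and flat layouts can be illustrated without EggLib. The following is a minimal plain-Python sketch; flatten_genotypes is a hypothetical helper introduced here for illustration and is not part of EggLib.

```python
def flatten_genotypes(individuals):
    """Turn a list of per-individual tuples into a flat list of alleles."""
    return [allele for genotype in individuals for allele in genotype]


# Three diploid individuals, in the nested layout (flat=False).
# None marks missing data.
ingroup = [(0, 0), (0, 1), (1, None)]

# The same data with the individual level skipped (flat=True layout).
flat_ingroup = flatten_genotypes(ingroup)
print(flat_ingroup)  # [0, 0, 0, 1, 1, None]
```

The nested layout keeps the ploidy implicit in the tuple length, whereas the flat layout loses it, which is why flat data are handled as haploid.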

ingroup(indiv[, chrom])

Get a given ingroup genotype or allele, as allele indexes.

Parameters:
  • indiv – individual index.
  • chrom – chromosome index.

If chrom is omitted, return the genotype as a tuple. Otherwise, return the specific allele. Setting chrom to None is the same as omitting it. chrom can be safely omitted for haploid data.

no

Total number of samples (alleles) in the outgroup.

ns

Total number of samples (alleles) in the ingroup.

num_alleles

Number of alleles.

num_ingroup

Total number of individuals in the ingroup.

num_missing

Number of missing data (expressed in number of samples).

num_missing_ingroup

Number of missing data (expressed in number of samples) in the ingroup only.

num_missing_outgroup

Number of missing data (expressed in number of samples) in the outgroup only.

num_outgroup

Total number of individuals in the outgroup.

outgroup(indiv, chrom=None)

Get a given outgroup genotype or allele, as allele indexes.

Parameters:
  • indiv – individual index.
  • chrom – chromosome index.

If chrom is omitted, return the genotype as a tuple. Otherwise, return the specific allele. Setting chrom to None is the same as omitting it. chrom can be safely omitted for haploid data.

ploidy

Current value of the ploidy (to change it, one needs to reset the instance).

process_align(align, index, filtr, struct=None, reset=True)

Import data from the provided Align to the data currently held by the instance. Arguments are identical to those of the function site_from_align(), except reset.

Parameters:reset – if True, reset the instance as if newly created. If False, append the data to current data, if any.

If reset is False and this instance currently holds data, the ploidy defined by the struct argument is required to match the current value. If struct is None, the implied ploidy is 1 and is still required to match.

process_list(ingroup, outgroup, flat=False, reset=True)

Import data from the provided lists to the data currently held by the instance. Arguments are identical to those of the function site_from_list(), except reset.

Parameters:reset – if True, reset the instance as if newly created. If False, append the data to current data, if any.

If reset is False and this instance currently holds data, the ploidy defined by the input data is required to match the current value. If flat is True, the implied ploidy is 1 and is still required to match.
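The interplay between reset and the ploidy requirement can be sketched with a toy container. This is plain Python under stated assumptions: MiniSite is a hypothetical class written for illustration only, it handles the ingroup alone, and it is not how EggLib implements Site.

```python
class MiniSite:
    """Toy container (not EggLib's Site) mimicking incremental loading."""

    def __init__(self):
        self.reset()

    def reset(self):
        """Clear all data, including the current ploidy."""
        self.ploidy = None
        self.individuals = []

    def process_list(self, ingroup, flat=False, reset=True):
        if reset:
            self.reset()
        # With flat=True, each allele is loaded as one haploid individual.
        if flat:
            genotypes = [(allele,) for allele in ingroup]
        else:
            genotypes = [tuple(genotype) for genotype in ingroup]
        # Check the ploidy before modifying the instance.
        ploidies = {len(genotype) for genotype in genotypes}
        if self.ploidy is not None:
            ploidies.add(self.ploidy)
        if len(ploidies) > 1:
            raise ValueError("ploidy does not match")
        if ploidies:
            self.ploidy = ploidies.pop()
        self.individuals.extend(genotypes)


site = MiniSite()
site.process_list([(0, 1), (1, 1)])            # fresh load, ploidy 2
site.process_list([(0, 0)], reset=False)       # appended, ploidy still 2
print(site.ploidy, len(site.individuals))
```

Appending flat (haploid) data to this diploid instance with reset=False would raise an error, while the default reset=True silently starts over.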

process_vcf(vcf, start=0, stop=None, flat=False, reset=True)

Import data from the provided VCF parser to the data currently held by the instance. Arguments are identical to those of the function site_from_vcf(), except reset.

Parameters:reset – if True, reset the instance as if newly created. If False, append the data to current data, if any.

If reset is False and this instance currently holds data, the ploidy defined by the input data is required to match the current value. If flat is True, the implied ploidy is 1 and is still required to match.

Warning

VCF genotypes are exported as allele indexes (0 for the reference allele). This function treats them as allele values, meaning that they might be shifted (0 is the first allele found, which is not necessarily the reference allele). Be aware of this fact when appending VCF data to data from other sources (including other VCF files) in the same site data, or when processing individual alleles.
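The shift described in this warning can be reproduced in a few lines of plain Python. The sketch assumes, as the warning states, that index 0 goes to the first allele found; reindex_by_first_occurrence is a hypothetical helper and not an EggLib function.

```python
def reindex_by_first_occurrence(values):
    """Map allele values to indexes in their order of first appearance."""
    index_of = {}
    indexes = []
    for value in values:
        if value is None:          # missing data stay missing
            indexes.append(None)
            continue
        if value not in index_of:
            index_of[value] = len(index_of)
        indexes.append(index_of[value])
    return indexes, index_of


# VCF-style allele numbers for six samples (0 is the reference allele).
vcf_alleles = [1, 1, 0, 1, 0, 2]
indexes, mapping = reindex_by_first_occurrence(vcf_alleles)
print(indexes)   # [0, 0, 1, 0, 1, 2]
print(mapping)   # {1: 0, 0: 1, 2: 2}
```

Here the first alternate allele ends up with index 0 and the reference with index 1, so indexes obtained from two different variants (or two files) are not directly comparable.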

reset()

Clear all data from the instance.

egglib.stats.site_from_align(align, index, filtr, struct=None)

Import allelic and genotypic data from a position of the provided Align instance. The struct argument makes it possible to process only a subset of the samples, and also controls the genotypic structure and the ploidy. This means that the same Structure must be used again to further process the resulting Site instance.

Parameters:
  • align – an Align instance.
  • index – the index of a valid (not out of range) position of the alignment.
  • filtr – a Filter instance providing the list of alleles to be considered as valid or missing data.
  • struct – a Structure instance that will be used to group samples in individuals. Note that if the Structure describes only a subset of the samples of the alignment, only those samples will be included in the resulting Site instance. By default, all samples are processed as haploid individuals.
Returns:

A new Site instance. The numbers of ingroup and outgroup items of this instance are defined by the Structure instance passed as struct and can be smaller than the number of samples of the original alignment.

egglib.stats.site_from_list(ingroup, outgroup, flat=False)

Import allelic and genotypic data from provided lists. Input data are equivalent to the return value of Site.as_list(). ingroup and outgroup provide data for the ingroup and outgroup respectively. Either can be replaced by None, which is equivalent to an empty list. They are supposed to be lists of tuples if flat is False, and lists of integers otherwise, but lists and tuples can be replaced by other sequence types.

Parameters:
  • ingroup – if flat is False (default): a list of tuples (one tuple per individual) which contain a constant number of allelic values. All allelic values are taken as valid alleles except None, which represents missing data. Alleles can be any integers. The number of items of all individuals of both ingroup and outgroup is required to be constant. If flat is True: a list of integers (None for missing data) giving allelic values of all samples.
  • outgroup – same format as ingroup.
  • flat – determines if genotypic (default) or allelic data are provided.
egglib.stats.site_from_vcf(vcf, start=0, stop=None, flat=False)

Import allelic and genotypic data from a VCF parser. The VCF parser must have processed a variant and the variant is required to have genotypic data available as the GT format field. An exception is raised otherwise.

Warning

VCF genotypes are exported as allele indexes (0 for the reference allele). This function treats them as allele values, meaning that they might be shifted (0 is the first allele found, which is not necessarily the reference allele). Be aware of this fact when appending VCF data to data from other sources (including other VCF files) in the same site data, or when processing individual alleles.

Parameters:
  • vcf – a VcfParser instance containing data. There must be at least one sample and the GT format field must be available. It is not required to extract variant data manually.
  • start – index of the first sample to process. Index is required to be within available bounds (it must be at least 0 and smaller than the number of samples in the VCF data). Note that in a VCF file a sample corresponds to a genotype.
  • stop – sample index at which processing must be stopped (this sample is not processed). Index is required to be within available bounds (it must be at least equal to start and not larger than the number of samples in the VCF data). Note that in a VCF file, a sample corresponds to a genotype.
  • flat – ignore individuals and load data as haploid genotypes. By default, genotypes are imported based on the ploidy defined in the data.

Freq

class egglib.stats.Freq

Hold allelic and genotypic frequencies for a single site. Freq instances can be created using the three functions freq_from_site(), freq_from_list(), and freq_from_vcf(), or using the default constructor. Once created, by whichever means, instances can be re-used (which is faster) through their methods process_site(), process_list(), and process_vcf().

freq_allele(allele, cpt=0, idx=None)

Get the frequency of an allele.

Parameters:
  • allele – allele index.
  • cpt – compartment identifier.
  • idx – compartment index (required for clusters and populations, ignored otherwise).
freq_genotype(genotype, cpt=0, idx=None)

Get the frequency of a genotype.

Parameters:
  • genotype – genotype index.
  • cpt – compartment identifier.
  • idx – compartment index (required for clusters and populations, ignored otherwise).
genotype(idx)

Get a genotype, as a tuple of allele indexes.

nieff(cpt=0, idx=None)

Get the number of individuals within a given compartment. In the haploid case, this method is identical to nseff().

Parameters:
  • cpt – compartment identifier.
  • idx – compartment index (required for clusters and populations, ignored otherwise).
nseff(cpt=0, idx=None)

Get the number of samples within a given compartment.

Parameters:
  • cpt – compartment identifier.
  • idx – compartment index (required for clusters and populations, ignored otherwise).
num_alleles

Number of alleles in the whole site.

num_clusters

Number of clusters.

num_genotypes

Number of genotypes in the whole site.

num_populations

Number of populations.

ploidy

Ploidy

process_list(ingroup, outgroup, geno_list=None)

Reset the instance as if it had been created using freq_from_list(). Arguments are identical to this function.

process_site(site, struct=None, consider_genotype_ordering=True)

Reset the instance as if it had been created using freq_from_site(). Arguments are identical to this function.

process_vcf(vcf)

Reset the instance as if it had been created using freq_from_vcf(). Argument is identical to this function.

egglib.stats.freq_from_site(site, struct=None, consider_genotype_ordering=False)

Create a new Freq instance based on data of the provided site.

Parameters:
  • site – a Site instance.
  • struct

    this argument can be:

    • A Structure instance with ploidy equal to 1 (since the individual level must already be implemented in the provided Site), allowing to select a subset of samples and/or define the structure. This is the only way to specify a hierarchical (with clusters) structure.
    • A list (or compatible) of at least one integer providing the sample size (as numbers of individuals) of all populations, assuming the individuals are organized in the corresponding order (all individuals of a given population grouped together).
    • None (no structure, all individuals placed in a single population).
  • consider_genotype_ordering – consider the order of alleles in a genotype as significant.
Returns:

A new Freq instance.

egglib.stats.freq_from_list(ingroup, outgroup, geno_list=None)

Create a new Freq instance based on already computed frequency data.

Parameters:
  • ingroup – a nested list of genotype or allele frequencies (based on the value of geno_list). The list must have three levels: (i) clusters, (ii) populations, and (iii) alleles/genotypes. If clusters/populations are not defined, a single-item list should be provided for this level. Empty lists are also allowed. The frequencies must be null or positive integers. The number of frequencies per population is required to be constant for all populations (corresponding to the number of alleles or genotypes). If geno_list is defined, data must be the frequencies of the provided genotypes, in the same order. Otherwise, data must be allelic frequencies, in the order of increasing allele index (in the latter case, data will be loaded as haploid).
  • outgroup – a list of allele/genotype frequencies for the outgroup. The number of alleles or genotypes is also required to match. If None, no outgroup samples (equivalent to a list of zeros).
  • geno_list – list of genotypes. Genotypes must be provided as tuples or lists. Their length is equal to the ploidy and is required to be at least two and constant for all genotypes. Order of alleles within genotypes is significant. If None, data are loaded as haploid alleles.
Returns:

A new Freq instance.

Note that it is required that there is at least one cluster and one population.
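The three-level layout expected for ingroup (clusters, then populations, then alleles or genotypes) is easy to get wrong. The plain-Python sketch below builds such a list for two alleles and checks the constraints listed above; total_allele_counts is a hypothetical helper and the numbers are invented for illustration.

```python
def total_allele_counts(ingroup, outgroup=None):
    """Validate a clusters > populations > counts list and sum the ingroup."""
    sizes = {len(pop) for cluster in ingroup for pop in cluster}
    if outgroup is not None:
        sizes.add(len(outgroup))
    if len(sizes) > 1:
        raise ValueError("inconsistent number of alleles/genotypes")
    totals = [0] * (sizes.pop() if sizes else 0)
    for cluster in ingroup:
        for pop in cluster:
            for i, count in enumerate(pop):
                if count < 0:
                    raise ValueError("frequencies must be null or positive")
                totals[i] += count
    return totals


ingroup = [
    [[10, 2], [7, 5]],   # cluster 0: two populations
    [[0, 12]],           # cluster 1: one population
]
outgroup = [1, 0]        # same number of alleles as the ingroup

print(total_allele_counts(ingroup, outgroup))  # [17, 19]
```

Without clusters or populations, the same data would be wrapped in single-item lists, for example [[[17, 19]]].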

egglib.stats.freq_from_vcf(vcf)

Import allelic frequencies from a VCF parser. The VCF parser must have processed a variant and the variant is required to have frequency data available as the AC format field along with the AN field. An exception is raised otherwise.

This function only imports haploid allele frequencies in the ingroup (without structure). The first allele is the reference, by construction, then all the alternate alleles in the order in which they are provided in the VCF file.

Parameters:vcf – a VcfParser instance containing data. There must be at least one sample and the AN/AC format fields must be available. It is not required to extract variant data manually.

Diversity for gene families

egglib.stats.paralog_pi(align, struct_p, struct_i, max_missing=0, filtr=None)

Compute diversity statistics for a gene family (Innan 2003). An estimate of genetic diversity is provided for every paralog and for every pair of paralogs, provided that enough non-missing data is available (at least 2 samples are required).

Parameters:
  • align – an Align containing the sequences of all available paralogs for all samples. The outgroup is ignored.
  • struct_p – a Structure providing the organisation of sequences in paralogs. This structure must have a ploidy of 1 (no individual structure). Clusters, if defined, are ignored. The sequences of all individuals for a given paralog should be grouped together in that structure. There might be a different number of sequences per paralog due to missing data.
  • struct_i – a Structure providing the organisation of sequences in individuals. This structure must have a ploidy of 1 (no individual structure). Clusters, if defined, are ignored. The sequences of all paralogs for a given individual should be grouped together in that structure. There might be a different number of sequences per individual due to missing data.
  • max_missing – maximum proportion of missing data (if there are more missing data at a site, the site is ignored).
  • filtr – a Filter instance providing the list of exploitable allelic values and those that should be considered as missing. By default, use filter_dna.
Returns:

A new ParalogPi instance which provides methods to access the number of used sites and the diversity for each paralog/paralog pair.

class egglib.stats.ParalogPi

Class computing Innan’s within- and between-paralog diversity statistics. See paralog_pi() for more details. This class can be used directly (1) to analyse data more efficiently (by reusing the same instance), (2) to combine data from different alignments, or (3) to pass individual sites. Call setup() first.

Pib(i, j)

Get between-paralog diversity for paralogs i and j.

Piw(i)

Get within-paralog diversity for paralog i.

num_sites([i[, j]])

Number of sites with any data (without arguments), with data for paralog i (if only i specified), or with data for the pair of paralogs i and j (if both specified).

process_align(aln, max_missing=0, filtr=None)

Process an alignment matching the structure passed to setup(). Diversity estimates are incremented (no reset).

Parameters:
  • aln – an Align instance.
  • max_missing – maximum proportion of missing data.
  • filtr – a Filter instance.
process_site(site)

Process a site matching the structure passed to setup(). Diversity estimates are incremented (no reset).

Parameters:site – a Site instance.
setup(struct_p, struct_i)

Specify the structure in paralog and individuals. The two arguments are Structure instances as described for paralog_pi(). Only this method resets the instance.

haplotypes()

egglib.stats.haplotypes(sites, impute_threshold=0, struct=None, filtr=None, max_missing=0.0, consider_outgroup_missing=False, dest=None, multiple=False)

Identify haplotypes from sites provided as either an Align instance or a list of Site instances, and return data as a single Site instance containing one sample for each sample of the original data. Alleles in the returned site represent all identified haplotypes (or missing data when the haplotype could not be derived).

Note

There must be at least one site with at least two alleles (overall, including the outgroup), otherwise the produced site only contains missing data.

Parameters:
  • sites – an Align instance, or a list of Site instances.
  • impute_threshold – by default, all samples with at least one occurrence of missing data will be treated as missing data. If this argument is more than 0, the provided value will be used as the maximum number of missing data. All samples with as many or fewer missing data will be processed to determine which extant haplotype they might belong to (that is, to which they are identical save for missing data). If there is only one such haplotype, the corresponding samples will be treated as a repetition of this haplotype. This option will never allow detecting new haplotypes. Only small values of this option make sense.
  • struct – a Structure instance defining the samples to process. Only valid if sites is an Align. The population and cluster structures are not used. If the ploidy is larger than 1, the individuals are used, and sites are assumed to be phased.
  • filtr – a Filter instance controlling what allelic values are supported. By default, assume DNA sequences. Only allowed if an Align is passed.
  • max_missing – maximum proportion of missing data to process a site. Only considered if an Align is passed.
  • consider_outgroup_missing – if True, outgroup samples are included in the count for missing data (by default, outgroup samples are not considered). Only considered if an Align is passed.
  • dest – a Site instance that will be reset and in which data will be placed. If specified, this function returns nothing. By default, the function returns a new Site instance.
  • multiple – allow sites with more than two alleles in the ingroup.
Returns:

A Site instance (if dest is None) or None (otherwise).
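The core idea, one haplotype allele per sample derived from its alleles over all sites, can be sketched in plain Python. This is a simplified illustration that corresponds to the default impute_threshold=0 (any missing data makes the sample missing) and ignores filters, ploidy and the outgroup. identify_haplotypes is a hypothetical helper, and the numbering of haplotypes by order of first appearance is an assumption.

```python
def identify_haplotypes(sites):
    """Return one haplotype index per sample (None if any data is missing).

    sites is a list of sites, each being a list with one allele per sample.
    """
    known = {}      # haplotype (tuple of alleles) -> haplotype index
    result = []
    for sample in zip(*sites):       # transpose: one tuple per sample
        if None in sample:
            result.append(None)
        else:
            # setdefault assigns the next free index to a new haplotype
            result.append(known.setdefault(sample, len(known)))
    return result


sites = [
    [0, 0, 1, 1, None],   # site 1, five samples
    [0, 1, 1, 1, 0],      # site 2
    [1, 1, 0, 0, 0],      # site 3
]
print(identify_haplotypes(sites))  # [0, 1, 2, 2, None]
```

Samples 3 and 4 carry the same alleles at all three sites and therefore share haplotype 2, while the last sample is missing because of a single missing allele.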

CodingDiversity

class egglib.stats.CodingDiversity(*args, **kwargs)

This class processes alignments with a reading frame specification in order to detect synonymous and non-synonymous variable positions. It provides basic statistics, but it can also filter data to let the user compute all other statistics on synonymous-only, or non-synonymous-only variation (e.g. π or D).

The constructor takes optional arguments. By default, build an empty instance. If arguments are passed, they must match the signature of process() that will be called.

The method process() does all the work. Once it is called, data are available through the different instance attributes, and it is possible to generate alignments containing only codon sites with either one synonymous or one non-synonymous mutation. It is also possible to iterate over sites of both kinds. In both cases, the generated data contain only codons, where each codon is represented by a single integer (see tools.int2codon()). These data can be analysed in the module stats using the pre-defined Filter instance stats.filter_codon.

Note that, currently, the outgroup is ignored.

iter_NS()

Iterate over non-synonymous sites. Mostly similar to the method iter_S() (see important warning about the fact that returned values are actually always the same SiteFrequency instance that is updated at each iteration round).

iter_S()

Iterate over synonymous sites. Proposed as a more performant alternative to mk_align_S(). This method returns a generator that can be used in expressions such as:

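import egglib

# align (an Align) and frame (a tools.ReadingFrame) are assumed to exist already
cs = egglib.stats.ComputeStats()
cs.add_stat('thetaW')
thetaW = 0.0
cdiv = egglib.stats.CodingDiversity(align, frame)
for site in cdiv.iter_S():
    # use site only inside the loop body: the same object is updated at each round
    stats = cs.process_site(site)
    thetaW += stats['thetaW']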

Warning

This method repeatedly returns the same SiteFrequency, with information updated at each round. The reason for this is performance. Never keep a reference to the iterator variable (that’s site in the example above) outside the loop body.
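The pitfall behind this warning is generic to any generator that recycles one mutable object. The sketch below reproduces it in plain Python with a toy generator; iter_reused is a hypothetical stand-in for iter_S() and lists stand in for the site objects.

```python
def iter_reused(rows):
    """Yield the same list object at each round, refilled in place."""
    buffer = []
    for row in rows:
        buffer[:] = row     # update the shared object instead of creating one
        yield buffer


rows = [[0, 0, 1], [1, 1, 1], [0, 1, 0]]

# Wrong: every stored reference points to the same object,
# which holds the data of the last round only.
kept = [site for site in iter_reused(rows)]
print(kept)      # [[0, 1, 0], [0, 1, 0], [0, 1, 0]]

# Right: extract (here, copy) what is needed inside the loop body.
copied = [list(site) for site in iter_reused(rows)]
print(copied)    # [[0, 0, 1], [1, 1, 1], [0, 1, 0]]
```

In practice this means computing the statistics of interest inside the loop and storing only the resulting values.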

mk_align_NS()

Create an Align instance with only non-synonymous codon sites. Mostly similar to the method mk_align_S().

mk_align_S()

Create an Align instance with only synonymous codon sites. The alignment contains the same number of ingroup and outgroup samples as the original alignment, and a number of sites equal to num_pol_S. Note that the returned alignment does not have group labels. These data can be analysed in the module stats using the pre-defined Filter instance stats.filter_codon.

num_codons_eff

Number of codon sites that have been analysed (like num_codons_tot but excluding sites rejected because of missing data).

num_codons_stop

Number of codon sites (among those that have been analysed) with at least one stop codon in them.

num_codons_tot

Total number of considered codon sites. Only complete codons have been considered, but this value includes codons that have been rejected because of missing data.

num_pol_NS

Number of polymorphic codon sites with only one non-synonymous mutation.

num_pol_S

Number of polymorphic codon sites with only one synonymous mutation.

num_pol_multi

Number of polymorphic codons with more than one mutation.

num_pol_single

Number of polymorphic codons with only one mutation. If the option allow_multiple of process() was set to True, codon sites with two codon alleles that differ at more than one of the three codon positions are considered as if only one mutation occurred.

num_sites_NS

Estimated number of non-synonymous sites. Note that the total number of sites per codon is always 3.

num_sites_S

Estimated number of synonymous sites. Note that the total number of sites per codon is always 3.

process(align, frame=None, struct=None, code=1, max_missing=0.0, consider_outgroup_missing=False, skipstop=True, allow_multiple=False, raise_stop=False, allow_alt=False)

Process an alignment. If this instance already had data in memory, they will all be erased.

Parameters:
  • align – an Align instance containing the coding sequence to process. All sequences must be proper nucleotide sequences (only upper case characters). Standard ambiguity characters and alignment gaps are supported (but are treated as missing data) and unrecognized characters cause an error. DNA sequences (A, C, G and T as exploitable characters) are expected, as encoded using the ASCII table (which is the default for data imported from fasta files).
  • frame – a tools.ReadingFrame instance specifying the position of coding sequence fragments (that is, exons without UTR) and the reading frame, or None. It is legal to pass a reading frame specification with either terminal or internal gaps (only full codons will be considered). If None, assume that coding sequences have been provided (the whole sequence can be translated in the first reading frame without interruption).
  • struct – a Structure instance defining the structure to analyze. If None, no structure is used (all samples are placed in a single population). If not None, the passed Structure instance must contain a population level or an individual level, or both. If both, they are required to be nested (individuals are mapped to populations). If individuals are specified, the outgroup samples must also be mapped, otherwise the outgroup will be ignored.
  • code – genetic code identifier (see Available genetic codes). Required to be an integer among the valid values. The default value is the standard genetic code.
  • max_missing – maximum proportion of missing data to allow (including stop codons if skipstop is True). By default, all codons with any missing data are excluded.
  • consider_outgroup_missing – if True, outgroup samples are included in the count for missing data (by default, outgroup samples are not considered). Only considered if an Align is passed.
  • skipstop – if True, stop codons are treated as missing data and skipped. If so, potential mutations to stop codons are not taken into account when estimating the number of non-synonymous sites. Warning (this may be counter-intuitive): it actually assumes that stop codons are not biologically plausible and considers them as missing data. On the other hand, if skipstop is False, it considers stop codons as if they were valid amino acids. This option has no effect if raise_stop is True.
  • allow_multiple – by default, if there are only two variant codons at a site, but these codons differ at more than one of the three codon positions, then this site is rejected because of multiple hits at the same codon site. If this option is set to True, consider that two different codon alleles are always caused by a single mutation, even if they differ at more than one of the three positions.
  • raise_stop – raise a ValueError if a stop codon is met. If True, skipstop has no effect. This option does not ensure that stop codons are absent from the alignment as a whole, but only that they are absent from the considered sites.
  • allow_alt – a boolean telling whether alternative start (initiation) codons should be considered. If False, codons are translated as a methionine (M) if, and only if, they are among the alternative start codons for the considered genetic code and they appear at the first position for the considered sequence (excluding all triplets of gap symbols appearing at the 5’ end of the sequence). With this option, it is required that all sequences start with a valid initiation codon unless the first codon is partial or contains invalid data (in such cases, it is ignored).

Linkage disequilibrium

egglib.stats.pairwise_LD(locus1, locus2, multiple_policy='main', min_freq=0)

This function computes linkage disequilibrium between a pair of loci.

Site instances must be used to describe loci. Only the order of samples is considered and both loci must have the same number of samples.

Parameters:
  • locus1 – A Site instance.
  • locus2 – A Site instance.
  • multiple_policy – Determine what is done if either input locus has more than two alleles. Possible values are "forbid" (raise an exception if this occurs), "main" (take the most frequent allele of each locus) and "average" (compute the unweighted average over all possible pairs of alleles). More options might be added in future versions. This option is ignored if both loci have fewer than three alleles. If "main" and there are several equally most frequent alleles, the first-occurring one is used (arbitrarily).
  • min_freq – Only used if at least one site has more than two alleles and multiple_policy is set to "average". Set the minimum absolute frequency to consider an allele.
Returns:

A dictionary of linkage disequilibrium statistics. In case statistics cannot be computed (either site fixed, or less than two samples with non-missing data at both sites), computed values are replaced by None. n gives the number of pairs of alleles considered.
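For two biallelic loci, the usual textbook definitions of D, D' and r can be written in a few lines. The plain-Python sketch below is an illustration under stated assumptions, not EggLib's implementation: biallelic_ld is a hypothetical helper, the dictionary keys are chosen for illustration, D' keeps the sign of D, and the allele of the first complete sample is taken as the focal allele at each locus.

```python
from math import sqrt


def biallelic_ld(locus1, locus2):
    """Return D, D' and r for two biallelic loci (None if not computable)."""
    if len(locus1) != len(locus2):
        raise ValueError("both loci must have the same number of samples")
    # Keep only samples with non-missing data at both loci.
    pairs = [(a, b) for a, b in zip(locus1, locus2)
             if a is not None and b is not None]
    n = len(pairs)
    if n < 2:
        return None
    ref_a, ref_b = pairs[0]                     # focal alleles
    n_a = sum(a == ref_a for a, _ in pairs)
    n_b = sum(b == ref_b for _, b in pairs)
    n_ab = sum(a == ref_a and b == ref_b for a, b in pairs)
    if n_a == n or n_b == n:                    # at least one locus is fixed
        return None
    p_a, p_b, p_ab = n_a / n, n_b / n, n_ab / n
    d = p_ab - p_a * p_b
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    r = d / sqrt(p_a * (1 - p_a) * p_b * (1 - p_b))
    return {"n": n, "D": d, "Dp": d / d_max, "r": r, "rsq": r * r}


print(biallelic_ld([0, 0, 1, 1], [0, 0, 1, 1]))   # complete association
print(biallelic_ld([0, 0, 1, 1], [0, 1, 0, 1]))   # no association
```

Loci with more than two alleles need an extra decision, which is what multiple_policy controls in pairwise_LD().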

egglib.stats.matrix_LD(align, stats, multiple_policy='main', min_freq=0, min_n=2, max_maj=1.0, positions=None, filtr=None)

Compute the matrix of linkage disequilibrium statistics between all pairs of sites of the provided alignment. The computed statistics are selected by an argument of this function. Return a matrix (as a nested list) of the requested statistics. In all cases, all pairs of sites are present in the returned matrices. If statistics cannot be computed, they are replaced by None.

The available statistics are:

  • d – Distance between sites of the pairs.
  • D – Standard linkage disequilibrium.
  • Dp – Lewontin’s D’.
  • r – Correlation coefficient.
  • rsq – Equivalent to r².
Parameters:
  • align – An Align instance.
  • stats – Requested statistic or statistics (see the list of available statistics above), as a single string or as a list of one or more of these statistics (in any order).
  • multiple_policy – Specify what is done for pairs of sites for which at least one locus has more than two alleles. See pairwise_LD() for further description.
  • min_freq – Only used if at least one site has more than two alleles and depending on the value of multiple_policy. See pairwise_LD() for further description.
  • min_n – Minimum number of samples used (this value must always be larger than 1). Sites not fulfilling this criterion will be dropped.
  • max_maj – Maximum relative frequency of the majority allele. Sites not fulfilling this criterion will be dropped.
  • positions – A sequence of positions, whose length must match the number of sites of the provided alignment. Used in the return value to describe the used sites, and, if requested, to compute the distance between sites. By default, the position of sites in the original alignment is used.
  • filtr – A Filter instance providing the list of valid (including missing) allelic values.
Returns:

Returns a tuple with two items: first is the list of positions of sites used in the matrix (a subset of the sites of the provided alignment), with positions provided by the corresponding argument (by default, the index of sites); second is the matrix, as the nested lower half matrix. The matrix contains items for all i and j indexes with 0 <= j <= i < n where n is the number of retained sites. The content of the matrix is represented by a single value (if a single statistic has been requested) or as a list of 1 or more values (if a list of 1 or more, accordingly, statistics have been requested), or None for the diagonal or if the pairwise comparison was dropped for any reason.
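Reading the nested lower-half matrix takes a little care because row i holds i + 1 items and the diagonal is None. The plain-Python sketch below walks such a structure; the positions and values are invented for illustration (a single statistic is assumed) and iter_lower_half is a hypothetical helper.

```python
def iter_lower_half(positions, matrix):
    """Yield (position_i, position_j, value) for all available pairs."""
    for i, row in enumerate(matrix):
        for j, value in enumerate(row):      # j runs from 0 to i
            if i != j and value is not None:
                yield positions[i], positions[j], value


# Mock return value for three retained sites and one requested statistic.
positions = [12, 40, 97]
matrix = [
    [None],                # (0,0): diagonal
    [0.8, None],           # (1,0), (1,1)
    [None, 0.1, None],     # (2,0) was dropped, (2,1), (2,2)
]

for pos_i, pos_j, value in iter_lower_half(positions, matrix):
    print(pos_i, pos_j, value)
```

If several statistics were requested, each non-None cell would be a list of values instead of a single value.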

EHH

class egglib.stats.EHH

This class computes Extended Haplotype Homozygosity statistics and derivatives. Some statistics are available for unphased genotypic data.

The usage of this class is to: first, set the core haplotypes by passing a Site to set_core(), and then load repetitively distant sites (always with increasing distance from the core), through load_distant(), until the list of sites to process is exhausted or one of the thresholds has been reached.

In order to process distant sites using the same core region in the opposite direction, or to use a different core region, it is always required to call set_core() again (this is the only way to reset the instance).

After at least one distant site is loaded, EHH statistics can be accessed using their accessors. EHH statistics fall into four categories:

  1. Raw EHH statistics, provided for each loaded distant site and separately for each core haplotype. See the methods get_EHH(), get_EHHc(), and get_rEHH().

    Reference: Sabeti P.C., D.E. Reich, J.M. Higgins, H.Z.P. Levine, D.J. Richter, S.F. Schaffner, S.B. Gabriel, J.V. Platko, N.J. Patterson, G.J. McDonald, H.C. Ackerman, S.J. Campbell, D. Altshuler, R. Cooper, D. Kwiatkowski, R. Ward & E.S. Lander. 2002. Detecting recent positive selection in the human genome from haplotype structure. Nature 419: 832-837.

  2. Integrated EHH statistics, computed separately for each core haplotype and incremented at each provided site until the EHH value reaches the thresholds provided as the EHH_thr and EHHc_thr arguments to set_core(). See the methods get_iHH(), get_iHHc(), and get_iHS(). The methods done_EHH() and done_EHHc() make it possible to check whether the threshold has been reached for all genotypes.

    Reference: Voight B.F., S. Kudaravalli, X. Wen & J.K. Pritchard. 2006. A map of recent positive selection in the human genome. PLoS Biol 4: e772.

  3. Whole-site EHHS statistic and its integrated statistic iES, which is incremented while the EHHS value is larger than or equal to the threshold provided as the EHHS_thr argument to set_core(). See the methods get_EHHS() and get_iES(). Together with the EHHS decay mentioned below, these are the only statistics that can be computed with unphased data.

    Reference: Tang K., K.R. Thornton & M. Stoneking. 2007. A new approach for using genome scans to detect recent positive selection in the human genome. PLoS Biol. 5: e171.

  4. EHH and EHHS decay statistics, which give the minimal distance at which, respectively, EHH becomes smaller than the threshold provided as the EHH_thr argument to set_core() and EHHS becomes smaller than the threshold provided as the EHHS_thr argument to set_core(). EHH decay is computed separately for each core haplotype. See the methods get_dEHH() and get_dEHHS(). The values are not available until the respective threshold is reached (None is returned). The method done_dEHH() can be used to check whether the threshold for the EHH decay has been reached for all core haplotypes. The maximum and average value of the EHH decay across all core haplotypes can be accessed with get_dEHH_max() and get_dEHH_mean(), respectively.

    Reference: Ramírez-Soriano A., S.E. Ramos-Onsins, J. Rozas, F. Calafell & A. Navarro. 2008. Statistical power analysis of neutrality tests under demographic expansions, contractions and bottlenecks with recombination. Genetics 179 : 555-567.

In all cases, None is returned when the value is not available (no data loaded, division by zero in the case of ratios, or threshold not reached in the case of decay statistics).

The thresholds must all be within the range between 0 and 1.

See the statistics notice for a (very) formal definition of statistics.
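For intuition, the raw EHH of one core haplotype can be sketched in plain Python using the usual definition from Sabeti et al. (2002): the probability that two chromosomes carrying the core haplotype are identical over all sites loaded so far. This is an illustration under that assumption, not the library's implementation (which also handles missing data and is defined formally in the statistics notice); the function name and the toy data are hypothetical.

```python
from collections import Counter
from math import comb


def ehh_sketch(core, extended, core_allele):
    """EHH for the chromosomes carrying `core_allele` at the core site.

    `core` lists the core allele of each chromosome; `extended` lists,
    for the same chromosomes, the tuple of alleles at the distant sites
    loaded so far. Returns None if fewer than two chromosomes are available.
    """
    groups = Counter(ext for c, ext in zip(core, extended) if c == core_allele)
    n = sum(groups.values())
    if n < 2:
        return None
    # Pairs of identical extended haplotypes over all pairs of chromosomes.
    return sum(comb(k, 2) for k in groups.values()) / comb(n, 2)


core = [0, 0, 0, 0, 1, 1]
extended = [(0,), (0,), (0,), (1,), (0,), (1,)]
ehh0 = ehh_sketch(core, extended, 0)  # 3 identical pairs out of 6
ehh1 = ehh_sketch(core, extended, 1)  # the two chromosomes differ
```

Returning None when fewer than two chromosomes are available mirrors the accessors' convention of returning None when a value cannot be computed.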

cur_haplotypes

Current number of haplotypes (None if core has not been set, equal to num_haplotypes if no distant site has been loaded). Not available for unphased data.

done_EHH()

Return True if the values of iHH for all core haplotypes have completed integrating (and all dEHH values have been evaluated). Not available for unphased data.

done_EHHc()

Return True if the values of iHHc for all core haplotypes have completed integrating (and all dEHH values have been evaluated). Not available for unphased data.

get_EHH(i)

Get the EHH value for the last processed distant site for core haplotype i. Return None if the value cannot be computed (no available samples). Not available for unphased data.

get_EHHS()

Get the EHHS value for the last processed distant site. Return None if the value cannot be computed (no available samples).

get_EHHc(i)

Get the EHHc value for the last processed distant site for core haplotype i. Return None if the value cannot be computed (no available samples). Not available for unphased data.

get_dEHH(i)

Get the EHH decay distance for core haplotype i. Return None if the EHH threshold has not been reached. Not available for unphased data.

get_dEHHS()

Get the EHHS decay distance. Return None if the EHHS threshold has not been reached.

get_dEHH_max()

Get the maximum EHH decay distance across core haplotypes. Return None if the EHH threshold has not been reached for at least one of the core haplotypes. Not available for unphased data.

get_dEHH_mean()

Get the average EHH decay distance across core haplotypes. Return None if the EHH threshold has not been reached for at least one of the core haplotypes. Not available for unphased data.

get_dEHHc(i)

Get the EHHc decay distance for core haplotype i. Return None if the EHHc threshold has not been reached. Not available for unphased data.

get_iES()

Get the iES value for the last processed distant site. Return None if the value cannot be computed (no available samples).

get_iHH(i)

Get the iHH value for the last processed distant site for core haplotype i. Return None if the value cannot be computed (no available samples). Not available for unphased data.

get_iHHc(i)

Get the iHHc value for the last processed distant site for core haplotype i. Return None if the value cannot be computed (no available samples). Not available for unphased data.

get_iHS(i)

Get the iHS value for the last processed distant site for core haplotype i. Return None if the ratio cannot be computed (no available sample or division by zero). Not available for unphased data.

get_rEHH(i)

Get the rEHH value for the last processed distant site for core haplotype i. Return None if the ratio cannot be computed (no available sample or division by zero). Not available for unphased data.

load_distant(site, distance)

Process a distant site. The core site must have been specified.

Parameters:
  • site – a site instance containing data for the distant site. It must be consistent with the core site (same list of samples, and in the same order). Missing data are supported.
  • distance – the distance to the core site. Any distance measure can be used; it is only required that distant sites are loaded with increasing distance.
ncur(hap=None)

Number of non-missing samples at the last loaded (core or distant) site.

Parameters:hap – core haplotype index (available for unphased data, but with little relevance). If None, total number of samples.

Similar to nsam().

nsam(hap=None)

Number of non-missing samples. It is not allowed to call this method if core has not been set.

Parameters:hap – core haplotype index (available for unphased data, but with little relevance). If None, total number of samples.
num_haplotypes

Number of core haplotypes taken into consideration (None if core has not been set). Available for unphased data but with little relevance.

set_core(site, unphased=False, min_freq=None, EHH_thr=None, EHHc_thr=None, EHHS_thr=None)

Specify the core region haplotypes, as a Site instance, and set parameters. If the instance already contains data, it will be reset.

Parameters:
  • site – a Site instance. It is assumed that data are phased (samples should be phased; but it is possible to load genotypes whose phase is unknown, as long as the genotypes are phased at the individual level: see unphased).
  • unphased – True if genotypic data are provided and if unphased versions of statistics should be computed. This option requires that ploidy is more than 1. It is still required that individuals are phased (that is, that they are entered in the same order in all distant sites).
  • min_freq – minimal absolute frequency for haplotypes (haplotypes with lower frequencies are ignored). By default, all haplotypes are considered.
  • EHH_thr – threshold determining when iHH should stop integrating and dEHH must be evaluated. Must be None if genotypes is True. By default (None), iHH is permanently incremented and dEHH is not evaluated at all.
  • EHHc_thr – threshold determining when iHHc should stop integrating and dEHHc must be evaluated. Must be None if genotypes is True. By default (if None), use the same value as for EHH_thr.
  • EHHS_thr – threshold determining when iES should stop integrating and dEHHS must be evaluated. Must be None if genotypes is True. By default (None), iES is permanently incremented and dEHHS is not evaluated at all.
tot_haplotypes

Total number of core haplotypes (including those ignored; None if core has not been set). Available for unphased data, but with little relevance.
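The role of the thresholds passed to set_core() can be illustrated with a plain-Python sketch that integrates EHH over distance and reports a decay distance. It assumes trapezoidal integration starting from EHH = 1 at the core and stops at the first site where the value falls below the threshold; how the library treats the final interval is not specified in this section, so that detail is an assumption. The function name and the numbers are hypothetical.

```python
def integrate_ehh(distances, values, thr):
    """Return (integrated value, decay distance) for one core haplotype.

    `distances` must be increasing. The decay distance is the first
    distance at which the value is below `thr`, or None if never reached.
    """
    total = 0.0
    prev_d, prev_v = 0.0, 1.0  # assumption: EHH is 1 at the core itself
    for d, v in zip(distances, values):
        if v < thr:
            return total, d  # stop integrating, report decay distance
        total += (d - prev_d) * (prev_v + v) / 2.0  # trapezoid area
        prev_d, prev_v = d, v
    return total, None  # threshold not reached yet


distances = [1.0, 2.0, 3.0]
values = [0.8, 0.5, 0.2]
ihh_stop, decay_stop = integrate_ehh(distances, values, thr=0.3)
ihh_open, decay_open = integrate_ehh(distances, values, thr=0.1)
```

In the second call the threshold is never crossed, so the decay distance stays None, which matches the behaviour documented for get_dEHH().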

Misorientation probability

class egglib.stats.ProbaMisoriented(align=None)

Estimate the misorientation probability from a user-provided set of sites with a fixed outgroup. Only sites that are either variable within the ingroup or have a fixed difference with respect to the outgroup are considered. Sites with more than two different alleles in the ingroup, or more than one allele in the outgroup, are ignored.

This class implements the method of Baudry and Depaulis (2003), which estimates the probability that a site oriented using the provided outgroup has been misoriented due to a homoplasic mutation in the branch leading to the outgroup. Note that this estimation neglects the probability of shared polymorphism.

Reference: Baudry, E. & F. Depaulis. 2003. Effect of misoriented sites on neutrality tests with outgroup. Genetics 165: 1619-1622.

Parameters:align – an Align containing the sites to analyse.

If the instance is created with an alignment as constructor argument, then the statistics are computed. The method load_align() does the same from an existing ProbaMisoriented instance. Otherwise, individual sites can be loaded with load_site(), and then the statistics can be computed using compute() (the latter is preferable if Freq instances are generated anyway for another use).

New in version 3.0.0.

D

Number of loaded sites with a fixed difference with respect to the outgroup.

S

Number of loaded polymorphic sites (within the ingroup).

TiTv

Ratio of transition to transversion rates. None if the value cannot be computed (no loaded data or null transversion rate). Requires that compute() has been called.

compute()

Compute pM and TiTv statistics. Requires that sites have been loaded using load_site(). This method does not reset the instance.

load_align(align)

Load all sites of align that meet criteria. If there are previously loaded data, they are discarded. This method computes statistics. Data are required to be DNA sequences.

Parameters:align – an Align instance.
load_site(freq)

Load a single site. If there are previously loaded data, they are retained. To actually compute the misorientation probability, the user must call compute().

Parameters:freq – a Freq instance.
pM

Probability of misorientation. None if the value cannot be computed (no loaded data, no valid polymorphism, null transversion rate). Requires that compute() has been called.

reset()

Clear all loaded or computed data.
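The site selection criteria stated in the class description (and reflected in the S and D counters) can be sketched in plain Python. This is only an illustration of the stated rules, not the library's code: the function name is hypothetical, and the sketch does not check whether the outgroup allele is one of the two ingroup alleles at polymorphic sites.

```python
def classify_site(ingroup, outgroup):
    """Return "S" (polymorphic in ingroup), "D" (fixed difference), or None.

    `ingroup` and `outgroup` are sequences of allele values for one site.
    """
    ing, otg = set(ingroup), set(outgroup)
    if not ing or len(otg) != 1:  # outgroup must carry a single allele
        return None
    if len(ing) > 2:              # more than two ingroup alleles: ignored
        return None
    if len(ing) == 2:
        return "S"
    return "D" if ing != otg else None  # fixed ingroup: keep if it differs


sites = [
    ("AAAT", "A"),   # polymorphic within the ingroup
    ("AAAA", "G"),   # fixed difference with the outgroup
    ("AAAA", "A"),   # invariable: not considered
    ("ACGA", "A"),   # three ingroup alleles: ignored
    ("AAAT", "AG"),  # two outgroup alleles: ignored
]
labels = [classify_site(ing, otg) for ing, otg in sites]
num_S = labels.count("S")
num_D = labels.count("D")
```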

ComputeStats

The class ComputeStats computes all available statistics repeatedly on several loci and is the preferred way to perform diversity analyses.

class egglib.stats.ComputeStats(*args, **kwargs)

This class allows customizable and efficient analysis of diversity data. It is designed to minimize redundancy of underlying analyses so it is best to compute as many statistics as possible with a single instance. It also takes advantage of the object reuse policy, improving the efficiency of analyses when several (and especially many) datasets are examined in a row by the same instance of ComputeStats.

The constructor takes arguments that are automatically passed to the configure() method.

Statistics to compute are set using the method add_stats() which allows specifying several statistics at once and which can also be called several times to add more statistics to compute.

add_stats(stat, ...)

Add one or more statistics to compute. Every statistic identifier must be among the list of available statistics, regardless of what data is to be analyzed. If statistics cannot be computed, they will be returned as None. Also reset all currently computed statistics (if any).

all_stats()

Add all possible statistics. Those that cannot be computed will be reported as None. Also reset all currently computed statistics (if any).

clear_stats()

Clear the list of statistics to compute. Note that this automatically resets the instance and clears the already computed stats.

configure(only_diallelic=True, consider_genotype_ordering=False, LD_min_n=2, LD_max_maj=1.0, LD_multiallelic=0, LD_min_freq=0, Rmin_oriented=False, multiple=False)

Configure the instance. The values provided for parameters will affect all subsequent analyses.

Parameters:
  • only_diallelic – Ignore all sites with more than two alleles, considering only alleles present in the ingroup.
  • LD_min_n – Minimal number of non-missing samples. Allows specifying a more stringent filter than max_missing. Only considered for calculating Rozas et al.’s and Kelly’s statistics (the most stringent of the two criteria applies).
  • LD_max_maj – Maximal relative frequency of the main allele. Only considered for calculating Rozas et al.’s and Kelly’s statistics.
  • LD_multiallelic – One of 0 (ignore them), 1 (use main allele only), and 2 (use all possible pairs of alleles). Defines what is done for pairs of sites of which one or both have more than two alleles (while computing linkage disequilibrium). In case of option 2, a filter can be applied with option LD_min_freq. Only considered for calculating Rozas et al.’s and Kelly’s statistics.
  • LD_min_freq – Only considered if option 2 is used for LD_multiallelic. Only consider alleles whose absolute frequency is equal to or larger than the given value. Only considered for calculating Rozas et al.’s and Kelly’s statistics.
  • multiple – allow multiple mutations at the same site.
  • Rmin_oriented – Only for computing Rmin: use only orientable sites.
  • consider_genotype_ordering – consider the order of alleles in a genotype as significant. Only significant if the ploidy of the provided site is larger than 1.

list_stats()

Returns a list of tuples giving, for each available statistic, its code and a short description.

process_align(align, positions=None, struct=None, filtr=None, max_missing=0.0, consider_outgroup_missing=False, multi=False, ignore_ns=False)

Analyze an alignment.

Parameters:
  • align – an Align instance.
  • positions – a list, or other sequence of positions for all sites, or None. If None, use the index of each site. Otherwise, must be a sequence of integer values (length matching the number of sites).
  • struct – a Structure instance describing population structure. struct can also be an integer specifying the level of population labels in the alignment labels table. Only populations can be specified this way. If None (default), don’t use structure. Warning: for Fst, Kst, and Snn statistics and if several alignments/sets of sites are combined the structure must be passed again to results().
  • filtr – a Filter instance determining what allelic values are acceptable and what ones should be considered as missing (all other values causing an exception). By default, nucleotide sequences are accepted (including IUPAC ambiguity characters as missing data), with case-independent matching. The user may want to use a custom instance or one of the predefined instances in the stats module.
  • max_missing – Maximum proportion of missing data. The default is to exclude all sites with missing data. Missing data include all ambiguity characters and alignment gaps. Sites not passing this threshold are ignored for all analyses.
  • consider_outgroup_missing – if True, take the outgroup into account in the max_missing threshold (by default, only ingroup samples are considered).
  • multi – multi-alignment mode. If True, don’t reset the instance (use all loaded data since last reset, if any) and don’t return statistics. (Statistics will be available for all loaded sites using results().) By default, reset the instance before processing sites, return statistics and reset the instance after processing.
  • ignore_ns – this flag should be set if several Align are to be analyzed together and they don’t have the same total ingroup sample size. Only has an effect for statistics involving the distribution of frequencies of derived alleles.
Returns:

A dictionary of statistics, unless multi is set.
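The effect of max_missing can be sketched in plain Python on a single alignment column. This is an illustration of the rule as described (a site is kept only if its proportion of missing data does not exceed the threshold), not the library's implementation; the function name is hypothetical and the missing-data codes are those listed for the predefined DNA filter.

```python
MISSING = set("RYSWKMBDHVN?-")  # missing-data codes of the predefined DNA filter


def passes_max_missing(column, max_missing=0.0):
    """Return True if the column's proportion of missing data is acceptable."""
    if not column:
        return False  # assumption: an empty column is never retained
    n_missing = sum(c.upper() in MISSING for c in column)
    return n_missing / len(column) <= max_missing


complete = passes_max_missing("AAAC")            # no missing data
strict = passes_max_missing("AANA")              # default: any missing excludes
lenient = passes_max_missing("AANA", 0.25)       # 1 missing out of 4 allowed
too_many = passes_max_missing("A-NA", 0.25)      # 2 missing out of 4
```

With the default of 0.0, any ambiguity character or gap in the ingroup column is enough to exclude the site.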

process_freq(frq, alleles=None, position=None, no_return=False)

Analyze a site based on already computed frequencies.

Parameters:
  • frq – a Freq instance.
  • position – position of the site. Must be an integer value. By default, use the site loading index.
  • no_return – don’t return statistics for this site (statistics will be available for all loaded sites using results()).
  • alleles – list of allele values (as integers). Only used if statistic V is required and ignored otherwise. Statistic V is not computed if this argument is not specified. The length of the list must match the total number of alleles.

Returns:

A dictionary of statistics, unless no_return is set.

process_site(site, position=None, struct=None, no_return=False)

Analyze a site.

Parameters:
  • site – a Site instance.
  • position – position of the site. Must be an integer value. By default, use the site loading index.
  • no_return – don’t return statistics for this site (statistics will be available for all loaded sites using results()).
  • struct – a Structure instance describing population structure. Warning: for Fst, Kst, and Snn statistics, the structure must be passed again to results(). By default, doesn’t apply structure.

Returns:

A dictionary of statistics, unless no_return is set.

process_sites(sites, positions=None, struct=None, multi=False, phased=False)

Analyze a list of sites.

Parameters:
  • sites – a list (or other sequence) of Site instances.
  • positions – a list, or other sequence of positions for all sites, or None. If None, use the index of each site. Otherwise, must be a sequence of integer values (length matching the length of sites).
  • multi – if True, don’t reset the instance (use all loaded data since last reset, if any) and don’t return statistics. (Statistics will be available for all loaded sites using results().) By default, reset the instance before processing sites, return statistics and reset the instance after processing.
  • phased – True if the provided sites are phased with each other.
  • struct – a Structure instance describing population structure. Warning: for Fst, Kst, and Snn statistics, the structure must be passed again to results() if several sets of sites/alignments are combined. By default, doesn’t apply structure.

Returns:

A dictionary of statistics, unless multi is set.

reset()

Reset all currently computed statistics (but keep the list of statistics to compute).

results(struct=None, phased=False)

Return the value of statistics for all sites since the last call to this method, to reset(), or any addition of statistics, or the object creation, whichever is most recent. For statistics that cannot be computed, None is returned.

Note

This method never computes statistics linked to linkage disequilibrium (including statistic rD) because those statistics can only be computed if all sites are available at the same time (which is not guaranteed). Those statistics cannot be computed if process_site() is used or if process_sites() is used with the option multi set to True.

Parameters:
  • struct – a Structure instance providing the population structure (required if the Fst, Kst, or Snn statistics have been requested and appropriate data were loaded, and ignored otherwise).
  • phased – if True, assume that the data are phased when multiple alignments have been loaded with the multi option (this allows computing haplotypic and linkage disequilibrium statistics).
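The multi-alignment workflow described for process_align() and results() can be sketched as a small helper. It uses only the method names and arguments documented in this section (reset(), process_align() with struct and multi, results() with struct); the helper name combined_stats and the stand-in class _DemoStats are hypothetical and only make the sketch runnable without data. A real ComputeStats instance, with statistics added through add_stats(), would be passed instead.

```python
def combined_stats(cs, aligns, struct=None):
    """Accumulate several alignments, then return statistics once."""
    cs.reset()  # start from a clean state, keeping the list of statistics
    for aln in aligns:
        # multi=True: don't reset between alignments, don't return statistics
        cs.process_align(aln, struct=struct, multi=True)
    # The structure is passed again, as required for Fst, Kst, and Snn.
    return cs.results(struct=struct)


class _DemoStats:
    """Stand-in recording the call sequence (not egglib.stats.ComputeStats)."""

    def __init__(self):
        self.log = []

    def reset(self):
        self.log.append("reset")

    def process_align(self, align, struct=None, multi=False):
        self.log.append(("process_align", align, multi))

    def results(self, struct=None):
        self.log.append(("results", struct))
        return {}  # placeholder: a real instance returns a dict of statistics


demo_cs = _DemoStats()
demo_out = combined_stats(demo_cs, ["aln1", "aln2"], struct="st")
```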

The statistics available for use with ComputeStats are listed below:

List of statistics

Code Description Per site Per region Whole sample Per pop Per pair
Aing Number of alleles in ingroup NA NA NA NA NA
Aotg Number of alleles in outgroup NA NA NA NA NA
As Number of singleton alleles NA NA NA NA NA
Asd Number of singleton alleles (derived) NA NA NA NA NA
Atot Number of alleles in whole dataset NA NA NA NA NA
B Wall’s B statistic NA NA NA NA NA
Ch Ramos-Onsins and Rozas’s Ch (using singletons) NA NA NA NA NA
ChE Ramos-Onsins and Rozas’s ChE (using external singletons) NA NA NA NA NA
D Tajima’s D NA NA NA NA NA
Da Net pairwise distance (if two populations) NA NA NA NA NA
Deta Tajima’s D using eta instead of S NA NA NA NA NA
Dfl Fu and Li’s D NA NA NA NA NA
Dj Jost’s D NA NA NA NA NA
Dstar Fu and Li’s D* NA NA NA NA NA
Dxy Pairwise distance (if two populations) NA NA NA NA NA
E Zeng et al.’s E NA NA NA NA NA
F Fu and Li’s F NA NA NA NA NA
Fis Inbreeding coefficient NA NA NA NA NA
Fs Fu’s Fs NA NA NA NA NA
Fst Hudson’s Fst NA NA NA NA NA
Fstar Fu and Li’s F* NA NA NA NA NA
Gst Nei’s Gst NA NA NA NA NA
Gste Hedrick’s Gst’ NA NA NA NA NA
He Expected heterozygosity NA NA NA NA NA
Hi Inter-individual heterozygosity NA NA NA NA NA
Hns Fay and Wu’s H (unstandardized) NA NA NA NA NA
Ho Observed heterozygosity NA NA NA NA NA
Hsd Fay and Wu’s H (standardized) NA NA NA NA NA
Hst Hudson’s Hst NA NA NA NA NA
K Number of haplotypes NA NA NA NA NA
Ke Number of haplotypes (only ingroup) NA NA NA NA NA
Kst Hudson’s Kst NA NA NA NA NA
Pi Nucleotide diversity NA NA NA NA NA
Q Wall’s Q statistic NA NA NA NA NA
R Allelic richness NA NA NA NA NA
R2 Ramos-Onsins and Rozas’s R2 (using singletons) NA NA NA NA NA
R2E Ramos-Onsins and Rozas’s R2E (using external singletons) NA NA NA NA NA
R3 Ramos-Onsins and Rozas’s R3 (using singletons) NA NA NA NA NA
R3E Ramos-Onsins and Rozas’s R3E (using external singletons) NA NA NA NA NA
R4 Ramos-Onsins and Rozas’s R4 (using singletons) NA NA NA NA NA
R4E Ramos-Onsins and Rozas’s R4E (using external singletons) NA NA NA NA NA
Rintervals List of start/end positions of recombination intervals NA NA NA NA NA
Rmin Minimal number of recombination events NA NA NA NA NA
RminL Number of sites used to compute Rmin NA NA NA NA NA
S Number of segregating sites NA NA NA NA NA
Snn Hudson’s nearest neighbour statistic NA NA NA NA NA
So Number of segregating orientable sites NA NA NA NA NA
Ss Number of sites with only one singleton allele NA NA NA NA NA
Sso Number of orientable sites with only one singleton allele NA NA NA NA NA
V Allele size variance NA NA NA NA NA
WCisct Weir and Cockerham for hierarchical structure NA NA NA NA NA
WCist Weir and Cockerham for diploid data NA NA NA NA NA
WCst Weir and Cockerham for haploid data NA NA NA NA NA
Z*nS Kelly et al.’s Z*nS NA NA NA NA NA
Z*nS* Kelly et al.’s Z*nS* NA NA NA NA NA
ZZ Rozas et al.’s ZZ NA NA NA NA NA
Za Rozas et al.’s Za NA NA NA NA NA
ZnS Kelly et al.’s ZnS NA NA NA NA NA
eta Minimal number of mutations NA NA NA NA NA
etao Minimal number of mutations at orientable sites NA NA NA NA NA
lseff Number of analysed sites NA NA NA NA NA
lseffo Number of analysed orientable sites NA NA NA NA NA
nM Number of sites available for MFDM test NA NA NA NA NA
nPairs Number of allele pairs used for ZnS, Z*nS, and Z*nS* NA NA NA NA NA
nPairsAdj Allele pairs at adjacent sites (used for ZZ and Za) NA NA NA NA NA
ns_site Number of analyzed samples per site NA NA NA NA NA
nseff Average number of exploitable samples NA NA NA NA NA
nseffo Average number of exploitable samples at orientable sites NA NA NA NA NA
nsingld Number of derived singletons NA NA NA NA NA
nsmax Maximal number of available samples per site NA NA NA NA NA
nsmaxo Maximal number of available samples per orientable site NA NA NA NA NA
numFxA Number of fixed alleles NA NA NA NA NA
numFxA* Sites with at least 1 fixed allele NA NA NA NA NA
numFxD Number of fixed differences NA NA NA NA NA
numFxD* Sites with at least 1 fixed difference NA NA NA NA NA
numShA Number of shared alleles NA NA NA NA NA
numShA* Sites with at least 1 shared allele NA NA NA NA NA
numShP Number of shared segregating alleles NA NA NA NA NA
numShP* Sites with at least 1 shared segregating allele NA NA NA NA NA
numSp Number of population-specific alleles NA NA NA NA NA
numSp* Sites with at least 1 pop-specific allele NA NA NA NA NA
numSpd Number of population-specific derived alleles NA NA NA NA NA
numSpd* Sites with at least 1 pop-specific derived allele NA NA NA NA NA
pM P-value of MFDM test NA NA NA NA NA
rD R_bar{d} statistic NA NA NA NA NA
singl Index of sites with at least one singleton allele NA NA NA NA NA
singl_o Index of orientable sites with at least one singleton allele NA NA NA NA NA
sites Index of polymorphic sites NA NA NA NA NA
sites_o Index of orientable polymorphic sites NA NA NA NA NA
thetaH Fay and Wu’s estimator of theta NA NA NA NA NA
thetaIAM Theta estimator based on He & IAM model NA NA NA NA NA
thetaL Zeng et al.’s estimator of theta NA NA NA NA NA
thetaPi Pi using orientable sites NA NA NA NA NA
thetaSMM Theta estimator based on He & SMM model NA NA NA NA NA
thetaW Watterson’s estimator of theta NA NA NA NA NA

Structure

class egglib.stats.Structure

Describe the organisation of samples in individuals, populations, and clusters of populations. The structure is necessarily hierarchical and all levels are always defined but it is possible to bypass any level if information for this level is not available or irrelevant. The number of individuals per population can vary, but the number of samples per individuals (that is, the ploidy) must be constant.

New objects are created using the functions get_structure() and make_structure(), and existing objects can be recycled with the corresponding methods get_structure() and make_structure().

as_dict()

Return a tuple of two dict representing, respectively, the ingroup and outgroup structure.

The ingroup dictionary is a three-fold nested dictionary (meaning it is a dictionary of dictionaries of dictionaries) holding lists of sample indexes. The keys are, respectively, cluster, population, and individual labels. Based on how the instance was created, there may be just one item or even none at all in any dictionary. In practice, if d is the ingroup dict and clt, pop and idv are, respectively, cluster, population, and individual labels, the expression d[clt][pop][idv] will yield a list of sample indexes.

The outgroup dictionary is a non-nested dictionary with individual labels as keys and lists of sample indexes as values.
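To make the nesting concrete, here is a hand-written pair of dictionaries in the format described above, together with a hypothetical helper that flattens the ingroup dictionary into a list of sample indexes per population. The labels and indexes are made up for illustration, and the helper is not part of the library.

```python
# One cluster (0), two populations (0 and 1), three diploid individuals.
ingroup_example = {
    0: {
        0: {0: [0, 1], 1: [2, 3]},  # population 0: individuals 0 and 1
        1: {2: [4, 5]},             # population 1: individual 2
    },
}
# One diploid outgroup individual, with its own sample indexes.
outgroup_example = {0: [0, 1]}


def samples_per_population(ingroup):
    """Map each population label to the sorted list of its sample indexes."""
    result = {}
    for clt, pops in ingroup.items():
        for pop, indivs in pops.items():
            result[pop] = sorted(s for samples in indivs.values() for s in samples)
    return result


per_pop = samples_per_population(ingroup_example)
one_individual = ingroup_example[0][1][2]  # d[clt][pop][idv]
```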

get_structure(data, lvl_clust=None, lvl_pop=None, lvl_indiv=None, pop_filter=None, ploidy=None, skip_outgroup=False)

Reset the instance as if it was built using get_structure(). The definitions of arguments are identical.

make_auxiliary()

Return a new Structure instance describing the organisation of individuals in clusters and populations (ignoring the intra-individual level) using the rank of individuals as indexes, the individuals being ranked in the order of increasing cluster and population indexes.

make_structure(ingroup, outgroup)

Reset the instance as if it was built using make_structure(). The definitions of arguments are identical.

no

Number of processed outgroup samples.

ns

Number of processed ingroup samples.

num_clust

Number of clusters.

num_indiv_ingroup

Total number of ingroup individuals.

num_indiv_outgroup

Number of outgroup individuals.

num_pop

Total number of populations.

ploidy

Ploidy.

req_no

Required number of outgroup samples in objects using this structure (equal to the largest index overall plus one).

req_ns

Required number of ingroup samples in objects using this structure (equal to the largest index overall plus one).

egglib.stats.get_structure(data, lvl_clust=None, lvl_pop=None, lvl_indiv=None, pop_filter=None, ploidy=None, skip_outgroup=False)

Create a new Structure instance based on the group labels of an Align or Container instance.

Parameters:
  • data – an Align or Container instance containing the grouping levels to be processed.
  • lvl_clust – index of the grouping level containing cluster labels. If None, all populations are placed in a single cluster with label 0.
  • lvl_pop – index of the grouping level containing population labels. If None, all individuals of a cluster are placed in a single population with the same label as their cluster.
  • lvl_indiv – index of the grouping level containing individual labels. If None, individuals are not specified and each sample is placed in a haploid individual, for both the ingroup and the outgroup (unless skip_outgroup is True), meaning that outgroup group labels are ignored.
  • pop_filter – process only the populations bearing the label or labels provided in the provided list. If None, all populations are processed. An empty list is the same as None. It is allowed to include repeated labels in the list, as well as labels that are not actually present in the data.
  • ploidy – indicate the ploidy. Must be a positive number and, if specified, data must match the value. If not specified, ploidy will be detected automatically (it must still be consistent over all ingroup and outgroup individuals). Ploidy is ignored if lvl_indiv is None.
  • skip_outgroup – specify if outgroup samples should be skipped. No effect if there are no outgroup samples. If lvl_indiv is not None, the group labels of outgroup samples are considered to be individual labels and the outgroup individuals will be recorded (requiring a consistent ploidy as well). If lvl_indiv is None, outgroup samples are imported in one-sample individuals just like ingroup samples. If the skip_outgroup flag is set, outgroup samples are not imported at all.
Returns:

A new Structure.

egglib.stats.make_structure(ingroup, outgroup)

Create a new Structure instance based on the structure provided as dictionaries. The two arguments must match the format of the return value of the Structure.as_dict() method (see the documentation). Either argument can be replaced by None which is equivalent to an empty dictionary (no samples). Note that all keys must be positive integers.

Parameters:
  • ingroup – a three-fold nested dictionary of ingroup samples indexes, or None.
  • outgroup – a dictionary of outgroup samples indexes, or None.
Returns:

A new Structure.
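A dictionary in the expected format can be generated programmatically. The sketch below builds an ingroup dictionary from a list of population sizes, assuming a single cluster labelled 0, consecutive individual labels, and samples stored individual by individual; the function name and these conventions are hypothetical choices for illustration, not requirements of the library.

```python
def build_ingroup(pop_sizes, ploidy=2):
    """Build a three-fold nested dict {cluster: {pop: {indiv: [samples]}}}.

    `pop_sizes` gives the number of individuals in each population.
    """
    pops = {}
    indiv = 0   # next individual label
    sample = 0  # next sample index
    for pop, size in enumerate(pop_sizes):
        pops[pop] = {}
        for _ in range(size):
            pops[pop][indiv] = list(range(sample, sample + ploidy))
            indiv += 1
            sample += ploidy
    return {0: pops}  # everything placed in a single cluster


built = build_ingroup([2, 1])
```

The resulting dictionary could then be passed as the ingroup argument, with None as outgroup if there are no outgroup samples.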

Filter

This class is used to define filters for the analysis of allele values:

class egglib.stats.Filter(exploitable=None, rng=None, missing=None, exploitable_alias=None, missing_alias=None, lower_alias=False)

Holds lists of valid (exploitable and missing) data codes. The constructor allows the class to be instantiated immediately with all desired information. It is also possible to call the modifier methods to add information. Note that this module provides a few pre-defined instances with common settings that the user may use and possibly extend.

Note

If no codes are specified (which is the default behaviour if no arguments are specified), the filter will treat all codes as acceptable.

Parameters:
  • exploitable – Value or values used as exploitable data. Possible values are None, a single integer, a list (or other iterable) of integers, or a string. An iterable of one-character strings is also possible. Negative integers are supported.
  • rng – A range of integer values taken as exploitable. Must contain two values, providing the minimum and maximum values.
  • missing – Value or values used as missing data. Specifications as for exploitable.
  • lower_alias – For all exploitable and missing (but not rng) values, use the lower case character as alias. Ignored for all integer values.
  • exploitable_alias – Provide the alias values for exploitable data. The number of values and their order must match the argument exploitable.
  • missing_alias – Provide the alias values for missing data. The number of values and their order must match the argument missing.
exploitable(values, alias=None, lower_alias=False)

Add exploitable values. See the class documentation for more details.

missing(values, alias=None, lower_alias=False)

Add missing values. See the class documentation for more details.

rng(mini, maxi)

Add a range of exploitable data. See the class documentation for more details.
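The semantics described for this class (exploitable codes, missing codes, optional lower-case aliases, an error for anything else, and acceptance of everything when no codes are specified) can be mimicked with a small plain-Python class. The class FilterSketch and its check() method are hypothetical illustrations of the documented behaviour, not the library's Filter API, and integer ranges and explicit aliases are left out for brevity.

```python
class FilterSketch:
    """Toy model of a filter over one-character allele codes."""

    def __init__(self, exploitable="", missing="", lower_alias=False):
        self._missing = set(missing)
        self._map = {}  # accepted code (or alias) -> canonical code
        for code in list(exploitable) + list(missing):
            self._map[code] = code
            if lower_alias:
                self._map[code.lower()] = code

    def check(self, value):
        """Return the canonical code, None if missing; raise if invalid."""
        if not self._map:  # no codes specified: accept everything
            return value
        try:
            code = self._map[value]
        except KeyError:
            raise ValueError(f"invalid allele code: {value!r}") from None
        return None if code in self._missing else code


# Settings copied from the descriptions of filter_dna and filter_strict.
dna_like = FilterSketch("ACGT", "RYSWKMBDHVN?-", lower_alias=True)
strict_like = FilterSketch("ACGT", "N")
checked = [dna_like.check(c) for c in "aCn-"]
```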

Pre-defined instances are available. They are all extendable:

egglib.stats.filter_default

All possible allele values are accepted as valid. This filter is case-sensitive.

egglib.stats.filter_dna

Used for DNA sequences (case-independent). Exploitable: ACGT. Missing: RYSWKMBDHVN?-. Lower-case characters are mapped to the corresponding upper-case character.

egglib.stats.filter_rna

Used for RNA sequences (case-independent). Exploitable: ACGU. Missing: RYSWKMBDHVN?-. The IUPAC codes for missing data are identical in meaning to those for DNA sequences. Lower-case characters are mapped to the corresponding upper-case character.

egglib.stats.filter_strict

Paranoid filter for DNA sequences. Exploitable: ACGT. Missing: N. Lower-case characters are not allowed.

egglib.stats.filter_ssr

Used for strictly positive SSR alleles. Exploitable: range 1–999. Missing: -1.

egglib.stats.filter_num

Used for numerical values. Exploitable: range -999–999. Missing: none.

egglib.stats.filter_codon

Used for numerical codon representation. Exploitable: range 0–63. Missing: 64.

egglib.stats.filter_amino

Used for amino acid sequences. Exploitable: ACDEFGHIKLMNPQRSTVWY. Missing: X-. Lower-case characters are mapped to the corresponding upper-case character.

Constants

Here are three variables determining what is done for computing linkage disequilibrium between a pair of sites of which at least one has more than two alleles.

egglib.stats.multiallelic_ignore

Skip pairs of sites when at least one has more than two alleles.

egglib.stats.multiallelic_use_main

If more than two alleles, use the most frequent one only.

egglib.stats.multiallelic_use_all

If more than two alleles, use all possible pairs of alleles.