Store allelic and genotypic data at a single site. Allelic values are represented by integers. In case of complex, such as structural, variation, one can use the index of variants as allelic values, and some objects can lack allelic value information (a number of alleles of 0 will be reported).
Loading data is incremental: processed new data will be added to previously loaded data (requiring that ploidy matches). To avoid this, use reset().
Generate the list of alleles
Generate one or two lists containing data from the instance.
Parameters: 


Returns:  Lists of allele indexes (or None for missing data). The return value is either a single list or two lists (depending on the value of skip_outgroup), and the list or lists contain tuples representing individuals unless flat is True. 
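As a rough illustration of the two layouts described above (tuples per individual versus a flat list of samples), the following plain-Python sketch converts between them. It does not call EggLib; the helper names flatten and group are hypothetical.

```python
def flatten(genotypes):
    """Turn a list of per-individual tuples into a flat list of alleles."""
    return [allele for genotype in genotypes for allele in genotype]


def group(alleles, ploidy):
    """Group a flat list of alleles into per-individual tuples."""
    if len(alleles) % ploidy != 0:
        raise ValueError("number of samples is not a multiple of the ploidy")
    return [tuple(alleles[i:i + ploidy]) for i in range(0, len(alleles), ploidy)]


# Three diploid individuals; None marks missing data.
ingroup = [(0, 0), (0, 1), (None, 1)]
flat = flatten(ingroup)
print(flat)
print(group(flat, 2))
```

The flat layout corresponds to treating every sample as a haploid individual, which is why a flat input implies a ploidy of 1.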
Get a given ingroup genotype or allele, as allele indexes.
Parameters: 


If chrom is omitted, return the genotype as a tuple. Otherwise, return the specific allele. Setting chrom to None is the same as omitting it. chrom can be safely omitted for haploid data.
Total number of samples (alleles) in the outgroup.
Total number of samples (alleles) in the ingroup.
Number of alleles.
Total number of individuals in the ingroup.
Number of missing data (expressed in number of samples).
Number of missing data (expressed in number of samples) in the ingroup only.
Number of missing data (expressed in number of samples) in the outgroup only.
Total number of individuals in the outgroup.
outroup(indiv[, chrom])
Get a given outgroup genotype or allele, as allele indexes.
Parameters: 


If chrom is omitted, return the genotype as a tuple. Otherwise, return the specific allele. Setting chrom to None is the same as omitting it. chrom can be safely omitted for haploid data.
Current value of the ploidy (to change it, one needs to reset the instance).
Import data from the provided Align to the data currently held by the instance. Arguments are identical to those of the function site_from_align(), except reset.
Parameters:  reset – if True, reset the instance as if newly created. If False, append the data to current data, if any. 

If reset is False and this instance currently holds data, the ploidy defined by the struct argument is required to match the current value. If struct is None, the implied ploidy is 1 and is still required to match.
Import data from the provided lists to the data currently held by the instance. Arguments are identical to those of the function site_from_list(), except reset.
Parameters:  reset – if True, reset the instance as if newly created. If False, append the data to current data, if any. 

If reset is False and this instance currently holds data, the ploidy defined by the input data is required to match the current value. If flat is True, the implied ploidy is 1 and is still required to match.
Import data from the provided VCF parser to the data currently held by the instance. Arguments are identical to those of the function site_from_vcf(), except reset.
Parameters:  reset – if True, reset the instance as if newly created. If False, append the data to current data, if any. 

If reset is False and this instance currently holds data, the ploidy defined by the input data is required to match the current value. If flat is True, the implied ploidy is 1 and is still required to match.
Warning
VCF genotypes are exported as allele indexes (0 for the reference allele). This function treats them as allele values, meaning that they might be shifted (0 is the first allele found, which is not necessarily the reference allele). Be aware of this fact when appending VCF data to data from other sources (including other VCF files) in the same site data, or when processing individual alleles.
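The shift described in this warning can be pictured with a small plain-Python sketch. It assumes that "first allele found" means the order of first appearance among the samples, which is an interpretation rather than something stated here; the function name is hypothetical and EggLib is not called.

```python
def recode_by_first_appearance(gt_indexes):
    """Map VCF allele indexes to indexes assigned in order of first appearance.

    Missing data (None) is left unchanged.
    """
    mapping = {}
    recoded = []
    for idx in gt_indexes:
        if idx is None:
            recoded.append(None)
            continue
        if idx not in mapping:
            mapping[idx] = len(mapping)
        recoded.append(mapping[idx])
    return recoded, mapping


# VCF indexes: 0 is the reference allele, 1 and 2 are alternate alleles.
vcf_gt = [1, 1, 0, 2, None, 0]
recoded, mapping = recode_by_first_appearance(vcf_gt)
print(recoded)
print(mapping)
```

In this example the reference allele ends up with index 1 because an alternate allele is met first, so index 0 cannot be assumed to denote the reference once data have been loaded.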
Clear all data from the instance.
Import allelic and genotypic data from a position of the provided Align instance. The struct argument allows processing only a subset of the samples, and also controls the genotypic structure and the ploidy. This means that the same Structure must be used again to further process the resulting Site instance.
Parameters: 


Returns:  A new Site instance. The numbers of ingroup and outgroup items of this instance are defined by the Structure instance passed as struct and can be smaller than the number of samples of the original alignment. 
Import allelic and genotypic data from provided lists. Input data are equivalent to the return value of Site.as_list(). ingroup and outgroup provide data for the ingroup and outgroup respectively. Either can be replaced by None, which is equivalent to an empty list. They are expected to be lists of tuples if flat is False, and lists of integers otherwise, but lists and tuples can be replaced by other sequence types.
Parameters: 


Import allelic and genotypic data from a VCF parser. The VCF parser must have processed a variant and the variant is required to have genotypic data available as the GT format field. An exception is raised otherwise.
Warning
VCF genotypes are exported as allele indexes (0 for the reference allele). This function treats them as allele values, meaning that they might be shifted (0 is the first allele found, which is not necessarily the reference allele). Be aware of this fact when appending VCF data to data from other sources (including other VCF files) in the same site data, or when processing individual alleles.
Parameters: 


Hold allelic and genotypic frequencies for a single site. Freq instances can be created using the three functions freq_from_site(), freq_from_list(), and freq_from_vcf(), or using the default constructor. However they were created, instances can be reused (which is faster) through their methods process_site(), process_list(), and process_vcf().
Get the frequency of an allele.
Parameters: 


Get the frequency of a genotype.
Parameters: 


Get a genotype, as a tuple of allele indexes.
Get the number of individuals within a given compartment. In the haploid case, this method is identical to nseff().
Parameters: 


Get the number of samples within a given compartment.
Parameters: 


Number of alleles in the whole site.
Number of clusters.
Number of genotypes in the whole site.
Number of populations.
Ploidy
Reset the instance as if it had been created using freq_from_list(). Arguments are identical to this function.
Reset the instance as if it had been created using freq_from_site(). Arguments are identical to this function.
Reset the instance as if it had been created using freq_from_vcf(). Argument is identical to this function.
Create a new Freq instance based on data of the provided site.
Parameters: 


Returns:  A new Freq instance. 
Create a new Freq instance based on already computed frequency data.
Parameters: 


Returns:  A new Freq instance. 
Note that it is required that there is at least one cluster and one population.
Import allelic frequencies from a VCF parser. The VCF parser must have processed a variant and the variant is required to have frequency data available as the AC format field along with the AN field. An exception is raised otherwise.
This function only imports haploid allele frequencies in the ingroup (without structure). The first allele is the reference, by construction, then all the alternate alleles in the order in which they are provided in the VCF file.
Parameters:  vcf – a VcfParser instance containing data. There must be at least one sample and the AN/AC format fields must be available. It is not required to extract variant data manually. 
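To make the allele order concrete, here is a minimal sketch of how per-allele counts can be derived from AN and AC values, using their usual VCF meaning (AN is the total number of alleles in called genotypes, AC the count of each alternate allele). The function name is hypothetical and this is not EggLib's own code.

```python
def allele_counts(an, ac):
    """Return [ref_count, alt1_count, alt2_count, ...] from AN and AC values."""
    ref = an - sum(ac)
    if ref < 0:
        raise ValueError("sum of AC exceeds AN")
    return [ref] + list(ac)


# 20 called alleles in total, two alternate alleles seen 3 and 5 times.
counts = allele_counts(an=20, ac=[3, 5])
print(counts)
```

The reference count comes first, followed by the alternate alleles in file order, matching the ordering described above.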

Compute diversity statistics for a gene family (Innan 2003). An estimate of genetic diversity is provided for every paralog and for every pair of paralogs, provided that enough non-missing data is available (at least 2 samples are required). Note that sites with more than two alleles are always considered.
Parameters: 


Returns:  A new ParalogPi instance which provides methods to access the number of used sites and the diversity for each paralog/paralog pair. 
Class computing Innan’s within- and between-paralog diversity statistics. See paralog_pi() for more details. This class can be used directly (1) to analyse data more efficiently (by reusing the same instance), (2) to combine data from different alignments, or (3) to pass individual sites. Call setup() first.
Get between-paralog diversity for paralogs i and j.
Get within-paralog diversity for paralog i.
Number of sites with any data (without arguments), with data for paralog i (if only i specified), or with data for the pair of paralogs i and j (if both specified).
Process an alignment matching the structure passed to setup(). Diversity estimates are incremented (no reset).
Parameters: 


Process a site matching the structure passed to setup(). Diversity estimates are incremented (no reset).
Parameters:  site – a Site instance. 

Specify the structure in paralog and individuals. The two arguments are Structure instances as described for paralog_pi(). Only this method resets the instance.
Identify haplotypes from sites provided as either an Align instance or a list of Site instances, and return data as a single Site instance containing one sample for each sample of the original data. Alleles in the returned site represent all identified haplotypes (or missing data when the haplotypes could not be derived).
Note
There must be at least one site with at least two alleles (overall, including the outgroup), otherwise the produced site only contains missing data.
Parameters: 


Returns:  A Site instance (if dest is None) or None (otherwise). 
This class processes alignments with a reading frame specification in order to detect synonymous and nonsynonymous variable positions. It provides basic statistics, but it can also filter data to let the user compute all other statistics on synonymous-only or nonsynonymous-only variation (e.g. D).
The constructor takes optional arguments. By default, build an empty instance. If arguments are passed, they must match the signature of process() that will be called.
The method process() does all the work. Once it is called, data are available through the different instance attributes, and it is possible to generate alignments containing only codon sites with either one synonymous or one nonsynonymous mutation. It is also possible to iterate over sites of both kinds. In both cases, the generated data contain only codons, where each codon is represented by a single integer (see tools.int2codon()). These data can be analysed in the module stats using the predefined Filter instance stats.filter_codon.
Note that, currently, the outgroup is ignored.
Iterate over nonsynonymous sites. Mostly similar to the method iter_S() (see important warning about the fact that returned values are actually always the same SiteFrequency instance that is updated at each iteration round).
Iterate over synonymous sites. Proposed as a more performant alternative to mk_align_S(). This method returns a generator that can be used in expressions such as:
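import egglib

# align is an Align instance and frame a reading frame specification
cs = egglib.stats.ComputeStats()
cs.add_stats('thetaW')
thetaW = 0.0
cdiv = egglib.stats.CodingDiversity(align, frame)
for site in cdiv.iter_S():
    # site is the same object at every round: use it only inside the loop body
    stats = cs.process_site(site)
    thetaW += stats['thetaW']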
Warning
This method repeatedly returns the same SiteFrequency instance, with its information updated at each round. The reason for this is performance. Never keep a reference to the iterator variable (that’s site in the example above) outside the loop body.
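The pitfall behind this warning can be reproduced without EggLib. The sketch below uses a hypothetical stand-in class (SiteView) and a generator that yields the same object at every round, which is the behaviour described above.

```python
class SiteView:
    """Hypothetical stand-in for an object that is reused across iterations."""

    def __init__(self):
        self.value = None


def iter_reused(values):
    """Yield the same SiteView object, updated at each round."""
    view = SiteView()
    for v in values:
        view.value = v
        yield view


# Wrong: references are kept, but they all point to the same object,
# which holds the state of the last round once the loop is over.
kept = [site for site in iter_reused([10, 20, 30])]
wrong = [site.value for site in kept]

# Right: extract what is needed inside the loop body.
right = [site.value for site in iter_reused([10, 20, 30])]

print(wrong)
print(right)
```

Copying the values of interest (or computing statistics) inside the loop avoids the problem while keeping the performance benefit of object reuse.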
Create an Align instance with only nonsynonymous codon sites. Mostly similar to the method mk_align_S().
Create an Align instance with only synonymous codon sites. The alignment contains the same number of ingroup and outgroup samples as the original alignment, and a number of sites equal to num_pol_S. Note that the returned alignment does not have group labels. These data can be analysed in the module stats using the predefined Filter instance stats.filter_codon.
Number of codon sites that have been analysed (like num_codons_tot but excluding sites rejected because of missing data).
Number of codon sites (among those that have been analysed) with at least one stop codon in them.
Total number of codon sites that have been considered. Only complete codons are considered, but this value includes codons that have been rejected because of missing data.
Number of polymorphic coding sites with more than two alleles. These sites are included only if multiple_alleles is True, except those that mix synonymous and nonsynonymous changes (they can be rejected if there are more than two alleles in total as well).
Number of polymorphic codons for which more than one position is changed. These sites are included only if multiple_hits is True and depending on the total number of alleles.
Number of polymorphic codon sites with only one nonsynonymous mutation.
Number of polymorphic codon sites with only one synonymous mutation.
Number of polymorphic coding sites with only one mutation. All these sites are always included.
Estimated number of nonsynonymous sites. Note that the total number of sites per codon is always 3.
Estimated number of synonymous sites. Note that the total number of sites per codon is always 3.
Process an alignment. If this instance already had data in memory, they will all be erased.
Parameters: 


This function computes linkage disequilibrium between a pair of loci.
Site instances must be used to describe loci. Only the order of samples is considered and both loci must have the same number of samples.
Parameters: 


Returns:  A dictionary of linkage disequilibrium statistics. If statistics cannot be computed (either a site is fixed, or fewer than two samples have non-missing data at both sites), computed values are replaced by None. n gives the number of pairs of alleles considered. 
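For readers unfamiliar with the quantities involved, the following plain-Python sketch computes the classical D, D' and r² for two biallelic loci given as parallel lists of alleles. It only illustrates the textbook definitions; the dictionary keys other than n, the handling of multiallelic sites, and the function name are assumptions and may not match what EggLib returns.

```python
def pair_ld(locus1, locus2):
    """Textbook LD statistics for two loci given as parallel lists of alleles."""
    # Keep only samples with non-missing data at both loci.
    pairs = [(a, b) for a, b in zip(locus1, locus2)
             if a is not None and b is not None]
    n = len(pairs)
    result = {"n": n, "D": None, "Dp": None, "rsq": None}
    if n < 2:
        return result
    alleles1 = sorted({a for a, _ in pairs})
    alleles2 = sorted({b for _, b in pairs})
    if len(alleles1) != 2 or len(alleles2) != 2:
        return result  # fixed or multiallelic site: not handled in this sketch
    a, b = alleles1[0], alleles2[0]
    p_a = sum(1 for x, _ in pairs if x == a) / n
    p_b = sum(1 for _, y in pairs if y == b) / n
    p_ab = sum(1 for x, y in pairs if x == a and y == b) / n
    d = p_ab - p_a * p_b
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    result["D"] = d
    result["Dp"] = d / d_max  # signed D'
    result["rsq"] = d * d / (p_a * (1 - p_a) * p_b * (1 - p_b))
    return result


res = pair_ld([0, 0, 1, 1], [0, 0, 1, 1])
print(res)
```

Returning None for every statistic when a site is fixed or when fewer than two usable samples remain mirrors the behaviour described for the return value above.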
Compute the matrix of linkage disequilibrium statistics between all pairs of sites of the provided alignment. The computed statistics are selected by an argument of this function. Return a matrix (as a nested list) of the requested statistics. In all cases, all pairs of sites are present in the returned matrices. If statistics cannot be computed, they are replaced by None.
The available statistics are:
Parameters: 


Returns:  A tuple with two items: the first is the list of positions of sites used in the matrix (a subset of the sites of the provided alignment), with positions provided by the corresponding argument (by default, the index of sites); the second is the matrix, as the nested lower half matrix. The matrix contains items for all i and j indexes with 0 <= j <= i < n where n is the number of retained sites. The content of the matrix is represented by a single value (if a single statistic has been requested) or as a list of 1 or more values (if a list of 1 or more, accordingly, statistics have been requested), or None for the diagonal or if the pairwise comparison was dropped for any reason. 
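The nested lower-half layout can be sketched as follows. This is only an illustration of the indexing convention (row i holds entries for j = 0..i, with None on the diagonal); the function name is hypothetical and the placeholder statistic is a simple distance between positions, not an LD measure.

```python
def lower_half_matrix(n, stat):
    """Build a nested lower-half matrix: row i holds entries for j = 0..i.

    The diagonal (j == i) is set to None.
    """
    return [[None if i == j else stat(i, j) for j in range(i + 1)]
            for i in range(n)]


positions = [12, 57, 103]
matrix = lower_half_matrix(len(positions),
                           lambda i, j: positions[i] - positions[j])

# Access pattern: matrix[i][j] is defined only for j <= i.
for i, row in enumerate(matrix):
    for j, value in enumerate(row):
        print(positions[i], positions[j], value)
```

With this layout, row i has i + 1 items, so code consuming the matrix should always index it with the larger site index first.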
This class computes Extended Haplotype Homozygosity statistics and derivatives. Statistics can be computed for unphased genotypic data.
The usage of this class is to: first, set the core haplotypes by passing a Site to set_core(), and then load repetitively distant sites (always with increasing distance from the core), through load_distant(), until the list of sites to process is exhausted or one of the thresholds has been reached.
In order to process distant sites using the same core region to the opposite direction, or to use a different core region, it is always required to call set_core() again (this is the only way to reset the instance).
After at least one distant site is loaded, EHH statistics can be accessed using their accessors. EHH statistics fall into four categories:
Raw EHH statistics, provided for each loaded distant sites and separately for each core haplotype. See the methods: get_EHH(), get_EHHc(), and get_rEHH().
Reference: Sabeti P.C., D.E. Reich, J.M. Higgins, H.Z.P. Levine, D.J. Richter, S.F. Schaffner, S.B. Gabriel, J.V. Platko, N.J. Patterson, G.J. McDonald, H.C. Ackerman, S.J. Campbell, D. Altshuler, R. Cooper, D. Kwiatkowski, R. Ward & E.S. Lander. 2002. Detecting recent positive selection in the human genome from haplotype structure. Nature 419: 832-837.
Integrated EHH statistics, computed separately for each core haplotype and incremented at each provided site until the EHH value reaches the thresholds provided as the EHH_thr and EHHc_thr arguments to set_core(). See the methods get_iHH(), get_iHHc(), and get_iHS(). The methods done_EHH() and done_EHHc() allow checking whether the threshold has been reached for all genotypes.
Reference: Voight B.F., S. Kudaravalli, X. Wen & J.K. Pritchard. 2006. A map of recent positive selection in the human genome. PLoS Biol. 4: e72.
Whole-site EHHS statistic and its integrated statistic iES, which is incremented while the EHHS value is larger than or equal to the threshold provided as the EHHS_thr argument to set_core(). See the methods get_EHHS() and get_iES(). If data are unphased, a specific EHHS estimate based on homozygosity is provided (EHHG, with its iEG integrated counterpart).
Reference: Tang K., K.R. Thornton & M. Stoneking. 2007. A new approach for using genome scans to detect recent positive selection in the human genome. PLoS Biol. 5: e171.
EHH, EHHS, and EHHG decay statistics, which give the minimal distance at which, respectively, EHH starts to be smaller than the threshold provided as the EHH_thr argument to set_core(), EHHS starts to be smaller than the threshold provided as the EHHS_thr argument to set_core(), and EHHG starts to be smaller than the threshold provided as the EHHG_thr argument to set_core(). EHH decay is computed separately for each core haplotype. See the methods get_dEHH(), get_dEHHS(), and get_dEHHG(). The values are not available until the respective threshold is reached (None is returned). The method done_dEHH() allows checking whether the threshold for the EHH decay has been reached for all core haplotypes. The maximum and average values of the EHH decay across all core haplotypes can be accessed with get_dEHH_max() and get_dEHH_mean(), respectively.
Reference: Ramírez-Soriano A., S.E. Ramos-Onsins, J. Rozas, F. Calafell & A. Navarro. 2008. Statistical power analysis of neutrality tests under demographic expansions, contractions and bottlenecks with recombination. Genetics 179: 555-567.
In all cases, None is returned when the value is not available (no data loaded, division by zero in the case of ratios, or threshold not reached in the case of decay statistics).
The thresholds must all be within the range between 0 and 1.
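As background for the raw EHH statistic, the sketch below follows the definition of Sabeti et al. (2002): among the chromosomes carrying a given core haplotype, EHH is the probability that two chromosomes drawn at random are identical over the whole region from the core to the current distant site. It is a plain-Python illustration for phased data without missing values, not a description of EggLib's internals; the function name is hypothetical.

```python
from collections import Counter


def ehh(extended_haplotypes):
    """EHH among chromosomes sharing one core haplotype.

    extended_haplotypes: one hashable haplotype (core plus distant sites)
    per chromosome carrying the core haplotype.
    """
    n = len(extended_haplotypes)
    if n < 2:
        return None  # not enough samples to draw a pair
    counts = Counter(extended_haplotypes)
    identical_pairs = sum(c * (c - 1) for c in counts.values())
    return identical_pairs / (n * (n - 1))


# Four chromosomes carrying the same core allele (0), and two distant
# sites given in order of increasing distance from the core.
core = [0, 0, 0, 0]
distant_sites = [[0, 0, 0, 1], [0, 1, 0, 1]]

haplotypes = [(c,) for c in core]
decay = []
for site in distant_sites:
    # Extend each haplotype with the allele found at the new distant site.
    haplotypes = [h + (a,) for h, a in zip(haplotypes, site)]
    decay.append(ehh(haplotypes))

print(decay)
```

Because haplotypes can only split as sites are added, EHH never increases with distance, which is why distant sites must be loaded in order of increasing distance from the core.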
Current number of haplotypes (None if core has not been set, equal to num_haplotypes if no distant site has been loaded).
Return True if the values of IHH for all core haplotypes have completed integrating (and all dEHH values have been evaluated).
Return True if the values of IHHc for all core haplotypes have completed integrating (and all dEHH values have been evaluated).
Get the EHH value for the last processed distant site for core haplotype i. Return None if the value cannot be computed (no available samples).
Get the EHHS (computed with genotypes) value for the last processed distant site. Return None if the value cannot be computed (no available samples, no unphased option used).
Get the EHHS value for the last processed distant site. Return None if the value cannot be computed (no available samples).
Get the EHHc value for the last processed distant site for core haplotype i. Return None if the value cannot be computed (no available samples).
Get the IHH value for the last processed distant site for core haplotype i. Return None if the value cannot be computed (no available samples).
Get the IHHc value for the last processed distant site for core haplotype i. Return None if the value cannot be computed (no available samples).
Get the EHH decay distance for core haplotype i. Return None if the EHH threshold has not been reached.
Get the EHHS (computed with genotypes) decay distance. Return None if the EHHG threshold has not been reached.
Get the EHHS decay distance. Return None if the EHHS threshold has not been reached.
Get the maximum EHH decay distance across core haplotypes. Return None if the EHH threshold has not been reached for at least one of the core haplotypes.
Get the average EHH decay distance across core haplotypes. Return None if the EHH threshold has not been reached for at least one of the core haplotypes.
Get the EHHc decay distance for core haplotype i. Return None if the EHHc threshold has not been reached.
Get the iES (computed with genotypes) value for the last processed distant site. Return None if the value cannot be computed (no available samples or unphased option was not used).
Get the iES value for the last processed distant site. Return None if the value cannot be computed (no available samples).
Get the iHS value for the last processed distant site for core haplotype i. Return None if the ratio cannot be computed (no available sample or division by zero).
Get the rEHH value for the last processed distant site for core haplotype i. Return None if the ratio cannot be computed (no available sample or division by zero).
Process a distant site. The core site must have been specified.
Parameters: 


Current number of nonmissing samples (total).
Number of nonmissing samples for one of the core haplotypes.
Number of nonmissing samples for one of the current haplotypes.
Number of core haplotypes taken into consideration (None if core has not been set).
Specify the core region haplotypes, as a Site instance, and set parameters. If the instance already contains data, it will be reset.
Parameters: 


Estimate the misorientation probability from a user-provided set of sites with a fixed outgroup. Only sites that are either variable within the ingroup or have a fixed difference with respect to the outgroup are considered. Sites with more than two different alleles in the ingroup, or more than one allele in the outgroup, are ignored.
This function is an implementation of the method mentioned in Baudry and Depaulis (2003), which estimates the probability that a site oriented using the provided outgroup has been misoriented due to a homoplasic mutation in the branch leading to the outgroup. Note that this estimation neglects the probability of shared polymorphism.
Reference: Baudry, E. & F. Depaulis. 2003. Effect of misoriented sites on neutrality tests with outgroup. Genetics 165: 1619-1622.
Parameters:  align – an Align containing the sites to analyse. 

If the instance is created with an alignment as constructor argument, then the statistics are computed. The method load_align() does the same from an existing ProbaMisoriented instance. Otherwise, individual sites can be loaded with load_site(), and then the statistics can be computed using compute() (the latter is preferable if Freq instances are generated for another use).
New in version 3.0.0.
Number of loaded sites with a fixed difference with respect to the outgroup.
Number of loaded polymorphic sites (within the ingroup).
Ratio of transition to transversion rates. None if the value cannot be computed (no loaded data or null transversion rate). Requires that compute() has been called.
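As a reminder of the underlying distinction, the sketch below classifies nucleotide changes as transitions (purine to purine or pyrimidine to pyrimidine) or transversions and returns a naive ratio of counts. This count ratio is an assumption made for illustration and may differ from the rate-based estimate computed by this class; the function names are hypothetical.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def is_transition(base1, base2):
    """Return True for a transition, False for a transversion."""
    b1, b2 = base1.upper(), base2.upper()
    if b1 == b2:
        raise ValueError("the two bases are identical")
    if not {b1, b2} <= PURINES | PYRIMIDINES:
        raise ValueError("only A, C, G and T are supported")
    return {b1, b2} <= PURINES or {b1, b2} <= PYRIMIDINES


def titv_count_ratio(changes):
    """Naive ratio of transition to transversion counts (None if undefined)."""
    ti = sum(1 for a, b in changes if is_transition(a, b))
    tv = len(changes) - ti
    return None if tv == 0 else ti / tv


changes = [("A", "G"), ("C", "T"), ("A", "C")]
ratio = titv_count_ratio(changes)
print(ratio)
```

Returning None when no transversion is observed parallels the "null transversion rate" case mentioned above.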
Compute pM and TiTv statistics. Requires that sites have been loaded using load_site(). This method does not reset the instance.
Load all sites of align that meet criteria. If there are previously loaded data, they are discarded. This method computes statistics. Data are required to be DNA sequences.
Parameters:  align – an Align instance. 

Load a single site. If there are previously loaded data, they are retained. To actually compute the misorientation probability, the user must call compute().
Parameters:  site – a Freq instance. 

Probability of misorientation. None if the value cannot be computed (no loaded data, no valid polymorphism, null transversion rate). Requires that compute() has been called.
Clear all loaded or computed data.
The class ComputeStats allows computing all available statistics repeatedly on several loci and is the preferred way to perform diversity analyses.
This class allows customizable and efficient analysis of diversity data. It is designed to minimize redundancy of underlying analyses so it is best to compute as many statistics as possible with a single instance. It also takes advantage of the object reuse policy, improving the efficiency of analyses when several (and especially many) datasets are examined in a row by the same instance of ComputeStats.
The constructor takes arguments that are automatically passed to the configure() method.
Statistics to compute are set using the method add_stats() which allows specifying several statistics at once and which can also be called several times to add more statistics to compute.
Add one or more statistics to compute. Every statistic identifier must be among the list of available statistics, regardless of what data is to be analyzed. If statistics cannot be computed, they will be returned as None. Also reset all currently computed statistics (if any).
Add all possible statistics. Those that cannot be computed will be reported as None. Also reset all currently computed statistics (if any).
Clear the list of statistics to compute. Note that this automatically resets the instance and clears the already computed stats.
Configure the instance. The values provided for parameters will affect all subsequent analyses.
Parameters: 


Returns a list of tuples giving, for each available statistic, its code and a short description.
Analyze an alignment.
Parameters: 


Returns:  A dictionary of statistics, unless multi is set. 
Analyze a site based on already computed frequencies.
Parameters:  

Params alleles:  list of allele values (as integers). Only used if statistic V is required and ignored otherwise. Statistic V is not computed if this argument is not specified. The length of the list must match the total number of alleles. 
Returns:  A dictionary of statistics, unless no_return is set. 
Analyze a site.
Parameters:  

Struct:  a Structure instance describing population structure. Warning: for Fst, Kst, and Snn statistics, the structure must be passed again to results(). By default, doesn’t apply structure. 
Returns:  A dictionary of statistics, unless no_return is set. 
Analyze a list of sites.
Parameters: 


Struct:  a Structure instance describing population structure. Warning: for Fst, Kst, and Snn statistics, the structure must be passed again to results() if several sets of sites/alignments are combined. By default, doesn’t apply structure. 
Returns:  A dictionary of statistics, unless multi is set. 
Reset all currently computed statistics (but keep the list of statistics to compute).
Return the value of statistics for all sites since the last call to this method, to reset(), or any addition of statistics, or the object creation, whichever is most recent. For statistics that can not be computed, None is returned.
Note
This method never computes statistics linked to linkage disequilibrium (including statistic rD) because those statistics can only be computed if all sites are available at the same time (which is not guaranteed). Those statistics cannot be computed if process_site() is used or if process_sites() is used with the option multi set to True.
Parameters: 


The statistics available for using with ComputeStats are listed below:
Code  Description  Per site  Per region  Whole sample  Per pop  Per pair 

Aing  Number of alleles in ingroup  NA  NA  NA  NA  NA 
Aotg  Number of alleles in outgroup  NA  NA  NA  NA  NA 
As  Number of singleton alleles  NA  NA  NA  NA  NA 
Asd  Number of singleton alleles (derived)  NA  NA  NA  NA  NA 
Atot  Number of alleles in whole dataset  NA  NA  NA  NA  NA 
B  Wall’s B statistic  NA  NA  NA  NA  NA 
Ch  Ramos-Onsins and Rozas’s Ch (using singletons)  NA  NA  NA  NA  NA 
ChE  Ramos-Onsins and Rozas’s ChE (using external singletons)  NA  NA  NA  NA  NA 
D  Tajima’s D  NA  NA  NA  NA  NA 
Da  Net pairwise distance (if two populations)  NA  NA  NA  NA  NA 
Deta  Tajima’s D using eta instead of S  NA  NA  NA  NA  NA 
Dfl  Fu and Li’s D  NA  NA  NA  NA  NA 
Dj  Jost’s D  NA  NA  NA  NA  NA 
Dstar  Fu and Li’s D*  NA  NA  NA  NA  NA 
Dxy  Pairwise distance (if two populations)  NA  NA  NA  NA  NA 
E  Zeng et al.’s E  NA  NA  NA  NA  NA 
F  Fu and Li’s F  NA  NA  NA  NA  NA 
Fis  Inbreeding coefficient  NA  NA  NA  NA  NA 
Fs  Fu’s Fs  NA  NA  NA  NA  NA 
Fst  Hudson’s Fst  NA  NA  NA  NA  NA 
Fstar  Fu and Li’s F*  NA  NA  NA  NA  NA 
Gst  Nei’s Gst  NA  NA  NA  NA  NA 
Gste  Hedrick’s Gst’  NA  NA  NA  NA  NA 
He  Expected heterozygosity  NA  NA  NA  NA  NA 
Hi  Inter-individual heterozygosity  NA  NA  NA  NA  NA 
Hns  Fay and Wu’s H (unstandardized)  NA  NA  NA  NA  NA 
Ho  Observed heterozygosity  NA  NA  NA  NA  NA 
Hsd  Fay and Wu’s H (standardized)  NA  NA  NA  NA  NA 
Hst  Hudson’s Hst  NA  NA  NA  NA  NA 
K  Number of haplotypes  NA  NA  NA  NA  NA 
Ke  Number of haplotypes (only ingroup)  NA  NA  NA  NA  NA 
Kst  Hudson’s Kst  NA  NA  NA  NA  NA 
Pi  Nucleotide diversity  NA  NA  NA  NA  NA 
Q  Wall’s Q statistic  NA  NA  NA  NA  NA 
R  Allelic richness  NA  NA  NA  NA  NA 
R2  Ramos-Onsins and Rozas’s R2 (using singletons)  NA  NA  NA  NA  NA 
R2E  Ramos-Onsins and Rozas’s R2E (using external singletons)  NA  NA  NA  NA  NA 
R3  Ramos-Onsins and Rozas’s R3 (using singletons)  NA  NA  NA  NA  NA 
R3E  Ramos-Onsins and Rozas’s R3E (using external singletons)  NA  NA  NA  NA  NA 
R4  Ramos-Onsins and Rozas’s R4 (using singletons)  NA  NA  NA  NA  NA 
R4E  Ramos-Onsins and Rozas’s R4E (using external singletons)  NA  NA  NA  NA  NA 
Rintervals  List of start/end positions of recombination intervals  NA  NA  NA  NA  NA 
Rmin  Minimal number of recombination events  NA  NA  NA  NA  NA 
RminL  Number of sites used to compute Rmin  NA  NA  NA  NA  NA 
S  Number of segregating sites  NA  NA  NA  NA  NA 
Snn  Hudson’s nearest neighbour statistic  NA  NA  NA  NA  NA 
So  Number of segregating orientable sites  NA  NA  NA  NA  NA 
Ss  Number of sites with only one singleton allele  NA  NA  NA  NA  NA 
Sso  Number of orientable sites with only one singleton allele  NA  NA  NA  NA  NA 
V  Allele size variance  NA  NA  NA  NA  NA 
WCisct  Weir and Cockerham for hierarchical structure  NA  NA  NA  NA  NA 
WCist  Weir and Cockerham for diploid data  NA  NA  NA  NA  NA 
WCst  Weir and Cockerham for haploid data  NA  NA  NA  NA  NA 
Z*nS  Kelly et al.’s Z*nS  NA  NA  NA  NA  NA 
Z*nS*  Kelly et al.’s Z*nS*  NA  NA  NA  NA  NA 
ZZ  Rozas et al.’s ZZ  NA  NA  NA  NA  NA 
Za  Rozas et al.’s Za  NA  NA  NA  NA  NA 
ZnS  Kelly et al.’s ZnS  NA  NA  NA  NA  NA 
eta  Minimal number of mutations  NA  NA  NA  NA  NA 
etao  Minimal number of mutations at orientable sites  NA  NA  NA  NA  NA 
lseff  Number of analysed sites  NA  NA  NA  NA  NA 
lseffo  Number of analysed orientable sites  NA  NA  NA  NA  NA 
nM  Number of sites available for MFDM test  NA  NA  NA  NA  NA 
nPairs  Number of allele pairs used for ZnS, Z*nS, and Z*nS*  NA  NA  NA  NA  NA 
nPairsAdj  Allele pairs at adjacent sites (used for ZZ and Za)  NA  NA  NA  NA  NA 
ns_site  Number of analyzed samples per site  NA  NA  NA  NA  NA 
nseff  Average number of exploitable samples  NA  NA  NA  NA  NA 
nseffo  Average number of exploitable samples at orientable sites  NA  NA  NA  NA  NA 
nsingld  Number of derived singletons  NA  NA  NA  NA  NA 
nsmax  Maximal number of available samples per site  NA  NA  NA  NA  NA 
nsmaxo  Maximal number of available samples per orientable site  NA  NA  NA  NA  NA 
numFxA  Number of fixed alleles  NA  NA  NA  NA  NA 
numFxA*  Sites with at least 1 fixed allele  NA  NA  NA  NA  NA 
numFxD  Number of fixed differences  NA  NA  NA  NA  NA 
numFxD*  Sites with at least 1 fixed difference  NA  NA  NA  NA  NA 
numShA  Number of shared alleles  NA  NA  NA  NA  NA 
numShA*  Sites with at least 1 shared allele  NA  NA  NA  NA  NA 
numShP  Number of shared segregating alleles  NA  NA  NA  NA  NA 
numShP*  Sites with at least 1 shared segregating allele  NA  NA  NA  NA  NA 
numSp  Number of population-specific alleles  NA  NA  NA  NA  NA 
numSp*  Sites with at least 1 population-specific allele  NA  NA  NA  NA  NA 
numSpd  Number of population-specific derived alleles  NA  NA  NA  NA  NA 
numSpd*  Sites with at least 1 population-specific derived allele  NA  NA  NA  NA  NA 
pM  P-value of MFDM test  NA  NA  NA  NA  NA 
rD  R_bar{d} statistic  NA  NA  NA  NA  NA 
singl  Index of sites with at least one singleton allele  NA  NA  NA  NA  NA 
singl_o  Index of sites with at least one singleton allele  NA  NA  NA  NA  NA 
sites  Index of polymorphic sites  NA  NA  NA  NA  NA 
sites_o  Index of orientable polymorphic sites  NA  NA  NA  NA  NA 
thetaH  Fay and Wu’s estimator of theta  NA  NA  NA  NA  NA 
thetaIAM  Theta estimator based on He & IAM model  NA  NA  NA  NA  NA 
thetaL  Zeng et al.’s estimator of theta  NA  NA  NA  NA  NA 
thetaPi  Pi using orientable sites  NA  NA  NA  NA  NA 
thetaSMM  Theta estimator based on He & SMM model  NA  NA  NA  NA  NA 
thetaW  Watterson’s estimator of theta  NA  NA  NA  NA  NA 
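As a worked illustration of one entry in the table, Watterson's estimator divides the number of segregating sites S by a_n, the sum of 1/i for i from 1 to n-1, where n is the number of samples. The sketch below is plain Python and does not call the library. The function name watterson_theta is hypothetical, and whether the library reports this value per gene or per site is not stated here.

```python
def watterson_theta(num_segregating, num_samples):
    """Watterson's estimator: S / a_n, with a_n = sum of 1/i for i in 1..n-1."""
    if num_samples < 2:
        raise ValueError("at least two samples are required")
    a_n = sum(1.0 / i for i in range(1, num_samples))
    return num_segregating / a_n

# 11 segregating sites in 4 samples: a_n = 1 + 1/2 + 1/3 = 11/6, so theta is about 6
theta = watterson_theta(num_segregating=11, num_samples=4)
```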
Describe the organisation of samples in individuals, populations, and clusters of populations. The structure is necessarily hierarchical and all levels are always defined, but any level can be bypassed if information for that level is unavailable or irrelevant. The number of individuals per population can vary, but the number of samples per individual (that is, the ploidy) must be constant.
New objects are created using the functions get_structure() and make_structure(), and old objects can be recycled with the corresponding methods get_structure() and make_structure().
Return a tuple of two dict representing, respectively, the ingroup and outgroup structure.
The ingroup dictionary is a threefold nested dictionary (meaning it is a dictionary of dictionaries of dictionaries) holding lists of sample indexes. The keys are, respectively, cluster, population, and individual labels. Based on how the instance was created, there may be just one item or even none at all in any dictionary. In practice, if d is the ingroup dict and clt, pop, and idv are, respectively, cluster, population, and individual labels, the expression d[clt][pop][idv] will yield a list of sample indexes.
The outgroup dictionary is a non-nested dictionary with individual labels as keys and lists of sample indexes as values.
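The layout described above can be sketched with hand-written dictionaries. This is an illustration only: the labels and sample indexes are made up, and the helper iter_ingroup_samples is hypothetical and not part of the library.

```python
# Hand-written dictionaries in the format described above (illustrative values only).
ingroup = {
    0: {                                # cluster label
        0: {0: [0, 1], 1: [2, 3]},      # population 0 holds individuals 0 and 1
        1: {2: [4, 5]},                 # population 1 holds individual 2
    },
    1: {
        2: {3: [6, 7], 4: [8, 9]},      # population 2 holds individuals 3 and 4
    },
}
outgroup = {5: [0, 1]}                  # individual label -> sample indexes

def iter_ingroup_samples(d):
    """Yield (cluster, population, individual, sample indexes) for every individual."""
    for clt, pops in d.items():
        for pop, indivs in pops.items():
            for idv, samples in indivs.items():
                yield clt, pop, idv, samples

rows = list(iter_ingroup_samples(ingroup))
samples_of_indiv_2 = ingroup[0][1][2]   # d[clt][pop][idv] gives a list of sample indexes
```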
Reset the instance as if it was built using get_structure(). The definitions of arguments are identical.
Return a new Structure instance describing the organisation of individuals in clusters and populations (ignoring the intra-individual level) using the rank of individuals as indexes, the individuals being ranked in the order of increasing cluster and population indexes.
Reset the instance as if it was built using make_structure(). The definitions of arguments are identical.
Number of processed outgroup samples.
Number of processed ingroup samples.
Number of clusters.
Total number of ingroup individuals.
Number of outgroup individuals.
Total number of populations.
Ploidy.
Required number of outgroup samples in objects using this structure (equal to the largest outgroup sample index plus one).
Required number of ingroup samples in objects using this structure (equal to the largest ingroup sample index plus one).
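The difference between the number of processed samples and the required number of samples can be shown on dictionaries in the nested format described earlier. This is a sketch in plain Python: the helper name largest_index_plus_one and the variable names are hypothetical, and returning 0 when there are no samples is an assumption.

```python
def largest_index_plus_one(sample_lists):
    """Return the largest sample index plus one (0 if there are no samples: an assumption)."""
    indexes = [i for samples in sample_lists for i in samples]
    return max(indexes) + 1 if indexes else 0

# Illustrative structure: sample indexes 2 and 3 are deliberately left unused.
ingroup = {0: {0: {0: [0, 1], 1: [4, 5]}}}
outgroup = {2: [0, 1]}

ingroup_lists = [s for pops in ingroup.values()
                 for indivs in pops.values()
                 for s in indivs.values()]

processed_ingroup = sum(len(s) for s in ingroup_lists)        # samples actually referenced
required_ingroup = largest_index_plus_one(ingroup_lists)      # largest index plus one
required_outgroup = largest_index_plus_one(outgroup.values())
```

Here only four ingroup samples are referenced, but an object using this structure would need at least six, because the largest index is 5.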
Create a new Structure instance based on the group labels of an Align or Container instance.
Parameters: 


Returns:  A new Structure. 
Create a new Structure instance based on the structure provided as dictionaries. The two arguments must match the format of the return value of the Structure.as_dict() method (see the documentation). Either argument can be replaced by None, which is equivalent to an empty dictionary (no samples). Note that all keys must be positive integers.
Parameters: 


Returns:  A new Structure. 
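One way to prepare the ingroup argument is to assemble the nested dictionary from flat records. The sketch below is plain Python and stops short of calling the library. The helper build_ingroup_dict is hypothetical, and accepting 0 as a valid key is an assumption, since the text only says keys must be positive integers.

```python
def build_ingroup_dict(records):
    """Build {cluster: {population: {individual: [sample indexes]}}} from
    (cluster, population, individual, sample index) tuples."""
    ingroup = {}
    for clt, pop, idv, sample in records:
        for value in (clt, pop, idv, sample):
            # Assumption: 0 is accepted; negative or non-integer values are rejected.
            if not isinstance(value, int) or value < 0:
                raise ValueError(f"invalid label or index: {value!r}")
        ingroup.setdefault(clt, {}).setdefault(pop, {}).setdefault(idv, []).append(sample)
    # The ploidy (number of samples per individual) must be constant.
    ploidies = {len(samples)
                for pops in ingroup.values()
                for indivs in pops.values()
                for samples in indivs.values()}
    if len(ploidies) > 1:
        raise ValueError("ploidy must be constant across individuals")
    return ingroup

records = [(0, 0, 0, 0), (0, 0, 0, 1),
           (0, 0, 1, 2), (0, 0, 1, 3),
           (0, 1, 2, 4), (0, 1, 2, 5)]
ingroup = build_ingroup_dict(records)
```

The resulting dictionary has the shape expected for the first of the two arguments; the outgroup argument would be a flat dictionary of individual labels to sample index lists, or None.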
This class generates filters for the analysis of allele values:
Holds lists of valid (exploitable and missing) data codes. The constructor can instantiate the class immediately with all desired information. It is also possible to call the modifier methods to add information afterwards. Note that this module provides a few predefined instances with common settings that the user may use and possibly extend.
Note
If no codes are specified (which is the default behaviour if no arguments are specified), the filter will treat all codes as acceptable.
Parameters: 


Add exploitable values. See the class documentation for more details.
Add missing values. See the class documentation for more details.
Add a range of exploitable data. See the class documentation for more details.
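The behaviour described for this class can be mimicked with a small stand-in written in plain Python. This is not the library's implementation: the class ToyFilter, its method names, its signatures, the inclusive range bounds, and the choice to raise ValueError for unlisted codes are all assumptions made for illustration.

```python
class ToyFilter:
    """Illustrative stand-in holding exploitable and missing data codes."""

    def __init__(self, exploitable=(), missing=(), case_insensitive=False):
        self._exploitable = set()
        self._missing = set()
        self._case_insensitive = case_insensitive
        self.add_exploitable(exploitable)
        self.add_missing(missing)

    def _normalize(self, code):
        # Map lowercase characters to uppercase when the filter is case-insensitive.
        if self._case_insensitive and isinstance(code, str):
            return code.upper()
        return code

    def add_exploitable(self, codes):
        self._exploitable.update(self._normalize(c) for c in codes)

    def add_missing(self, codes):
        self._missing.update(self._normalize(c) for c in codes)

    def add_exploitable_range(self, first, last):
        # Assumption: both bounds are included.
        self._exploitable.update(range(first, last + 1))

    def classify(self, code):
        # With no codes specified at all, every code is treated as acceptable.
        if not self._exploitable and not self._missing:
            return "exploitable"
        code = self._normalize(code)
        if code in self._exploitable:
            return "exploitable"
        if code in self._missing:
            return "missing"
        raise ValueError(f"invalid code: {code!r}")  # assumption: unlisted codes are rejected

# A case-insensitive DNA-like filter and a numerical codon-like filter.
dna = ToyFilter("ACGT", "RYSWKMBDHVN?", case_insensitive=True)
codon = ToyFilter(missing=[64])
codon.add_exploitable_range(0, 63)
```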
Predefined instances are available. They are all extendable:
All possible allele values are accepted as valid. This filter is case-sensitive.
Used for DNA sequences (case-independent). Exploitable: ACGT. Missing: RYSWKMBDHVN?. Lowercase characters are mapped to the corresponding uppercase character.
Used for RNA sequences (case-independent). Exploitable: ACGU. Missing: RYSWKMBDHVN?. The IUPAC codes for missing data are identical in meaning to those for DNA sequences. Lowercase characters are mapped to the corresponding uppercase character.
Paranoid filter for DNA sequences. Exploitable: ACGT. Missing: N. Lowercase characters are not allowed.
Used for strictly positive SSR alleles. Exploitable: range 1–999. Missing: -1.
Used for numerical values. Exploitable: range -999–999. Missing: none.
Used for numerical codon representation. Exploitable: range 0–63. Missing: 64.
Used for amino acid sequences. Exploitable: ACDEFGHIKLMNPQRSTVWY. Missing: X. Lowercase characters are mapped to the corresponding uppercase character.
The following three variables determine how linkage disequilibrium is computed for a pair of sites of which at least one has more than two alleles.
Skip pairs of sites when at least one has more than two alleles.
If more than two alleles, use the most frequent one only.
If more than two alleles, use all possible pairs of alleles.
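The three options can be pictured as three ways of choosing which allele pairs enter the computation. The sketch below is plain Python under one reading of the descriptions above: the constants SKIP, USE_MAJOR, and USE_ALL, and the function allele_pairs, are hypothetical names, and the handling of biallelic pairs (one pair built from the most frequent allele of each site) is an assumption.

```python
from collections import Counter
from itertools import product

# Hypothetical stand-ins for the three variables described above.
SKIP, USE_MAJOR, USE_ALL = "skip", "major", "all"

def allele_pairs(site1, site2, multiallelic=SKIP):
    """Return the (allele at site 1, allele at site 2) pairs to consider for LD."""
    counts1, counts2 = Counter(site1), Counter(site2)
    major_pair = (counts1.most_common(1)[0][0], counts2.most_common(1)[0][0])
    if len(counts1) <= 2 and len(counts2) <= 2:
        return [major_pair]              # assumption: one pair suffices when both sites are biallelic
    if multiallelic == SKIP:
        return []                        # the pair of sites is skipped
    if multiallelic == USE_MAJOR:
        return [major_pair]              # most frequent allele of each site only
    if multiallelic == USE_ALL:
        return list(product(sorted(counts1), sorted(counts2)))  # every allele combination
    raise ValueError(f"unknown option: {multiallelic!r}")

site1 = [0, 0, 0, 1, 2, 0]   # three alleles
site2 = [0, 1, 1, 1, 0, 1]   # two alleles
skipped = allele_pairs(site1, site2, SKIP)
major_only = allele_pairs(site1, site2, USE_MAJOR)
all_pairs = allele_pairs(site1, site2, USE_ALL)
```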