Here is a complete history of EggLib tracing back to initial pure C++ version.
3.0.0b8 – 2016-07-17
stats.filter_nucl is renamed stats.filter_dna.
Refactoring of the stats module:
- The class SiteFrequency was inherently ambiguous, so it is replaced by Site and Freq which help clarify the design. The stats module provides methods to instanciate both directly from user-provided data, Align, or each other.
- The interface of Structure is modified. The previous design was also exceedingly flexible, thereby confusing. Now Structure is required to have all levels defined (clusters, populations, and individuals) but it is possible to bypass them (place all populations in a single cluster, all individuals in a single population, or, to make haploid data, make individuals with a single item each). To method used to create a Structure are moved to the level of the egglib.stats module (egglib.stats.make_structure() and egglib.stats.get_structure()). The former is equivalent to Structure.from_dict() but you need to specify a single dictionary for all data. It is still tricky to use but helpers methods will be added to hide the complexity.
- ComputeStats is also modified accordingly. The changes should be less significant but they can be still annoying if you have code running. ComputeStats.add_stat() is renamed as ComputeStats.add_stats() (and it allows you to pass several statistics names). The structure and the filter must be passed as argument to ComputeStats.process_align() and not ComputeStats.configure(). This method now always compute average of statistics. To get per-site statistics, you must call ComputeStats.process_site() for all sites. This method ComputeStats.process_site() and ComputeStats.process_freq() can compute statistics from individual sites, and there is also ComputeStats.process_sites() that can process a list of sites. All of those methods take a no_return argument that allows you to process several sites/alignments before computing statistics over all of them.
3.0.0b7 – 2016-05-11
- The method ComputeStats.process_site() was ignoring allele status (number of fixed alleles, etc.) when requested. Thanks to Tatum Mortimer for reporting this bug.
- The “number of fixed differences” statistic was incorrectly named. It actually corresponded to the number of fixed alleles. A fixed difference between a pair of populations is when population 1 is fixed for allele A and population 2 is fixed for allele B, and this accounts for two fixed alleles. Now there are two statistics: numF (number of fixed differences, that is when one allele is fixed in one population and another allele is fixed in the other population), and numFA (number of fixed alleles, which counts all cases when one allele is fixed in a population but absent in the other, regardless of whether the other population is polymorphic).
- The method Simulator.simul() was not actually making a deep copy of the simulated data object, causing an error if the simulator was deleted and the Align deleted (the data could be overwritten), or if new simulations were run. Now a deep copy is made as described in the documentation.
- The create() method of Align and Container did not get outgroup samples.
3.0.0b6 – 2016-05-04
- The bug ComputeStats.process_align() in the previous version is fixed.
- The value of Gst, Gste, and Hst was incorrect. In fact, the correct value could be computed as one minus the reported values for all three statistics in the previous version.
- The PhyML wrapper was not compatible with earlier versions of PhyML (starting from 3.2). The wrapper is now tolerant regarding the .txt extension of output file of the program.
- Installation method for MacOSX is updated. The previous method would overwrite permissions and owner of previously existing directory (which is a problem since the full path of the EggLib module was included in this archive). A, probably, worse problem is that this method made assumptions over the location of the Python installation. The new method is an ad hoc script which manually installs the module in a hopefully appropriate site-package directory. Feedback is welcome.
- Both Align.encode() and Align.rename() (applies also to the equivalent methods of Container) support an argument to include the outgroup samples.
- Align.rename() and Container.rename() return the number of rename operations.
- Added an outgroup option to coalesce.Simulator to automatically move a given population to the outgroup.
- Added stats.ParamList.mk_structure() method.
- Few corrections in the documentation of options to the coalesce module.
- Removed the population-to-individuals flag of Structure (now it is as if it were always True when appropriate).
3.0.0b5 – 2016-04-20
It is now possible to pass SiteArray instances to ComputeStats.process_align(). However this caused a bug that prevents ComputeStats.process_align() to work properly if a Structure is passed. To work around, first call ComputeStats.set_structure() with the Structure object than then ComputeStats.process_align() without the alignment only.
3.0.0b4 – 2016-04-13
3.0.0b3 – 2016-03-22
3.0.0b2 – 2016-03-18
3.0.0b1 – 2016-03-18
The Python module is completed. EggLib 3 is now in beta mode and bugs are being fixed while missing functionalities are being implemented.
3.0.0a – 2014-09-23
Preliminary (alpha, for testing purpose only) release of the version 3. This package contains the C++ new library and a stub Python package providing the updated Align and Container classes and an executable module implementing the coalescence simulator coalesce.
A large number of changes have been introduced when moving from version 2 to version 3. While functionalities have been extended, a lot of changes aim to improve efficiency.
The C++ library has been extensively rewritten, essentially to improve efficency.
- It is not possible to pass a file name to the constructor to initialize the object from a Fasta file. One must now use the function io.from_fasta().
- The interface classes that manage access to data are extended to manage sequences and list of group labels. They are named SampleView, SequenceView, and GroupView.
- Data items are always integers, but input as ASCII strings is allowed, and some methods are designed to export strings.
- There is nore a direct polymorphism() or polymorphismBPP() method. One must use the stats module.
- There is a single to_fasta() method allows to either generate a Fasta-formatted string or write it to a file.
- A bunch of new methods are added.
Diversity statistics are included in a new stats module which is designed to maximize object reuse (therefore improving efficiency). At the moment, a class named ComputeStats manages most statistics.
All wrappers are designed as function (but for the moment, only a few are implemented). The paths are managed by a dedicated class behaving like a dictionnary that supports both runtime and permanent specification of paths to run external paths.
The coalescence simulator is also extensively changed.
- A single class is proposed to manage all parameters and simulations (ComputeStats).
- Replications are now more efficient, especially if the method ComputeStats.iter_simul() is used. It is also possible to compute statistics automatically from simulated datasets and to change parameters between repetitions.
- New features are included (such as delayed samples and change of recombination rate during simulations).
- Some historical events are removed and the number of populations is required to be constant during a simulation (making indexing of populations more logical if events occur), but all models that could be implemented before can still be implemented using given combinations of currently available features.
For the moment, the modules fitmodel and utils are not updated.
Fixed a bug in eggcoal that caused an exception, with error messages stating that EggLib was unable to open (actually, in that case, create) a file.
Ported to Bio++ 2.2.0. The new version is not compatible with previous versions of Bio++: the management of alphabets and genetic codes is modified.
In ParamSet (of the C++ library): the method reset() previously restored objects to 0 population (instead of 1).
Bug fix: the staden() parser (and consequently the staden2fasta command) had an error that shifted sequences that would start after the first sequence finished.
This is bug fix release fixing the following major problem that affected everyone using the summary statistics sets TPS, TPF and TPK (chiefly using abc_sample). The error was that the program used population Pi for the last locus only (ignoring all previous ones). The three summary statistics sets are fixed.
This version fixes the following minor problems:
- eggstats: fixed two missing colons in program output (for Bio++ stats).
- The archive egglib-htmldoc-2.1.6.tar.gz was actually a bzip2 archive.
- egglib-cpp’s configure script has been modified to detect more consistently the GSL library. If you have trouble to get it detected, please contact us. (Thanks to Jérôme Gouzy.)
- The setup.py script takes clags=X and lflags=Y arguments to add X and Y as extra compile and link flags to compilation command lines.
There was a more serious problem in tools and polymorphism analysis: there was a problem with genetic code specification–the code argument was ignored in some cases.
egglib.cpp is modified to support Bio++ version 2.1.0.
This version makes the following minor changes:
- [backalign] tools.backalign() does not crop stop codons out of coding sequences any more.
- [codalign] the codalign command takes a flag to prevent cropping stop codons out of coding sequences.
- [fitmodel] the demographic models all accept a random object in order to control the random number chain (in the generate function)
This version also corrects the following bugs or errors:
- [fitmodel] the documentation of the ABC model SM had incorrect parameter order THETA, DATE, MIGR, [RHO] (correct is THETA, MIGR, DATE, RHO)
- [utils] the seeds argument of ABC simulation commands did not control the random generator objects used by demographic models
This version fixes the following serious bug:
- [diversity] the Fst/Kst/Gst/Hst/Snn statistics might be computed incorrectly if outgroup sequence were not placed at the end of the file (thanks to Emmanuel Reclus).
This version fixes the following minor bugs:
- [Codeml] the wrapper was failing to import site probability for models M1a, M2a, M8a and M8 if the reference was a gap (if the first position reference was a gap, a crash occurred; otherwise, the site probability table was truncated from the first gap position and on) (thanks to Nathalie Chantret).
- [matcher] a ValueError was fixed.
This version makes the following minor changes:
- [Random] the seed1 and seed2 getters become const.
- [Codeml] the wrapper now exports a np key (the number of parameters).
- [fitmodel] a new prior type is added (PriorParser).
This version fixes the following bugs:
- [fitmodel, abc_sample] the statistics set TPF was repaired (it is also modified compared to its previous definition).
- [Align.phylip, wrappers.nj] the phylip converter of Align had a bug and has been repaired and rewritten.
- [tools] a non-ASCII character was accidentally inserted in a comment in tools.py, preventing the package to load on at least some systems.
This version fixes the following bugs:
- [eggstats] the option groups was ignored (the default value was always used).
- [SitePolymorphism, data.Align.polymorphism, eggstats, etc.] non polymorphic sites were not considered as orientable: as a result, the number of orientable sites was always incorrectly reported as <= S.
- [fitmodel, abc_sample] model AM was incorrectly implemented, leading to invalid results.
This version incorporates the following improvements:
- [eggstats] the option outgroup is added, as well as a few statistics.
- [fitmodel, abc_sample] added summary statistics set SDZ
Note on interface changes:
- [eggstats] one additional option.
- [eggstats] if you parse eggstats’s output, beware that statistics have been added, the order is changed and some statistics might be skipped if you set the groups option to no.
This version fixes a single bug: in eggcoal, the default number of threads could be smaller than the number of CPUs under some conditions. The links are updated following the move from the seqlib to egglib sourceforge project.
Version 2.1.0 is a preliminary version of the 2.1 release that will include an additional round of interface-changing changes. The changes listed below are mostly bug-fixes.
- Align and Container method find() now returns None instead of -1 when the specified name is not found.
- There were a few mistakes in the documentation included in the file apps.conf.ini.
- In the documentation of the command ungap, the word “newick” was incorrectly used instead of “fasta” (when specifying the format of the input file).
- Some other minor documentation fixes.
- The documentation of the Align method matrixLD() has been completed.
- The method coalesce now returns ~egglib.SSR instances instead of ~egglib.Align if the number of alleles specified in the mutator if above 4.
- A flag forceSSR is added to the method coalesce.
- All classes of the data module are converted to new-style classes.
- In ~egglib.SSR, when using the load method, population labels were not changed to strings.
- ~egglib.SSR improvements: addition of a str() method and str() support (string formatting), and addition of the indiv2pop mapping data member.
- When egglib.Align.polymorphism() and egglib.Align.polymorphismBPP() are unable to compute a statistics, the corresponding key in the returned dictionary is given a None value (rather than not reporting the statistic at all).
- A check is added in ABC regression method to prevent attempting to fit data files containing model labels.
- Align.remove() in egglib-cpp was returning the length of the alignment instead of the new number of sequences.
- An error lied in the low-level Edge class of the coalescent simulator, potentially generating errors when formatting newick string from ancestral recombination graphs and, potentially, skipping some mutations.
- A tiny change is made to the error message shown by EggInvalidCharacterError.
- In the C++ library, HaplotypeDiversity.haplotypeIndex() nows performs out of bound checking.
- LinkageDisequilibrium.correl() generated invalid results due to a bug.
- tMRCA values obtained by the Ms class of egglib-cpp are changed to double type (previously, they were float, what could cause rounding shifts when accessing them from Python).
- shuffle() had a bug.
- simErrors() is not available for Container instances anymore (for which it was not working).
- The stability of SSR is improved in case of empty data sets and when importing haploid data sets.
- The stability of the parser and extractor of TIGR has been improved.
- The stability of the parser of GenBank was improved.
- The meaning of qualifiers() of egglib.GenBankFeature is changed (the previous version was incorrect).
- rc() of egglib.GenBankFeature.rc doesn’t require an argument anymore.
- Errors corrected in GenBankFeatureLocation methods to add sub-locations.
- Fixed a bug in Tree method to set branch lengths.
- Error fixed in :class:~egglib.Tree.frequency_nodes`.
- BLAST doesn’t accept containers with duplicated names anymore.
- Errors have been fixed in egglib.Tree.get_nodes_re(), egglib.TreeNode.set_branch_from() and egglib.TreeNode.set_branch_to().
- The Clustal alignment format parser in aln2fas() has been fixed and improved.
- The staden() was interpreting the fname as a Staden string. It is now possible to use both mode (read from file or from a string).
- An error was fixed in get_fgenesh().
- In Mase, only ingroup sequences are imported (previously, outgroup sequences were imported at the instance level but not in the internal Align instance. The species name (species attribute) is stripped.
- longest_orf() now takes an option to specifies the minimal length of the returned ORFs. The default value is 1 codon, meaning that single stop codons are no longer returned by default.
- Error management in rc() is slightly modified.
- ungap() now takes an option for ignoring gaps in the outgroup sequence(s).
- Bug fixed in index().
- There was a bug in motifs(): the position of reverse hits was incorrect.
- locate() returns None (instead of -1) for motifs not found.
- exon() of ReadingFrame now returns None if the position is not in an exon.
- Updater now always shows null remaining time when “done” gets larger than “expected”.
- wrap() is slightly improved.
- The ms wrapper support the “prob” line that appears in ms output when both theta and the number of segregating sites have been specified.
- The ms wrapper support the tree line(s) that appear in ms output when it has been requested, and adds a list of :class;`~egglib.Tree` instances to the returned instances under the name trees.
- BLAST wrappers are slightly improved.
- The clustalw wrapper and parser have been improved to support the current version of the program.
- clustal() and muscle() now attempt to preserve group labels and as a result no longer support duplicates in continers. They now take a nogroup flap to disable this feature.
- The following stability issues have been fixed in Codeml: regular expressions sometimes failed to catch some beta parameters; the number of classes of M8a/M8 models was incorrectly reporter as incorect when the number of categories was not default; and, for models A0, A and nW, the class did not checked that the tree has labels beforehand.
- The following stability issues have been fixed in Primer3: “primer not found” messages could occur when lower-case sequences were passed (the comparison are case-dependent - now the sequence is automatically converted to upper case), and when modifying the primer3 parameter relative to the primer first base index (previously, the class did not take this into account when locating the primer).
- The member nMutations was missing from DataMatrix instances returned by coalesce().
- The option randomAncestralState of mutators of the simul module was broken.
- Modification in eggcoal: the program takes a “suffix” option and the “prefix” option can be skipped using a backlash character. The underlying variable _fastaPath becomes _fastaPrefix for clarity.
- eggcoal is also parallelized an accept a max_threads option.
- The command abc_sample now supports parallel computing. See the max_threads option. The step option is removed.
- phyml (both function and utils command) allows to set the starting tree without fixing the topology.
- small bugs fixed in IMn, IMG, IMiG, IMiGn and DOM (with recombination) demographic models.
- The ABC summary statistics stats JFS yielded invalid results.
- The command abc_psimuls now manages simulations without mutations (they previously caused an error). Missing statistics (such as those that are undefined when no polymorphism, or those that are not available) are now replaced by “None”.
- The function execute() of the utils module can be run directly to execute utils commands from python (as normal functions).
- There was a bug in command concatgb‘s default value for option “spacer”.
- Command consensus did not accept separator of length 1 (the separator must be a single character).
- The consensus() method of Align() is made more restrictive: only IUPAC characters are accepted. It returns an alignment gaps only if the gap is fixed (previously it returned a gap when there was at least one gap in the column).
- In extract_clade command, nodes that have a support value equal to the threshold were rejected instead of accepted.
- In extract_clade command, nodes that did not have labels were not supported when the threshold option is used.
- In the family command, BLAST failed when the source sequences were proteins (because the data were cleaned assuming they were nucleotides).
- In the interLD command, the output file had “file 1” twice.
- locate() is changed. Ambiguity characters are now allowed in the target sequence and, importantly, exact matches are found in priority (in order to fasten searches).
- Command staden2fasta had a bug that prevented it from reading any file.
- In the coalescence simulator, if the length of the tree is 0 (no samples), there will be no mutations regardless of the fixed number of mutations (previously, a bug occurred when a fixed number of mutations was requested with no samples).
- A copy constructor is added to Mutator (in egglib-cpp).
- A test subpackage is added to the Python package. It is included in the distributed version although it has not be designed to be routinely used by end-users (it has minimal documentation, a crude reporting system and generates local temporary files in the current directory, so it might deletes user’s files if they happen to have the same name as one of the temporary file names used). This test package helped detect most of the bugs listed above.
This version incorporates a number of minor changes:
- Small changes:
- The utils command phyml accepted an option add_model that was meaningless (and ignored). It is now removed.
- eggstats and the egglib script (or python -m egglib.utils) now reports the version number in the default manual page.
- eggcoal takes a –version or -v option to print out the version number.
- Implementation changes:
- The C++ Fasta parser now provides methods that append sequences to an existing Container.
- Fixed bugs:
- Container could not instanciate from strings.
- The clean command of egglib-py setup.py was broken and caused an error.
- The method Convert.Align() and the program eggcoal, when running with a fixed alignment length and using default mutation positions, failed to sort the mutation positions leading to either incorrect positions (they were clustered to the right-hand end of the alignment) or an error.
The change below fixes an error in the calculation of a statistic:
- Fixed an error in the calculation of triConfigurations (some patterns were counted several times).
- triConfigurations now ignores sites that have 0 sequence in either of the populations.
The changes below are fixes corresponding to crashes or errors:
- Fixed an error that prevented data.Align.polymorphismBPP from running.
- Added an inclusion to the SWIG interface that was necessary for compiling the Python module on a least one system.
- tools.Primer3 (and consequently the utils command sprimers) was broken with recent versions of the program. Now updated to primer3 version 2.2.3.
- Fixed an error that resulted in a crash when displaying help for utils commands (under Windows and source version only).
- The ABC class and the abc_fit commande were unable to compute threshold/perform rejection when at least one statistic was not variable; now they still are unable to do so, but report an informative message error.
- abc_sample (linked to a method of both Prior type) now takes an argument “force_positive” that enforces that drawn parameter values are >=0 (an error is thrown if no positive value is found after a fixed number of tries).
- Documentation of executable commands (python -m egglib.utils concat for example) caused a crash on Windows installations.
- In the coalescent simulator, the case when M=0 preventing simulations to complete was not handled properly (an incorrect error message was issued).
- The stability of wrappers.Primer3.find_primers() was improved (some errors occurred, typically with repetitive sequences where primers could be found at multiple positions in sequences).
The changes below are minor improvements:
- The function for adding models to the ABC analysis is modified. Now the model must be specified as a class with the same name as the module.
The changes below are corrections to the names of statistics reported by polymorphism():
- Polymorphisms is renamed pop_Polymorphisms.
- The following statistics are reported: pair_CommonAlleles, pair_FixedDifferences, pair_SharedAlleles, pop_SpecificAlleles, pop_SpecificDerivedAlleles.
Some statistics are now no longer returned by both polymorphism() and polymorphismBPP() depending on the values of other statistics. For example thetaW and Pi are no longer returned if lseff is 0 and D if S is 0. This is clearly documented in the documentation of both methods.
In addition, several typos were corrected in the documentation.
2.0.1. Windows pre-compiled modules - 11/04/11
- The code from the egglib script is moved to egglib.utils.execute.
- egglib.utils is executable (as an alias for the egglib script).
- egglib.utils.commands is created to hold all executable command classes.
2.0.1 - 26/04/11
New major release. The interface is modified in depth. A few of the many changes are higlighted below:
- The name of the package is changed from SeqLib to EggLib to avoid confusion with other seqlib packages in the same field.
- The C++ library is formally distinct (egglib-cpp).
- Two separate C++ programs (eggstats and eggcoal) are also separated from the rest.
- The remainder is the Python module, egglib-py, whose structure is slightly modified: toolkit becomes tools and utils functions cannot be called anymore from Python code (not easily at least).
- Classes Container, Align, Tree and GenBank are extended and improved (and their names take capitals). In particular, polymorphism analysis is performed though Align methods. They all have more powerful iteration methods. A SSR class is added.
- Additional genetic code are supported for translations.
- Ported to Bio++ version 2.
- The ABC module was rewritten, and made more easy to extend. The regression steps are performed at the C++ level and is more efficient (supports very large data files).
- Interactive commands are standardized under a common interface controlling parameter input and documentation.
- The C++ coalescent simulator is rewritten and now includes recombination, microsatellite and finite site mutation models.
- The Python interface to the C++ coalescent simulator is redesigned to make it more easy to handle.
- The extension module (binding to egglib-cpp) now uses SWIG and doesn’t require any external dynamic library.
- The building process is based on autotools for the C++ packages and on distutils for the Python package.
- Documentation using sphinx.
- Many more changes not documented: please refer to the documentation when migrating from seqlib to EggLib.
1.6 - 02/07/10
This version cumulates several bug fixes and additions. Rule H is modified (single backward compatibility change) and rule I is added. (These rules use the frequency spectrum; type $python -m seqlib.run abc_stats to know more. Note that rule I automatically implies a missing data threshold of 0.70.). Among bug fixes, a problem occurred with haplotype analysis when the outgroup was not at the last position (resulting in memory crashes and possibly in erroneous computation of statistics K, Hd and Fst estimators based on haplotypes).
1.5 - 26/11/09
More minor improvements and bug fixed. The change log is, unfortunately unavailable but notable changes are the addition of stat rule H to the ABC scheme (using the allele frequency spectrum as rejection/regression criteria) and the removal of a bug in the coalescent simulator (that led to the duplication of simulations without polymorphism under a certain combination of options).
1.4 - 24/10/09
Few minor improvements: The command abc_psimuls accepts an option “excludefixed” that allows discarding simulations with S=0 for computing the P-values of D, H and Z statistics. The rule G is changed.
1.3 - 23/10/09
One important bug fix and one addition.
BUG FIX: Migration times were incorrectly drawn in the coalescent simulator. The source code line doing that was accidently deleted!
ADDITION: addition of one set of statistics to the ABC system, allowing to use thetaW, Pi, Snn and their respective coefficient of variation in order to fit structure population models.
1.2 - 06/10/09
With respect to version 1.0, this version fixes bugs and introduces candidate features. The first bug listed led seqlib to output incorrect results. Thanks to Sonja Kujala and Thomas Källman for helping solving these problems.
- The statistics H, thetaH and Z (Fay and Wu’s test) were incorrect. H was incorrect since version 1.0 and Z was incorrect since the beginning. The error was causing a deviation or an order of ~0.1 of statistics H and Z that was consistent between simulations and computations from real data.
- The method rempos (of Align and align) did not terminate correctly sequence strings.
- The coalescent simulator used population indices starting at 0 when S was 0 and from 1 otherwise. Now indices always start at 0.
- abc_stats didn’t support fixed parameters (when min=max).
- a ‘collinear matrix’ error message was returned by abc_fit when one (or more) of the statistics where not variable within the local region. Now, abc_fit takes an argument force that forces it to proceeds to the analysis in such case (as long as at least one statistic is variable), although it is always preferable that at least as many independent statistics as the number of parameters to estimate are available.
- the pyinter class container had a method column() whose use led to a bug.
- class tree (of toolkit) enhanced with new methods, including midroot() that performs automatic rooting using the midpoint method.
- creation of class codeml.
- creation of function phyml3 (planned to replace the class phyml and using PHYML v. 3).
- creation of command picker to replace family (it is strongly advised to keep using family).
- new statistics in Polymorphism and polymorphism(), including singletons.
- member shuffle() in class container.
- argument “strict” of ``container`` classes’ method find().
- clustal() uses temporary files, allowing its use in several parallel instances of Python.
- creation of the command interLD, allowing computing linkage disequilibrium between two loci (based on haplotypes, considering all alleles), and test it by random permutations.
No information available.
1.0 - 07/06/09
The changes from version 0.8 are listed below. The list is unfortunately non-exhaustive. In particular, many small interface changes and bug fixes are not listed. The changes are grouped by subpackage:
- seqlib (top-level)
- A user manual is now included.
- The utils commands must be launched through the had-oc module seqlib.run.
- The presence of external applications is monitored by the file config.py created by setup.py at installation.
- Ported to Python 2.6 (this is now the primary target).
- The structure is changed: the library is split into core, pyinter, toolkit, and utils.
- The contents of pyinter and toolkit are both loaded both in the top seqlib namespace.
- The doxygen documentation is fixed (but some formatting troubles remain).
- The package is reorganized to fit to a correct Python module.
Errors generated in seqlib.core’s code systematically raise SeqlibException.
The previous error() flag system is removed.
- All sequences have an integer label (supposed to indicate population membership). This modification is supported by IO, Polymorphism and Coalesce.
- The internals of both classes are reimplemented, allowing better performance for data access.
- vslice(a,b) supports b>a (returns an empty alignment) & fixed bug : the groups were dismissed in all slices.
- The underlying class Sequence is removed.
- Accessors set() and get() for nucleotides.
- An undue error was raised when the last sequence was removed.
- Align::Align(unsigned int, unsigned int, char**): this function was not implemented
- fget() replaces get().
- hlice(): the interface is changed to fix the one vslice().
Added reading modes “e” and “a”.
- is completely rewritten, with minor interface changes.
- The class reads the group information from the Align objects (passed by address).
- The header is now in Polymorphism.h.
- Did not compute pread() correctly.
- pairwise() is removed; one now needs to use analyze() with group labels. a bunch of group label stats (Fst, Kst, Hst, Gst, Snn and site pattern counters) are added.
- analyze’s option outgroup removed; one needs to specify an outgrup sequence using group label 999.
- Si is removed.
- as a general rule, stats that cannot be computed and stats are set to default values (0). That concerns per-site statistics (when no analyzable sites are available), stats that require an outgroup.
- Added haplotype(), LD().
VAlign: clear() function added to VAlign.
- Options skipStatistics and saveAlignments. Storage of Align objects.
- Support for null mutation rate or FSS.
- Supports simulations with only 1 sample.
- Intercept null migraton rates as an error.
- By default, K is 1.
- Using “fusion” generated a bug.
- The generator of newick trees was unstable.
- Supports empty fasta files.
- toPhyml(): the names are limited to 30 characters.
- Parser supports and ignores \r characters (in both sequences and names).
- Added flag delete_consensus.
- Possible to import termination (*) for proteins.
Container/Align: ns() is reimplemented (using a class member) to speed up repetitive calls.
in polymorphism analysis, a conceptual error led to inappropriate results of He when an outgroup or missing data were present.
A couple of compilation errors are fixed (use of _N and _S symbols).
BppWrapper: Ts/Tv is arbitrarily set to 0. if Tv=0.
Added class LDContainer.
Staden: supports for \r characters.
- All sequences have an integer label (supposed to indicate population membership).
- The sequence readers, writers, simulators and analyzers are modified accordingly.
- Added methods str(), missing().
- added filter() method to align.
- An undue error was raised when the last sequence was removed.
- Long integers are supported for group labels.
- polymorphism(): interface change:
- no outgroup option anymore (the outrgroup should be one of the sequences of the align object, with group label 999).
- interpop stats are automatically computed when several pops are defined in the object.
- added “haplotypes” key.
- (BPP) Ts/Tv is arbitrarily set to 0. if Tv=0.
pairwise() is removed.
consensus() is moved to utils.
in polymorphism analysis, a conceptual error led to inappropriate results of He when an outgroup or missing data were present.
dist() is removed.
interface() is removed.
- added argument simErrors.
- fasdir can be None/False.
- returns a list.
xml: raises exceptions in case of error.
xml ignore \r characters.
Simulators had a conflict with the name He (used for both Hd and He).
CoalesceSimulator renamed coalesceSimulator.
msSimulator: can compute orientation-based statistics.
Added SkipStats to simulators.
rlen() moved to pyinter.
Additions: nj(), staden_consensus(), muscle().
newick(): supports \r.
longest_orf() has been reimplemented - the external application getorf is no longer required. Faster.
The function rlen() is moved from the module seqtools.py to tools.py.
tree: bug fixed in frequency_nodes().
- was sometimes unable to import TITLE.
- supports any carriage return.
Added functions stats() and correl(), and classes paml, updater and timer.
distribution.py is deleted.
cprimers(), sprimers(): bug fixes and minor improvement of usability.
rc(): faster implementation.
backalign(): added option name_table.
flocate() replaces locate(). Use locate() for the fast (and only available) implementation.
ranges(): supports unsorted data.
primer3: the fixed parameters are put into string_init and string is reinitialized at each call to find().
isstream: broken method read().
chisquare(): the function was broken, and returns the critical value for (n+1) ddl instead of n.
The module tools is removed. The classes implementing abc commands are now directly in the seqlib.utils namespace.
rs (and other rs* commands) are removed and replaced by abc_* commands and a set of classes. Note that the behaviour of rs can be reproduced by abc_sample and abc_fit (with regress=False).
Approximate Bayesian Computation: The commands abc_sample, abc_fit, abc_stats and abc_psimuls are introduced. rs and associated commands (rsplot, etc.) are removed and replaced by commands names abc_sample, abc_fit, etc. the abc family of commands extends the features previously incorporated in rs, but also incorporates a number of modifications from version 0.8.
Faster implementation of the ABC discretization method.
Added commands: fasta2phyml(), winphyml(), translate(), instruct(), extract_clade(), extract_nclade(), infos().
sprimers: significantly improved, with option additions and behavior change. In particular the blast check step was refined (with significantly improved stringency). The position score (3’ preference) was wrong (reverted because of BLAST). Bug fixed (gaps were allowed in blast searches).
analyser() and stats() outputs Gst (and so on) - stats() supports group labels in input fasta file.
codalign(): changed to support longer file names, and doesn’t alter names anymore (spaces replaced by underscores). Added option “software” (can use muscle rather than clustalw).
fasta2nexus(): generates valid protein nexus files.
analyzer() becomes analyser().
input/output arguments syntax extended or modified for: clean_seq(), clean_tree(), codalign(), concat(), concatgb(), extract(), extract_clade(), fasta2nexus(), fasta2phyml(), fg2gb(), matcher(), rename(), select() (and others).
- removes the “*” wild-card.
- the list file must use newlines as item separators.
core now compiles successfully with GCC 4
- fixed: when several trees where imported, they were all accidentally merged (problem with superficial copy).
- added: rename_leaves, clades, frequency_nodes methods.
Polymorphism and polymorphism provide the list of polymorphic sites
discret becomes rs_analyse and now produces an output with stats.
stats function added to utils.
coalesce output was crappy (ie not supported by function ms) for simulations without polymorphic sites.
(4.)0.7.2 - 16.10.08
A few improvments and bug fixes.
(4.)0.7.1 - 16.09.08
pylab import generated crash when matplotlib was absent (fixed: the presence of matplotlib is no longer enforced)
useless params output by sprimers was fixed
Hnew of polymorphism renamed to Z
default values of simulators changed
added a trim option to discret
- sprimers has been improved:
- filter replaced by filter1 and filter2 (filter1 occurring before the blast step)
- both sorting steps (before and after the blast step) were wrong
- ranges, ungap, names and rename as utils commands
- names, duplicates, contains_duplicates and no_duplicates as fasta methods
- translation in toolkit
- nexus method in fasta.align and fasta2nexus command
(4.)0.7.0 - 12.09.08
fasta string import extended to containers.
- plot is depreciated replaced by
- discret (doesn’t clean up empty classes any more)
align is fixed to support alignments with length = 0
Random seeds are now static: that means that seeds are set by the complete program. Previously (since 4.0.4), different objets created with less than 1 second of delay had the same seeds. As a result, rs simulated identical loci, resulting in increased variance of statistics and a very poor estimation.
- error in time formatting after more than one day (fixed).
- incremental counting of time (a priori, transparent change)
- trims 0-frquency classes out of prior
- fixed bug cause by Random error (above)
- fixed error in SPM (M was ignored and errorly fixed at simul’s default value!)
- uses a harcoded (not in a separated file) very large prior distribution.
- the setup.py script is radically modified:
clean: removes object files and cleans sip configure: only creates a Makefile sip: compiles sip install: same as before The installation process should go:> python setup.py sip > python setup.py configure > make > python setup.py install
setup also accepts some arguments to modify a few system options
sprimers check was so stringent that the step was completely removed
gb: added method rc (reverse-complement)
utils: added commands extractgb and gb2fas (no doc written yet)
(4.)0.6 - 27.08.08
added composition() method to fasta base class.
- additions to Toolkit:
- this function is directly imported from a script “Genalys2Fasta” (version 05/07/06).
- the function has not been tested at all (more than the previous script). there may be a problem if initial files were not named .ab1.
blast hits are sorted according to e-values.
codalign(): cds argument may be a container instance.
primer3: check() is made a different function from pair() and find_and_pair() (both lose the argument check)
created a function flocate() in Toolkit (faster implementation on the basis of a regular expression search).
blast: inclusion of query-from, query-to and midline in hits entries.
added fasta string import to IO (core) and to align (pyinter) constructor.
ms parser draws nucleotides randomly.
(4.)0.5 - 19.08.08
- additions to Utils:
- staden2fasta This function re-implements part of the program tofasta. As of version 2.5 tofasta is now deprecated. Changes: (1) the interface changes, (2) CONSENSUS is always deleted, (3) dot (‘.’) characters are supported and resolved using CONSENSUS (before deletion), (4) no generation of consensus sequences.
bug fixed in mase parser.
mase extended: copy from align instances, and writer function.
(4.)0.4 - 18.08.08
- created help page for utils direct calls.
- io.ms() IO.ms() both use (by default) standard input.
- Align and Container had a problem in copy constructors: an empty sequence (instead of no sequences at all) was added when copying from an empty object.
- Ms (and therefore IO.ms() and io.ms()) did not support an trailing empty null simulation.
- dist() function (in pyinter, manips) was fixed and the order of parameters in the output tuple was changed (to be compatible with polymorphism::pairwise())
- dist(): argument type added.
- slider() added to toolkit.
- introduced mode debug for running utils function through seqlib (shows full error message).
- extensions of rs: introduction of option rule and addition of model 6 (using ms).
- ms incorporated in the package.
- Random used to take its address on memory as second seed. This seemed to cause problems depending on the system and was changed to a constant second seed (0.). The first seed is still the system time, and it’s still possible to set arbitrary seeds.
- added import_posterior, clean_tree, clean_seq concatgb and concat functions to Utils.
- non-keyword arguments are passed to Utils functions (they may be ignored, as well as unknown keywords.
- primer3 default Tm range was much narrower than claimed (61-65 instead of 55-65).
- a problem with the function ranges of prior was fixed (appeared when using priors with more than 1 class).
- rs accepts a maxsim argument to stop simulations after a givennumber of simulations (by default, 1000000000).
(4.)0.3 - 07.08.08
SIP is now included in the distribution.
- setup.py changes:
- options removed: pyinc, pylib, cpath and compiler
- compiles SIP
- enforces the use of g++
- Toolkit/blast: each hit entry contains:
- ‘pos’, the positions of the first Hsp (individual hit fragment),
- the e-value (‘e’),
- ‘identity’, the identity rate
(4.)0.2 - 05.08.08
- Polymorphism: Possible bug: count of segregating sites when MULTIPLE is true (sites may be missed).
- the names of some private members (such as _A) in Changes, Coalesce and Polymorphism have been changed to make Xcode compiler happy.
- two memory leaks have been fixed in Sequence and one in Site (causing problems to Polymorphism and Coalesce).
- change in setup.py: now uses the sipconfig module to finds Python installation paths
(4.)0 - 28.July.08 (alpha4)
- utils::rs::rs finished (not tested)
(4.)0 - 24.July.2008 (alpha3)
SeqLib is released publicly and numbering is reset to 0.
- bugs fixed in setup.py:
- option BPP not processed correctly.
- inclusion not system independent.
- flush output during compilation (not a bug).
- determines itself python installation details.
- incorportation of utils (preliminary)
- rs (on-going)
- gb parser temporarilly failed if >1 ‘=’ sign in feature (bug fixed)
- in seqtools, locate() used amb_compare instead of compare (bug fixed)
- addition of lfimport function in fasta
- compilation in optimization mode 3 (hopefully faster)
- missing imports in dataset and tools
- dataset’s select method extended and modified
(4.)0 - 08.July.208 (alpha2)
- formatting the release (license, readme, setup script).
- Bio++ is made optional
- toolkit is completely incorporated
- doxygen documentation
(4.)0 - 23.May.2008 (alpha1)
- KNOWN ISSUES
- mingw support is removed (has to be added in skip_line and next_line functions!)
- noted a possible problem(in consensus generation): example A+T+A (rigorous) ->W+A -> A ( = problem)
newick is not stable, apparently (TODO: use standard libraries for XML and tree)
reprogram XML using default python modules
reprogram tree and newick
memory leak in rs
Lots of changes in the interface and the implementation.
Not all changes are listed below.
creation of the seqlib namespace
- added a simplified wrapper of vector for Align (VAlign) and unsigned int (Vuint) with no checking
these classes provide a SIP interface and are designed for being used by a Python wrapper (never directly)
- incorporation of the module coalesce
deletion of BaseCoalesce (classes are integrated in the Seqlib hierarchy)
other classes are just ported with minor compatibility changes
- pi attribute of Coalesce changed to Pi
- uses new version of Polymorphism
- removed clear_error
- statistics of irrelavant data type are initialyzed
- in case of error: sets everything to 0/default
- apparently its impossible to set alpha<0. the blocking is maintained.
- blank line added after header in data file, plus between simulations for microsats
- added tMRCA statistic
other former classes of the BaseCoalesce hierarchy are in a “coalesce” namespace
- creation of BppWrapper:
- available only with mode dna at the moment (translated as DNA for bpp)
Pairwise: deleted and transfered to Polymorphism
- compatibility changes
- the constructor closes the input file after use
- return Vuint objects
- Consensus (incorporated in Polymorphism):
- doesn’t write anything anywhere, except a report in an internal string
- note: some use of vector (check whether any other container may be better)
- missing: missing code in input (?)
- disagrement: code for disagreemnt in output (non rigorous mode) (Z)
- constructor calling directly analyze
- both take more arguments
- the same object can be used several times
- analyze returns the number of polymorphic sites or -1 in case of error
- site accessors are deleted (sites are not stored any more)
- sites with more than 2 alleles are accepted: always: eta
- consensus() function
- pairwise() function collecting Pairwise functionalities
- wrong data type leads to 0 polymorphism, not error (false characters are taken as missing)
- don’t store actual data anymore (no more get() accesser)
- carriers reimplemented as a pointer, and initialized at construction
- minor change in interface
- no destruction of the data pointer
- automatic conversion to upper case
- possible to set an outgroup with mode b - otherwise, 0 are taken to be ancestral
- the linked list feature is DELETED
- observations (these are no change):
- the usage of newlines for separating exons is enforced in constructor but no in method import()
- the format is very sensitive to spaces, don’t add any other positions than specified
- the numbering of the input is not converted
- renamed to Ms and linked to from IO
- copy is implicitely allowed
- the class manages a pointer to the stream
- size limits are removed
- renamed as Staden
- simplified interface: only import which returns an Align
- import uses CONSENSUS to resolve . characters
- import deletes CONSENSUS
- renamed as IO
- significant changes of the interface: reading functions return an object and writing functions take an object as argument
- no longer length limit (use of queues)
- incorporates a call to Staden::convert (less efficient because of an additional object copy)
- incorporates Ms call
- removed DATA_TYPE, MINIMUM_READ, SKIP_RM, SMALL_DIFF and MULTIPLE_HITS_ACCEPTED
- change interface of isValid() to accept type character
- isValid() is made case-insensitive
- add constructor Sequence(number, char) to initialize an empty sequence
- concatenating sequences with different names is no longer fatal
- oor errors for get(), set(), rem()
- suppress build_helper() helper function and lname, lseq members
- pname(), psequence() become name() and sequence()
- copy constructor supports overwriting
- remSeq() now checks
- equalize() takes an optional padding character as argument
- pname, psequence, psequence2 renamed to name, sequence and getSequence (respectively)
- slice() becomes hslice()
- still doesn’t perform any test
- get() checks
- binSwitch() checks p and binary data
- subset() becomes vslice() (with an overloaded function vslice(a,b)
- vslice(vector<>) re-implemented (a bit) more efficiently, but now the order in the vector is strictly followed
3.2.8 - 28.04.08
- 28/04/08: SequenceAlignment::getColumn returns NULL in case of invalid index (and error statements)
- 13/03/08: slice now accepts a=b arguments
3.2.7 - 12/03/08
- Pairwise: dist() was wrongly divided by the number of (overall) polymorphic sites
3.2.6 - 04/03/08
- GetMs: reading buffer increase to 500000 (instead of 50000): support larger lines (ie simulations with many more sites)
- ReadingFrame: added function last()
- Polymorphism: change in D(): in case the variance is close to zero (compared to SMALL_DIFF) is catched and its set to zero this avoids taking the square root of a (slightly) negative number and having an indefinite #IND D (although it will stay infinite #INF)
- Added field SMALL_DIFF in Seqlib (used by Polymorphism:D() as stated above)
3.2.5 - 28/02/08
- Changes in SequenceContainer::slice()
both arguments are made int, no default value checks are now performed and an error is set in case of any problem with indices upon such case, an empty container is return
Bug in SequenceContainer - SequenceAlignment: error generated when the last sequence was removed in SequenceAlignment, lseq was not set to 0 because of missing virtual linking
3.2.4 - 25/02/08
- Bug fixed in GetStadenAlign: in getshift(), the rewind loop did not seem to work properly it has been replaced by a simple close+open operation required storage of the file name
3.2.3 - 23/02/08
- Bug fixed in SequenceContainer::remseq(): the loop for renumbering did not consider the last step
- Iterators of SequenceAlignment are converted in SequenceAlignment*
- SequenceContainer::build_helper() is deleted and replaced by its actual loop in SequenceContainer and descendants
3.2.2 - 14/02/08
- GetStadenAlign: bug fixed, a bug was generated by constructor GetStadenAlign(const char*)
3.2.1 - 11/01/08
- The SeqlibException’s have been abandonned for the moment.
Check ::error() instead (should be an empty string)
Changes in GetMS() (public functions added)
- void close():
- destroy the input stream
- good() will return false
- calls to import(bool) will generate errors
- SequenceAlignment simul(bool binary = false):
- wraps import(bool) (useful for Python where import is reserved)
- its adviced to use import(bool) in C++
- Each class has its own header file
- The library is compiled as a static archive
- All output goes through Seqlib::error( ) and generates a SeqlibException
- typedef uint removed
- Several bug fixes and changes (including in the interface)
3.1.1 - 18/08/07
- Frame.h added with ReadingFrame and CodingSite (they are not incorporated in the Seqlib hierarchy)
3.1.0 - 02/08/07
- GetStadenAlign.h becomes Import.h
- creation of GetMS added to Import.h
Unnumbered - 01/AUG/2007
Polymorphism: - added access method site(int) - bug fixed in Site (see documentation of Site) - outgroup value checked
3.0 - 31/07/07
SequenceAlignment splitted into SequenceContainer (just a list of sequences) and SequenceAlignment (forced to be equalized)
SequenceContainerIO replaces (with no notable changes) SequenceAlignmentI and O (note that it is a SequenceContainer)
Creation of Pairwise comparing to SequenceAlignment (divergence-like class)
GetStadenAlign is updated (more changes in header files)
- Classes are grouped following kinda logic
- Seqlib.h: Seqlib, Sequence, SequenceContainer, SequenceAlignment, SequenceContainerIO
- Polymorphism.h: Site, Polymorphism, Pairwise
- GetStadenAlign.h: GetStadenAlign
Bug fixed in SequenceAlignment::build_helper(): initialization of rank
- Class hierarchy
- SequenceContainer (has Sequence)
Polymorphism (has Site, SequenceAlignment)
GetStadenAlign (has Site, SequenceAlignment)
2.2 - 25/MAY/07
ReadingFrame: constructor accepts the index of an outgroup that will not be included
2.1 - 23/FEB/2007
Polymorphism: - Create from a combination of code from previous classes Analyser and SequencePolymorphism (from Seqlib 1).
2 - 23/02/07
- The library is written on a c-like fashion, data storage is malloc (for sequences) and linked list (new) for sequence alignments
- Input and output are interfaced by two classes, SequenceAlignmentI and SequenceAlignmentO
- Seqlib is introduced as a general base class containing DATA_TYPE, MINIMUM_READ, SKIP_RM and FORCE_ALIGNMENT
1.2 - 10/JUN/2006
Changes in ReadingFrame: - allowing different codon start - good( ) function removed - reads into an open stream - frameQ created
1.1 - 16/MAY/2006
ReadingFrame: corrected error in NS/S sites per codon: mutations to stops were not excluded, now they are
- SequenceContainer class hierarchy, data storage as vectors
- no information