Programs designed and implemented by M. Göker for sequence clustering and related tasks. If you use any of the programs in a publication, please cite this web page. Each of the zip files with software contains three executables (for Linux, MacOS10, and Windows).
The program OPTSIL is a tool for the optimization of threshold-based linkage clustering runs. It was developed for molecular taxonomy, even though other applications are possible. The typical usage is to conduct clusterings for a series of distance thresholds, and to calculate the agreement between each resulting partition and one or several reference partitions. The optimal threshold values are those for which the agreement is highest. The whole range of linkage clustering approaches from single linkage to complete linkage can be optimized.
The zip archive provided for download also contains readme and example files.
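The threshold optimization described above can be illustrated with a minimal Python sketch. This is not the OPTSIL implementation: it covers only single linkage, it uses the plain Rand index as the agreement measure (an assumption; OPTSIL may use a different partition similarity), and the function names and the toy distances are hypothetical.

```python
from itertools import combinations


def linkage_clusters(dist, n, threshold):
    """Single-linkage clustering: join objects whose distance is <= threshold."""
    parent = list(range(n))

    def find(i):
        # Follow parent links to the cluster representative (with path halving).
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for (i, j), d in dist.items():
        if d <= threshold:
            parent[find(i)] = find(j)
    return [find(i) for i in range(n)]


def rand_index(a, b):
    """Fraction of object pairs on which two partitions agree."""
    pairs = list(combinations(range(len(a)), 2))
    agree = sum((a[i] == a[j]) == (b[i] == b[j]) for i, j in pairs)
    return agree / len(pairs)


def optimize_threshold(dist, reference, thresholds):
    """Return the best agreement and all thresholds that reach it."""
    n = len(reference)
    scored = [(rand_index(linkage_clusters(dist, n, t), reference), t)
              for t in thresholds]
    best = max(score for score, _ in scored)
    return best, [t for score, t in scored if score == best]


# Toy data (invented for illustration): four sequences, two reference taxa.
toy_dist = {(0, 1): 0.01, (2, 3): 0.02,
            (0, 2): 0.2, (0, 3): 0.2, (1, 2): 0.2, (1, 3): 0.2}
toy_reference = [0, 0, 1, 1]
best_score, best_thresholds = optimize_threshold(
    toy_dist, toy_reference, [0.0, 0.01, 0.05, 0.3])
```

In this toy case only the threshold 0.05 reproduces the reference partition exactly; smaller thresholds split the reference taxa and the largest one lumps them.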
Version 1.5: This version fixes a bug in optsil 1.2 on Windows and Mac that caused the last reference partition to be ignored.
Version 1.2: Contains a minor bug; see version 1.5. A workaround for version 1.2 is to add a dummy reference partition as the last column of the reference-partition file. Note that the partition similarity values for the other reference partitions, and all other numeric results, are not affected by the bug.
Details of clustering optimization are described in Göker et al. 2009; please cite this paper if you apply the program. Using optsil, we were able to determine the most suitable species concept for Peronospora and to improve the annotation of the ITS sequences deposited in Genbank for this genus of downy mildews (self-cleaning of Genbank data).
OPTSIL also calculates barcoding gaps. Examples are provided in Stielow et al. 2010, "The neglected hypogeous fungus Hydnotrya bailii Soehner (1959) is a widespread sister taxon of Hydnotrya tulasnei (Berk.) Berk. & Broome (1846)" and Guevara-Guerrero et al. 2011, "Genea mexicana, sp. nov., and Geopora tolucana, sp. nov., new hypogeous Pyronemataceae from Mexico, and the taxonomy of Geopora reevaluated".
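As a rough illustration of what a barcoding gap measures, the following Python sketch computes it as the smallest distance between taxa minus the largest distance within a taxon. This common definition is an assumption here; the exact calculation used by OPTSIL is described in its readme and may differ. The function name and the toy values are hypothetical.

```python
def barcoding_gap(dist, labels):
    """Smallest between-taxon distance minus largest within-taxon distance.

    A positive value means that a gap separates the two kinds of distances.
    """
    intra = [d for (i, j), d in dist.items() if labels[i] == labels[j]]
    inter = [d for (i, j), d in dist.items() if labels[i] != labels[j]]
    if not intra or not inter:
        raise ValueError("need both within-taxon and between-taxon distances")
    return min(inter) - max(intra)


# Toy data (invented for illustration): two taxa with two sequences each.
gap_dist = {(0, 1): 0.01, (2, 3): 0.02,
            (0, 2): 0.2, (0, 3): 0.2, (1, 2): 0.2, (1, 3): 0.2}
gap = barcoding_gap(gap_dist, ["a", "a", "b", "b"])
```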
Optimizing clustering parameters for one dataset and applying them to another allows one to objectively compare biodiversity (given the limits of the respective molecular marker used). See Schlee et al. 2010, "Relicts Within the Genus Complex Astragalus/Oxytropis (Fabaceae), and the Comparison of Diversity by Objective Means".
Clustering optimization can also be used to identify the best distance function for molecular taxonomy. An example is given in Göker et al. 2010, "A clustering optimization strategy for molecular taxonomy applied to planktonic foraminifera SSU rDNA".
We also applied clustering optimization to the taxonomically very difficult fungal genus Hymenogaster. The results indicate (1) which species concept from the literature is optimal; (2) how the remaining discrepancies between molecular data and classification have to be resolved by further revising the classification; (3) how Genbank sequences have to be (re-)named. See Stielow et al. 2011, "Species delimitation in taxonomically difficult fungi: the case of Hymenogaster".
Clustering optimization is also useful in comparing distinct genes and their implications regarding biodiversity estimates. See Setaro et al. 2011, "A clustering optimization strategy to estimate species richness of Sebacinales in the tropical Andes based on molecular sequences from distinct DNA regions".
Göker, M., "Clustering Optimization For Molecular Taxonomy". BioSystematics Berlin 2011 (7th International Congress of Systematic and Evolutionary Biology = 12th Annual Meeting of the Society of Biological Systematics = 20th International Symposium "Biodiversity and Evolutionary Biology" of the German Botanical Society). Berlin/Germany 2011. [Abstract]
Stielow, B., Bratek, Z., Orczán, K.A., Hensel, G., Hoffmann, P., Klenk, H.P., Göker, M., "Species delimitation in taxonomically difficult fungi: the case of Hymenogaster". International congress of the German Mycological Society (DGFM), Hamburg/Germany 2010. [Abstract]
Stielow, B., Bratek, Z., Orczán, K.A., Hensel, G., Hoffmann, P., Göker, M., "Species delimitation in taxonomically difficult fungi: the case of Hymenogaster". The 9th International Mycological Congress Programme Book, Lecture No. U7.O3, Edinburgh/Great Britain 2010. [Abstract]
Göker, M., "Defining biologically meaningful molecular operational taxonomic units". Joint Meeting of the Association for Tropical Biology and Conservation and the Society for Tropical Ecology: "Impacts of Global Change on Tropical Ecosystems - cross-cutting the Abiotic, Biotic and Human Spheres", Marburg/Germany 2009. [Abstract]
Göker, M., "Clustering optimisation techniques to define biologically meaningful molecular operational taxonomic units (MOTUs)". Workshop "Mycorrhizas in Tropical Forests", Loja/Ecuador 2008. [Abstract]
gbk2fas: Like the old version, but with many more options. 27 distinct feature entries can be extracted from Genbank flatfiles and placed in FASTA headers or CSV files. Lists with product names can be created, as well as m4 mapping files for later replacement.
gbk2fas, old version: use this program to convert sequence data in Genbank flatfiles to FASTA format. This may seem to be a simple task, but gbk2fas enables one to (1) adapt the FASTA headers to each user's needs; (2) write out only the accessions in the FASTA headers and create a file that can later be used to replace the headers in alignments, tree files, etc. with full names; (3) place the information from Genbank files in CSV files (as a first step in data mining). For instance, reference partitions for clustering optimization with optsil (see above) or host-associate relations can be created. If you apply gbk2fas, please cite Göker et al. 2010.
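The idea behind point (2) can be sketched in a few lines of Python. This is not gbk2fas itself and does not reproduce its options: it reads only the ACCESSION, ORGANISM and ORIGIN parts of a flatfile, the function names are hypothetical, and the sample record with the accession XX000001 is made up for illustration.

```python
import io


def read_genbank(handle):
    """Yield (accession, organism, sequence) for each record in a flatfile."""
    accession, organism, seq, in_origin = "", "", [], False
    for line in handle:
        if line.startswith("//"):
            # "//" terminates a record: emit it and reset the state.
            yield accession, organism, "".join(seq)
            accession, organism, seq, in_origin = "", "", [], False
        elif in_origin:
            # Sequence lines carry position numbers and blanks; keep letters only.
            seq.append("".join(ch for ch in line if ch.isalpha()))
        elif line.startswith("ACCESSION"):
            parts = line.split()
            accession = parts[1] if len(parts) > 1 else ""
        elif line.strip().startswith("ORGANISM"):
            organism = line.split("ORGANISM", 1)[1].strip()
        elif line.startswith("ORIGIN"):
            in_origin = True


def fasta_with_mapping(records):
    """Return FASTA text with accession-only headers plus a header mapping."""
    fasta_lines, mapping = [], {}
    for accession, organism, sequence in records:
        fasta_lines.append(f">{accession}\n{sequence.upper()}")
        # Full name to substitute for the accession later (e.g. in tree files).
        mapping[accession] = f"{organism.replace(' ', '_')}_{accession}"
    return "\n".join(fasta_lines) + "\n", mapping


# Made-up minimal record, for illustration only.
SAMPLE = """LOCUS       XX000001   12 bp    DNA
ACCESSION   XX000001
  ORGANISM  Example organism
ORIGIN
        1 acgtacgtac gt
//
"""
fasta_text, header_map = fasta_with_mapping(read_genbank(io.StringIO(SAMPLE)))
```

Keeping only the accession in the header avoids problems with programs that truncate or reject long labels; the mapping restores the full names afterwards.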
rbc: rbc.tcl is a script to conduct blastclust clustering with a user-defined range of similarity thresholds. The output of a series of rbc optimization runs can be read by optsil. One needs to install tcl to run the script.
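What such a series of runs might look like is sketched below in Python rather than tcl. This is not rbc.tcl and does not reproduce its output format. The sketch only builds the command lines and does not execute them; it assumes the legacy NCBI blastclust options -i (input), -o (output), -p F (nucleotide data), -S (similarity threshold) and -L (length coverage), and the function name and output file naming are hypothetical.

```python
def blastclust_commands(fasta, thresholds, coverage=0.9):
    """Build one blastclust command line per similarity threshold."""
    commands = []
    for similarity in thresholds:
        # Hypothetical naming scheme: one cluster file per threshold.
        out = f"{fasta}.S{similarity}.clusters"
        commands.append(["blastclust", "-i", fasta, "-o", out, "-p", "F",
                         "-S", str(similarity), "-L", str(coverage)])
    return commands


# Thresholds 90, 95 and 100 percent identity for a hypothetical input file.
cmds = blastclust_commands("seqs.fas", range(90, 101, 5))
```

Each list can be passed to `subprocess.run` on a system where blastclust is installed.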
treeinsert: simple tcl script to append cluster numbers to labels in Newick trees. Needs the csv output of optsil clustering with a fixed threshold.
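The relabelling step can be sketched in Python as follows. This is not the treeinsert script: the layout of the optsil csv output is not assumed, so the cluster assignment is passed in as a plain dictionary, and the function name, the separator and the example tree are hypothetical. Quoted labels and labels containing whitespace are not handled.

```python
import re

# A leaf label directly follows "(" or "," and ends before ":", ",", ")" or ";".
LEAF_LABEL = re.compile(r"(?<=[(,])([^():,;]+)")


def append_cluster_numbers(newick, clusters, sep="_"):
    """Append the cluster number to every leaf label found in `clusters`."""
    def relabel(match):
        label = match.group(1)
        if label in clusters:
            return f"{label}{sep}{clusters[label]}"
        return label  # leave unknown labels untouched

    return LEAF_LABEL.sub(relabel, newick)


# Hypothetical tree and cluster assignment.
tree = "((A:0.1,B:0.2):0.05,C:0.3);"
labelled = append_cluster_numbers(tree, {"A": 1, "B": 1, "C": 2})
```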
In case you have any questions regarding these programs, send an e-mail to support [at] goeker [dot] org. Please include the name of the program somewhere in the subject.