optsil is a program for the optimization of threshold-based linkage clustering runs. The typical usage is to conduct clusterings for a series of distance thresholds, and to calculate the agreement between each resulting clustering partition and one or several reference partitions. The optimal threshold values are those for which the agreement is highest.
Currently optsil does not calculate the distances itself. This needs to be done by an external program. In our view, this greatly adds to the flexibility of the process, because many different distance algorithms can be tested. An example of its use with PAUP* is given below.
The program has been designed and implemented by Markus Goeker in the Ada programming language. For questions, suggestions, bug reports etc., send an e-mail to < support AT goeker DOT org >. If you use the program, please cite Goeker et al. 2009.
optsil is a command-line program that runs in a UNIX terminal or in a DOS window. It has no graphical user interface.
optsil_{linux,mac,windows}: Executables for the three operating systems
a.tab: Example distance file in square tab-separated format
a.dst: Example distance file in extended PHYLIP format
a.ref: Example reference clustering. 1st column: names of objects (species); 2nd column: their genera; 3rd column: their families.
readme.txt, readme.html: This file in plain text or html format. The html file has been generated using markdown.
For operating systems other than Linux, replace "linux" in the following examples by "mac" or "windows".
List of command-line arguments
The complete list of command-line arguments is obtained by entering the program name without arguments:
optsil_linux
On Linux, this requires the executable to be located in one of the folders contained in the $PATH environment variable. Otherwise you have to enter the full path to the executable. If it is within your working directory, enter
./optsil_linux
Optimization
Test thresholds from 0 up to 0.2 with a step width of 0.01 against the reference partitions in a.ref:
optsil_linux -t -v -i 0.0 -s 0.01 -a 0.2 -o a.ref a.tab > a.opt.verbose.out
The contents of a.opt.verbose.out should be like this:
Threshold No._Clusters Shared_I._0 Modif._RI_0 Rand_Ind._0 Shared_I._1 Modif._RI_1 Rand_Ind._1 Shared_I._2 Modif._RI_2 Rand_Ind._2
0.00000000 10.00000000 0.81938200 0.00000000 0.93333333 0.26529500 0.00000000 0.46666667 0.54233850 0.00000000 0.70000000
0.01000000 10.00000000 0.81938200 0.00000000 0.93333333 0.26529500 0.00000000 0.46666667 0.54233850 0.00000000 0.70000000
0.02000000 10.00000000 0.81938200 0.00000000 0.93333333 0.26529500 0.00000000 0.46666667 0.54233850 0.00000000 0.70000000
0.03000000 10.00000000 0.81938200 0.00000000 0.93333333 0.26529500 0.00000000 0.46666667 0.54233850 0.00000000 0.70000000
0.04000000 10.00000000 0.81938200 0.00000000 0.93333333 0.26529500 0.00000000 0.46666667 0.54233850 0.00000000 0.70000000
0.05000000 9.00000000 0.87187405 0.48275862 0.95555556 0.28229058 0.03899721 0.48888889 0.57708232 0.26087792 0.72222222
0.06000000 8.00000000 0.93155205 0.78873239 0.97777778 0.30161279 0.07821229 0.51111111 0.61658242 0.43347234 0.74444444
0.07000000 7.00000000 0.86310409 0.64285714 0.95555556 0.32377450 0.11764706 0.53333333 0.59343929 0.38025210 0.74444444
0.08000000 6.00000000 0.76882089 0.45454545 0.91111111 0.36023411 0.19718310 0.57777778 0.56452750 0.32586428 0.74444444
0.09000000 6.00000000 0.76882089 0.45454545 0.91111111 0.36023411 0.19718310 0.57777778 0.56452750 0.32586428 0.74444444
0.10000000 6.00000000 0.76882089 0.45454545 0.91111111 0.36023411 0.19718310 0.57777778 0.56452750 0.32586428 0.74444444
0.11000000 6.00000000 0.76882089 0.45454545 0.91111111 0.36023411 0.19718310 0.57777778 0.56452750 0.32586428 0.74444444
0.12000000 5.00000000 0.82531179 0.63414634 0.93333333 0.39230567 0.23728814 0.60000000 0.60880873 0.43571724 0.76666667
0.13000000 4.00000000 0.72410090 0.49664430 0.88888889 0.44714003 0.31818182 0.64444444 0.58562046 0.40741306 0.76666667
0.14000000 4.00000000 0.72410090 0.49664430 0.88888889 0.44714003 0.31818182 0.64444444 0.58562046 0.40741306 0.76666667
0.15000000 3.00000000 0.57714625 0.32835821 0.80000000 0.56099212 0.48275862 0.73333333 0.56906918 0.40555841 0.76666667
0.16000000 1.00000000 0.00000000 -0.00000000 0.06666667 0.00000000 -0.00000000 0.53333333 0.00000000 -0.00000000 0.30000000
0.17000000 1.00000000 0.00000000 -0.00000000 0.06666667 0.00000000 -0.00000000 0.53333333 0.00000000 -0.00000000 0.30000000
0.18000000 1.00000000 0.00000000 -0.00000000 0.06666667 0.00000000 -0.00000000 0.53333333 0.00000000 -0.00000000 0.30000000
0.19000000 1.00000000 0.00000000 -0.00000000 0.06666667 0.00000000 -0.00000000 0.53333333 0.00000000 -0.00000000 0.30000000
0.20000000 1.00000000 0.00000000 -0.00000000 0.06666667 0.00000000 -0.00000000 0.53333333 0.00000000 -0.00000000 0.30000000
The interpretation is as follows. Regarding the agreement of the clustering partition with the affiliation to genera (columns 3-5), the Shared Information (column 3), the Modified Rand Index (column 4) and the original Rand Index (column 5) all indicate that a threshold of 0.06 is optimal. Regarding families (columns 6-8), a threshold of 0.15 is recommended by all criteria. Regarding the average agreement with genera and families (columns 9-11), either 0.06 (Shared Information), 0.12 (Modified Rand Index) or 0.12-0.15 (Rand Index) is optimal, depending on the criterion applied.
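The optima can also be read off programmatically. The following Python sketch is not part of optsil; it parses an excerpt of the verbose table above and reports, for each criterion column, the highest value and every threshold at which that value is reached. The function name best_thresholds is hypothetical, and the excerpt omits the rows that cannot contain an optimum.

```python
# Illustrative helper, not part of optsil: find the best thresholds per column.

# Excerpt of the verbose output shown above (rows 0.05-0.07 and 0.12-0.15).
VERBOSE_EXCERPT = """\
Threshold No._Clusters Shared_I._0 Modif._RI_0 Rand_Ind._0 Shared_I._1 Modif._RI_1 Rand_Ind._1 Shared_I._2 Modif._RI_2 Rand_Ind._2
0.05000000 9.00000000 0.87187405 0.48275862 0.95555556 0.28229058 0.03899721 0.48888889 0.57708232 0.26087792 0.72222222
0.06000000 8.00000000 0.93155205 0.78873239 0.97777778 0.30161279 0.07821229 0.51111111 0.61658242 0.43347234 0.74444444
0.07000000 7.00000000 0.86310409 0.64285714 0.95555556 0.32377450 0.11764706 0.53333333 0.59343929 0.38025210 0.74444444
0.12000000 5.00000000 0.82531179 0.63414634 0.93333333 0.39230567 0.23728814 0.60000000 0.60880873 0.43571724 0.76666667
0.13000000 4.00000000 0.72410090 0.49664430 0.88888889 0.44714003 0.31818182 0.64444444 0.58562046 0.40741306 0.76666667
0.14000000 4.00000000 0.72410090 0.49664430 0.88888889 0.44714003 0.31818182 0.64444444 0.58562046 0.40741306 0.76666667
0.15000000 3.00000000 0.57714625 0.32835821 0.80000000 0.56099212 0.48275862 0.73333333 0.56906918 0.40555841 0.76666667
"""


def best_thresholds(table_text):
    """Map each criterion column to (best value, thresholds reaching it)."""
    lines = table_text.strip().splitlines()
    header = lines[0].split()
    rows = [[float(field) for field in line.split()] for line in lines[1:]]
    result = {}
    # Columns 0 and 1 hold the threshold and the number of clusters.
    for col in range(2, len(header)):
        best = max(row[col] for row in rows)
        result[header[col]] = (best, [row[0] for row in rows if row[col] == best])
    return result


if __name__ == "__main__":
    for name, (value, thresholds) in best_thresholds(VERBOSE_EXCERPT).items():
        print(name, value, thresholds)
```

Ties are kept as a list on purpose, because several thresholds may be equally good (as for Rand_Ind._2 here).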
Less verbose output is produced as follows:
optsil_linux -t -i 0.0 -s 0.01 -a 0.2 -o a.ref a.tab > a.opt.out
Here, only the medians of the best threshold values are shown:
Shared_I._0 0.93155205 0.06000000
Modif._RI_0 0.78873239 0.06000000
Rand_Ind._0 0.97777778 0.06000000
Shared_I._1 0.56099212 0.15000000
Modif._RI_1 0.48275862 0.15000000
Rand_Ind._1 0.73333333 0.15000000
Shared_I._2 0.61658242 0.06000000
Modif._RI_2 0.43571724 0.12000000
Rand_Ind._2 0.76666667 0.15000000
Clustering
To apply the threshold 0.06 (optimal with respect to the genera, see above) to the distances, we type:
optsil_linux -t -i 0.06 a.tab > a.opt006.out
Here are the results:
Object Cluster_0 Minimum_0 Average_0 Maximum_0 Spanning_0 Isolation_0 Cohesion_0
Alligator_sinensis 0.00000000 -1.00000000 -1.00000000 -1.00000000 -1.00000000 0.11836734 -1.00000000
Melanosuchus_niger 1.00000000 -1.00000000 -1.00000000 -1.00000000 -1.00000000 0.07346939 -1.00000000
Paleosuchus_trigonatus 3.00000000 0.04526749 0.02263375 0.04526749 0.04526749 0.14754099 -1.00000000
Alligator_mississippiensis 2.00000000 -1.00000000 -1.00000000 -1.00000000 -1.00000000 0.11836734 -1.00000000
Caiman_latirostris 4.00000000 0.05761317 0.02880658 0.05761317 0.05761317 0.07346939 -1.00000000
Paleosuchus_palpebrosus 3.00000000 0.04526749 0.02263375 0.04526749 0.04526749 0.14754099 -1.00000000
Caiman_crocodilus 4.00000000 0.05761317 0.02880658 0.05761317 0.05761317 0.07346939 -1.00000000
Crocodylus_rhombifer 5.00000000 -1.00000000 -1.00000000 -1.00000000 -1.00000000 0.12601626 -1.00000000
Gavialis_gangeticus 6.00000000 -1.00000000 -1.00000000 -1.00000000 -1.00000000 0.06274510 -1.00000000
Tomistoma_schlegelii 7.00000000 -1.00000000 -1.00000000 -1.00000000 -1.00000000 0.06274510 -1.00000000
That is, the two Caiman and the two Paleosuchus species are clustered together (column 2), whereas the two Alligator species are still separated. Because the other genera are represented by a single species only and placed in clusters of their own, the solution is optimal except for Alligator.
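The agreement between this clustering and the genera can be recomputed by hand. The following Python sketch is not optsil's own code. It takes the cluster numbers from the output above, derives the genus of each species from the first part of its name, and computes the Rand index (Rand 1971) and the adjusted Rand index of Hubert and Arabie (1985). That optsil's Modified Rand Index corresponds to the latter is an assumption; for this example it reproduces the values 0.97777778 and 0.78873239 listed for threshold 0.06 in the verbose output.

```python
# Illustrative recomputation, not optsil's own code.
from collections import Counter
from itertools import combinations
from math import comb

# Cluster numbers as listed in the clustering output for threshold 0.06.
CLUSTERING = {
    "Alligator_sinensis": 0,
    "Melanosuchus_niger": 1,
    "Paleosuchus_trigonatus": 3,
    "Alligator_mississippiensis": 2,
    "Caiman_latirostris": 4,
    "Paleosuchus_palpebrosus": 3,
    "Caiman_crocodilus": 4,
    "Crocodylus_rhombifer": 5,
    "Gavialis_gangeticus": 6,
    "Tomistoma_schlegelii": 7,
}


def rand_index(a, b):
    """Fraction of object pairs treated alike (together or apart) by both partitions."""
    pairs = list(combinations(range(len(a)), 2))
    agreeing = sum((a[i] == a[j]) == (b[i] == b[j]) for i, j in pairs)
    return agreeing / len(pairs)


def adjusted_rand_index(a, b):
    """Rand index corrected for chance, following Hubert and Arabie (1985)."""
    together_both = sum(comb(c, 2) for c in Counter(zip(a, b)).values())
    together_a = sum(comb(c, 2) for c in Counter(a).values())
    together_b = sum(comb(c, 2) for c in Counter(b).values())
    expected = together_a * together_b / comb(len(a), 2)
    maximum = (together_a + together_b) / 2
    if maximum == expected:
        # Both partitions are all singletons or both are one single cluster.
        return 1.0
    return (together_both - expected) / (maximum - expected)


species = list(CLUSTERING)
clusters = [CLUSTERING[name] for name in species]
genera = [name.split("_")[0] for name in species]  # genus = first part of the name

if __name__ == "__main__":
    print(f"{rand_index(genera, clusters):.8f}")
    print(f"{adjusted_rand_index(genera, clusters):.8f}")
```

The only disagreeing pair among the 45 pairs of species is the two Alligator species, which belong to the same genus but to different clusters.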
The columns 3-8 contain selected cluster statistics. -1 represents missing data. Except for the cluster isolation, the indices can only be calculated if a cluster comprises at least two objects. The cohesion can only be calculated if the cluster comprises at least three objects.
Cluster statistics for externally provided partitions
The same statistics may also be calculated for given partitions. The syntax is:
optsil_linux -t -i 0.03 -c a.ref a.tab > a.ref.out
The results are:
Object Cluster_0 Minimum_0 Average_0 Maximum_0 Spanning_0 Isolation_0 Cohesion_0 Cluster_1 Minimum_1 Average_1 Maximum_1 Spanning_1 Isolation_1 Cohesion_1
Alligator_sinensis 0.00000000 0.11836734 0.05918367 0.11836734 0.11836734 0.14754099 -1.00000000 0.00000000 0.04526749 0.44654958 0.21399178 0.11836734 0.15983607 -1.00000000
Melanosuchus_niger 1.00000000 -1.00000000 -1.00000000 -1.00000000 -1.00000000 0.07346939 -1.00000000 0.00000000 0.04526749 0.44654958 0.21399178 0.11836734 0.15983607 -1.00000000
Paleosuchus_trigonatus 2.00000000 0.04526749 0.02263375 0.04526749 0.04526749 0.14754099 -1.00000000 0.00000000 0.04526749 0.44654958 0.21399178 0.11836734 0.15983607 -1.00000000
Alligator_mississippiensis 0.00000000 0.11836734 0.05918367 0.11836734 0.11836734 0.14754099 -1.00000000 0.00000000 0.04526749 0.44654958 0.21399178 0.11836734 0.15983607 -1.00000000
Caiman_latirostris 3.00000000 0.05761317 0.02880658 0.05761317 0.05761317 0.07346939 -1.00000000 0.00000000 0.04526749 0.44654958 0.21399178 0.11836734 0.15983607 -1.00000000
Paleosuchus_palpebrosus 2.00000000 0.04526749 0.02263375 0.04526749 0.04526749 0.14754099 -1.00000000 0.00000000 0.04526749 0.44654958 0.21399178 0.11836734 0.15983607 -1.00000000
Caiman_crocodilus 3.00000000 0.05761317 0.02880658 0.05761317 0.05761317 0.07346939 -1.00000000 0.00000000 0.04526749 0.44654958 0.21399178 0.11836734 0.15983607 -1.00000000
Crocodylus_rhombifer 4.00000000 -1.00000000 -1.00000000 -1.00000000 -1.00000000 0.12601626 -1.00000000 1.00000000 0.06274510 0.10628089 0.13008130 0.12601626 0.15983607 -1.00000000
Gavialis_gangeticus 5.00000000 -1.00000000 -1.00000000 -1.00000000 -1.00000000 0.06274510 -1.00000000 1.00000000 0.06274510 0.10628089 0.13008130 0.12601626 0.15983607 -1.00000000
Tomistoma_schlegelii 6.00000000 -1.00000000 -1.00000000 -1.00000000 -1.00000000 0.06274510 -1.00000000 1.00000000 0.06274510 0.10628089 0.13008130 0.12601626 0.15983607 -1.00000000
Columns 2-8 contain the statistics for the 1st partition (here, the genera), columns 9-15 those for the 2nd partition (the families).
Jackknifing
For taxon jackknifing, use the following command-line options:
optsil_linux -t -n 100 -k 0.5 -i 0.0 -s 0.01 -a 0.2 -j a.ref a.tab > a.jack.out
This results in a file like the following (exact numeric values may deviate from run to run):
Reference Average Std.Dev. Median MedianAD Minimum Maximum Lw.Bnd.2 Up.Bnd.2 Lw.Bnd.1 Up.Bnd.1
SharedI.0 0.03465000 0.01378867 0.03000000 0.01000000 0.02000000 0.06500000 0.02000000 0.06500000 0.02000000 0.06000000
SharedI.0 0.98966376 0.04091201 1.00000000 0.00000000 0.82772938 1.00000000 0.82772938 1.00000000 0.82772938 1.00000000
Modif.RI0 0.06850000 0.03540833 0.06250000 0.02750000 0.03000000 0.14000000 0.03000000 0.13500000 0.03000000 0.13000000
Modif.RI0 0.97692308 0.09134109 1.00000000 0.00000000 0.61538462 1.00000000 0.61538462 1.00000000 0.61538462 1.00000000
RandInd.0 0.07440000 0.03758510 0.06500000 0.03000000 0.03000000 0.14000000 0.03000000 0.13500000 0.03000000 0.13500000
RandInd.0 0.99400000 0.02374868 1.00000000 0.00000000 0.90000000 1.00000000 0.90000000 1.00000000 0.90000000 1.00000000
SharedI.1 0.07080000 0.03442906 0.06500000 0.03000000 0.03000000 0.14000000 0.03000000 0.13500000 0.03000000 0.13000000
SharedI.1 0.79883506 0.31754245 1.00000000 0.00000000 0.00000000 1.00000000 0.00000000 1.00000000 0.00000000 1.00000000
Modif.RI1 0.15555000 0.03098302 0.16750000 0.01250000 0.07500000 0.18500000 0.07500000 0.18000000 0.07500000 0.18000000
Modif.RI1 0.85374472 0.26237907 1.00000000 0.00000000 0.13793103 1.00000000 0.28571429 1.00000000 0.28571429 1.00000000
RandInd.1 0.16555000 0.01774676 0.17250000 0.00750000 0.11000000 0.18500000 0.11500000 0.18000000 0.13500000 0.18000000
RandInd.1 0.93100000 0.12624975 1.00000000 0.00000000 0.60000000 1.00000000 0.60000000 1.00000000 0.60000000 1.00000000
SharedI.2 0.16730000 0.01722527 0.17500000 0.00500000 0.11000000 0.19000000 0.11500000 0.18500000 0.13500000 0.18000000
SharedI.2 0.68227435 0.06982679 0.68781350 0.02126933 0.41386469 0.75259805 0.50000000 0.75259805 0.50000000 0.75259805
Modif.RI2 0.09705000 0.03611160 0.10000000 0.02500000 0.03000000 0.17000000 0.03000000 0.15000000 0.03000000 0.14500000
Modif.RI2 0.54466068 0.05339852 0.50000000 0.00000000 0.45054945 0.64285714 0.50000000 0.64285714 0.50000000 0.64285714
RandInd.2 0.10405000 0.03654241 0.11500000 0.02500000 0.03000000 0.18000000 0.03000000 0.17000000 0.03000000 0.15000000
RandInd.2 0.76100000 0.07402027 0.75000000 0.05000000 0.55000000 0.85000000 0.55000000 0.85000000 0.60000000 0.85000000
Shown are the averages, standard deviations, medians, median absolute differences, minima, maxima, and the lower and upper bounds of the two-tailed and one-tailed 95% confidence intervals for (i) the optimal threshold values from jackknifing and (ii) the corresponding values of the index used as optimality criterion. For this reason, each entry in the first column, which names the reference partition and the criterion used, is listed twice: the first row of each pair refers to (i), the second to (ii). By default, a fraction of 1/e (approx. 37%) of the taxa is deleted per replicate. This default is overridden using -k (0.5 in the example above). -n sets the number of jackknife replicates.
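The resampling step can be pictured with the following Python sketch. It is not optsil's implementation: the function name jackknife_subsets is hypothetical, and how optsil rounds the number of deleted taxa or draws its random numbers is not documented here, so both are assumptions. Each replicate keeps a random subset of the taxa after deleting the requested fraction; the optimization is then repeated on each subset.

```python
# Illustrative sketch of taxon jackknifing, not optsil's implementation.
import math
import random


def jackknife_subsets(taxa, n_replicates, delete_fraction=1 / math.e, seed=None):
    """Return n_replicates random subsets of taxa, each missing a fixed fraction."""
    rng = random.Random(seed)
    # Assumption: the number of deleted taxa is rounded to the nearest integer.
    n_delete = round(delete_fraction * len(taxa))
    n_keep = len(taxa) - n_delete
    return [sorted(rng.sample(taxa, n_keep)) for _ in range(n_replicates)]


TAXA = [
    "Alligator_sinensis", "Melanosuchus_niger", "Paleosuchus_trigonatus",
    "Alligator_mississippiensis", "Caiman_latirostris", "Paleosuchus_palpebrosus",
    "Caiman_crocodilus", "Crocodylus_rhombifer", "Gavialis_gangeticus",
    "Tomistoma_schlegelii",
]

if __name__ == "__main__":
    # Corresponds in spirit to "-n 100 -k 0.5": 100 replicates, half the taxa deleted.
    for subset in jackknife_subsets(TAXA, 100, delete_fraction=0.5, seed=1)[:3]:
        print(subset)
```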
More verbose jackknife output (best values from each replicate) is produced using -v.
Miscellaneous
To use complete-linkage clustering, set the -f option to 1, e.g.:
optsil_linux -t -f 1.0 -i 0.0 -s 0.01 -a 0.2 -o a.ref a.tab > a.opt.out
optsil_linux -t -f 1.0 -i 0.06 a.tab > a.opt006.out
optsil_linux -t -f 1.0 -i 0.03 -c a.ref a.tab > a.ref.out
Setting -f to a value between 0.0 (single-linkage clustering, which is the default) and 1.0 results in an intermediate linkage clustering. For instance, -f 0.5 indicates that for two clusters to be fused it is required that at least half of the distances between the objects within two distinct clusters are lower than or equal to the threshold.
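The fusion criterion described above can be sketched as follows. This Python fragment only illustrates the rule for a single pair of clusters; it is not optsil's code, the function name may_fuse and the toy distances are hypothetical, and the order in which optsil actually examines and fuses clusters is not covered. For -f 0.0 the sketch assumes that at least one between-cluster distance must not exceed the threshold, which is the single-linkage rule.

```python
# Illustrative sketch of the -f fusion criterion, not optsil's code.


def may_fuse(distances, cluster_a, cluster_b, threshold, f=0.0):
    """Check whether two clusters satisfy the linkage-fraction criterion.

    distances maps frozenset({x, y}) to the distance between objects x and y.
    """
    between = [distances[frozenset((x, y))] for x in cluster_a for y in cluster_b]
    close = sum(d <= threshold for d in between)
    # At least one link is always required (single linkage for f = 0.0);
    # otherwise at least the fraction f of all between-cluster distances.
    return close >= 1 and close >= f * len(between)


# Hypothetical toy distances between the objects of two clusters {A, B} and {C, D}.
TOY = {
    frozenset(("A", "C")): 0.05,
    frozenset(("A", "D")): 0.20,
    frozenset(("B", "C")): 0.06,
    frozenset(("B", "D")): 0.30,
}

if __name__ == "__main__":
    for fraction in (0.0, 0.5, 1.0):
        print(fraction, may_fuse(TOY, ["A", "B"], ["C", "D"], 0.10, fraction))
```

With a threshold of 0.10, two of the four between-cluster distances qualify, so the toy clusters would be fused under -f 0.0 and -f 0.5 but not under -f 1.0.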
Separating the clustering approach from the calculation of the distance matrix gives the user the freedom to select a distance formula suitable for the data at hand. Distances from aligned data can be calculated with PAUP* and PHYLIP, among many other programs. PAUP* requires input data in NEXUS format. Load the data by typing the following command at the PAUP* prompt:
execute 'example.nex';
and calculate and save, say, p distances:
dset distance = p missdist = ignore;
savedist format = tabtext triangle = both file = 'example_p.dst';
These can be read by optsil if the -t option is used. DNADIST and PROTDIST from the PHYLIP package will export distances in PHYLIP format, for which the -p switch should be applied instead of -t. Make sure that all distances are saved as square (not triangular) matrices.
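If only a lower-triangular matrix is available, it can be expanded to a square one before it is written to a file for optsil. The following Python sketch shows the expansion only, under the assumption that row i of the triangle holds the distances from object i to the objects before it and that the diagonal is omitted. The function name and the toy values are hypothetical, and writing the result in tab-separated or PHYLIP format is left out.

```python
# Illustrative helper, not part of optsil: expand a lower triangle to a square matrix.


def lower_triangle_to_square(rows):
    """rows[i] holds the i distances from object i to objects 0 .. i-1."""
    n = len(rows)
    square = [[0.0] * n for _ in range(n)]
    for i, row in enumerate(rows):
        if len(row) != i:
            raise ValueError(f"row {i} should contain {i} distances, found {len(row)}")
        for j, distance in enumerate(row):
            # Distances are symmetric, so fill both halves.
            square[i][j] = distance
            square[j][i] = distance
    return square


if __name__ == "__main__":
    # Hypothetical distances among three objects.
    for line in lower_triangle_to_square([[], [0.1], [0.2, 0.3]]):
        print("\t".join(f"{value:.8f}" for value in line))
```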
When reading distance data, optsil removes leading and trailing spaces and replaces each internal space with an underscore. When reading csv data, optsil treats the labels in the same way after removing the quotes surrounding them. Further adaptations of the names are not made. Accordingly, the user has to make sure that the sequence labels are otherwise identical to those in the file with the reference partition(s) (see next section).
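The label treatment described above can be mimicked to check two label lists against each other before running optsil. The following Python sketch is not optsil's code; the function names are hypothetical, and the handling of edge cases (for instance, spaces inside the quotes) is an assumption.

```python
# Illustrative sketch of the label treatment, not optsil's code.


def normalize_label(label, strip_quotes=False):
    """Trim the label and replace each internal space with an underscore.

    With strip_quotes=True, surrounding double quotes are removed first,
    as described for csv data.
    """
    label = label.strip()
    if strip_quotes and len(label) >= 2 and label[0] == '"' and label[-1] == '"':
        label = label[1:-1].strip()
    return label.replace(" ", "_")


def compare_labels(distance_labels, reference_labels):
    """Return the labels found only in the distance file and only in the reference file."""
    in_distances = {normalize_label(x) for x in distance_labels}
    in_reference = {normalize_label(x, strip_quotes=True) for x in reference_labels}
    return sorted(in_distances - in_reference), sorted(in_reference - in_distances)


if __name__ == "__main__":
    print(normalize_label("  Caiman crocodilus "))
    print(compare_labels(["Caiman crocodilus", "Gavialis_gangeticus"],
                         ['"Gavialis gangeticus"', "Caiman_crocodilus"]))
```

Because sets are compared, the order of the labels in the two files does not matter.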
Input partitions must be stored as tab-separated plain text files. Each column except for the first one has to represent a data partition. The first column must store the taxon labels, which have to be exactly identical to those present in the distance matrices except for spaces vs. underscores (for example, see the file a.ref). However, quotes surrounding the labels are removed from the tab-separated file (not from the distance file!). The order of the labels need not be identical between the two files. The optsil program can assist in creating an appropriate reference file. Enter
optsil_linux -t -l a.tab > a.csv
Then open the file a.csv in a spreadsheet program such as OpenOffice Calc or Microsoft Excel. Enter the cluster labels in the 2nd, 3rd, etc. columns. Make sure that you store the file in csv format, using tab characters as separators and double quotes (not single quotes!) as quoting characters. The file a2.csv has been created with OpenOffice Calc in that manner. You do not need the quotes, however. Their automated removal is only done by optsil to make it easier to work with spreadsheet programs.
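A file in this layout can also be produced without a spreadsheet program. The following Python sketch uses the standard csv module with tab separators and double quotes. It is an illustration only: the function name write_reference is hypothetical, and the single partition column is derived here from the first part of each species name rather than copied from a.ref.

```python
# Illustrative sketch: write a tab-separated, double-quoted partition file.
import csv
import io

SPECIES = [
    "Alligator_sinensis", "Melanosuchus_niger", "Paleosuchus_trigonatus",
    "Alligator_mississippiensis", "Caiman_latirostris", "Paleosuchus_palpebrosus",
    "Caiman_crocodilus", "Crocodylus_rhombifer", "Gavialis_gangeticus",
    "Tomistoma_schlegelii",
]


def write_reference(rows, handle):
    """Write rows (label, partition 1, partition 2, ...) in the layout described above."""
    writer = csv.writer(handle, delimiter="\t", quotechar='"',
                        quoting=csv.QUOTE_ALL, lineterminator="\n")
    writer.writerows(rows)


if __name__ == "__main__":
    # One partition column: the genus, taken from the first part of the name.
    rows = [(name, name.split("_")[0]) for name in SPECIES]
    buffer = io.StringIO()
    write_reference(rows, buffer)
    print(buffer.getvalue(), end="")
```

To write to disk instead of a string buffer, pass a file opened with open(path, "w", newline="") as the handle.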
Optsil can also be used to calculate the agreement between any pair of files with partitions. For instance, to compare 1.ref and 2.ref, enter:
optsil_linux -r 2.ref 1.ref > 1_vs_2.out
The format is identical to the output of -o.
Day, W.H.E. 1977. Validity of clusters formed by graph-theoretic cluster methods. Mathematical Biosciences 36: 299-317.
Estabrook, G.F. 1966. A mathematical model in graph theory for biological classification. Journal of Theoretical Biology 12: 297-310.
Estabrook, G.F. 1967. An information theory model for character analysis. Taxon 16: 86-97.
Felsenstein, J. 2005. PHYLIP (Phylogeny Inference Package) version 3.6. Distributed by the author. Department of Genome Sciences, University of Washington, Seattle.
Goeker, M., Garcia-Blazquez, G., Voglmayr, H., Telleria, M.T. and Martin, M.P. 2009. Molecular taxonomy of phytopathogenic fungi: a case study in Peronospora. PLoS ONE 4: e6319.
Hubert, L. and Arabie, P. 1985. Comparing partitions. Journal of Classification 2: 193-218.
Lanyon, S. 1985. Detecting internal inconsistencies in distance data. Systematic Zoology 34: 397-403.
Legendre, P. and Legendre, L. 1998. Numerical ecology (2nd English edn.). Amsterdam: Elsevier Science BV.
Rand, W.M. 1971. Objective criteria for the evaluation of clustering methods. Journal of the American Statistical Association 66: 846-850.
Sokal, R.R. and Sneath, P.H.A. 1969. Principles of numerical taxonomy. W.H. Freeman and Company, San Francisco.
Swofford, D.L. 2002. PAUP*: Phylogenetic Analysis Using Parsimony (*and Other Methods), Version 4.0 b10. Sinauer Associates, Sunderland, MA.
Wirth, M., Estabrook, G.F. and Rogers, D.A. 1966. A Graph Theory Model for Systematic Biology, with an Example for the Oncidiinae (Orchidaceae). Systematic Zoology 15: 59-69.