Name
sxk_means_groups - determine the 'best' number of clusters in a set of images using K-means classification
Usage
Usage in command lines:
sxk_means_groups.py stackfile output_file <mask_file> --opt_method=k_means_method K1=start_number_of_cluster K2=stop_number_of_clusters --rand_seed=1000 --maxit=max_iter --trials=number_of_trials_of_k_means --crit=criterion_name --CTF --MPI
Usage in python programming:
k_means_groups(stackfile, output_file, mask=Image mask, opt_method=k_means_method, K1=start_number_of_cluster, K2=stop_number_of_clusters, rand_seed=1000, maxit=max_iter, trials=number_of_trials_of_k_means, crit=criterion_name, CTF)
To use the MPI (parallel) version:
- 1. set the MPI flag on the command line
- 2. launch the program through mpirun, e.g. mpirun -np 32 sxk_means_groups.py followed by the remaining parameters
- The above example is for mympi.
Input
- stackfile
- The input stack of images
- output_file
- text file in which the values of the clustering criteria are stored
- mask
- filename of the input image mask. Only pixels of the input images where the mask has a value > 0.5 are considered. Note: the mask has to have the same dimensions as the input images (default = None, entire images will be used)
- K1
- minimum requested number of clusters
- K2
- maximum requested number of clusters
- trials
- number of trials of K-means (see description below) (default one trial)
- opt_method
- optimization method: 'SSE' or 'cla' (default is SSE) (see description below)
- CTF
- if set, CTF information stored in file headers will be used (default no CTF)
- rand_seed
- seed for the random number generator that produces the random initial assignment of images to classes (the usage lines above show 1000)
- crit
- names of the criteria to use: 'all' for all criteria, 'C' for Coleman, 'H' for Harabasz or 'D' for Davies-Bouldin. It is preferable to use the three criteria together ('CHD') and choose the number of clusters that satisfies all of them (see the Description section below). The letters can be combined freely, e.g. 'H', 'CD', 'HC' or 'CDH'.
- MPI
- to use MPI version of k-means groups
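As an illustration of the mask option listed above, the following is a minimal Python sketch (not the program's own code) of what restricting an image to pixels with mask value > 0.5 amounts to. The function name masked_pixels and the representation of images as nested lists are assumptions made only for this example.

```python
def masked_pixels(image, mask):
    """Return the pixels of `image` at positions where `mask` exceeds 0.5.

    `image` and `mask` are 2-D nested lists (a hypothetical stand-in for
    real image objects) and must have identical dimensions.
    """
    same_rows = len(image) == len(mask)
    same_cols = all(len(r) == len(m) for r, m in zip(image, mask))
    if not (same_rows and same_cols):
        raise ValueError("mask and image must have the same dimensions")
    # Keep a pixel only when the corresponding mask value is > 0.5.
    return [p for row, mrow in zip(image, mask)
            for p, m in zip(row, mrow) if m > 0.5]


image = [[1, 2], [3, 4]]
mask = [[1.0, 0.0], [0.4, 0.6]]
selected = masked_pixels(image, mask)  # pixels at (0, 0) and (1, 1)
```

Only the selected pixels would then enter the distance computations; with no mask, every pixel is used.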
Output
- output_file
- text file containing one column per chosen criterion; for example, if crit='CHD', the columns are: (1) number of clusters, (2) values of the Coleman criterion, (3) values of the Harabasz criterion and (4) values of the Davies-Bouldin criterion
- output_file.p
- file containing a gnuplot script that plots the values of all criteria on the same range. Use this command in gnuplot: load 'output_file.p'
- WATCH_GRP_KMEANS or WATCH_MPI_GRP_KMEANS
- file recording the progress of k-means groups. It can be read in real time to watch the evolution of the criteria.
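For post-processing outside gnuplot, the criteria table could be loaded with a few lines of Python. This is a minimal sketch that assumes the output file holds whitespace-separated numeric columns in the order given above for crit='CHD'; the exact file layout is not specified here, and the function name read_criteria and the column names are hypothetical.

```python
def read_criteria(lines, names=("K", "Coleman", "Harabasz", "Davies-Bouldin")):
    """Parse criteria rows into a dict mapping column name -> list of floats.

    Assumes whitespace-separated numeric columns (an assumption about the
    file layout). Lines that do not match are skipped.
    """
    table = {name: [] for name in names}
    for line in lines:
        fields = line.split()
        if len(fields) != len(names):
            continue  # blank line or unexpected number of columns
        try:
            numbers = [float(f) for f in fields]
        except ValueError:
            continue  # non-numeric line, e.g. a header
        for name, value in zip(names, numbers):
            table[name].append(value)
    return table


# Invented rows, for illustration only.
example = ["2 0.5 10.0 1.2", "3 0.7 8.0 0.9", ""]
table = read_criteria(example)

# With a real file one would write, for instance:
#     with open("output_file") as f:
#         table = read_criteria(f)
```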
Description
- The command implements two minimization methods and two different algorithms depending on the CTF flag. In each case, random initialization is used, i.e., initially, images are randomly assigned to K classes.
- Minimization methods:
cla - direct K-means, in which class averages are updated only after all images have been reassigned. The method is fast, but except in trivial cases it fails to find a good assignment.
SSE - class averages are updated after the reassignment of each image. The method is slower (with CTF it is painfully slow), but yields better classification results.
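The difference between the two update schedules can be sketched in plain Python. This is an illustration, not the program's implementation: it assumes that 'cla' denotes the classical scheme in which averages are recomputed once per full pass, it ignores CTF, and the function names and the handling of empty classes (reseeding from a random data point) are choices made only for this example.

```python
import random


def mean(vectors):
    """Component-wise average of a non-empty list of vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def sqdist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def class_average(data, labels, c, rng):
    """Average of class c; an empty class is reseeded from a random point."""
    members = [x for x, l in zip(data, labels) if l == c]
    return mean(members) if members else list(rng.choice(data))


def batch_kmeans(data, k, seed=1000, maxit=100):
    """Averages are recomputed once per pass, after all reassignments."""
    rng = random.Random(seed)
    labels = [rng.randrange(k) for _ in data]  # random initial assignment
    for _ in range(maxit):
        centers = [class_average(data, labels, c, rng) for c in range(k)]
        new_labels = [min(range(k), key=lambda c: sqdist(x, centers[c]))
                      for x in data]
        if new_labels == labels:
            break  # no object changed class
        labels = new_labels
    return labels


def incremental_kmeans(data, k, seed=1000, maxit=100):
    """Averages of the two affected classes are recomputed after each move."""
    rng = random.Random(seed)
    labels = [rng.randrange(k) for _ in data]  # random initial assignment
    centers = [class_average(data, labels, c, rng) for c in range(k)]
    for _ in range(maxit):
        changed = False
        for i, x in enumerate(data):
            old = labels[i]
            new = min(range(k), key=lambda c: sqdist(x, centers[c]))
            if new != old:
                labels[i] = new
                centers[old] = class_average(data, labels, old, rng)
                centers[new] = class_average(data, labels, new, rng)
                changed = True
        if not changed:
            break
    return labels


data = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
        [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]]
batch_labels = batch_kmeans(data, 2)
incremental_labels = incremental_kmeans(data, 2)
```

The per-object variant recomputes averages far more often, which is consistent with it being the slower of the two.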
- The results of K-means classification are (in most cases) not reproducible, i.e., if classification is repeated for the same number of classes but with a different initial assignment (as in this implementation), the result will be different. To find a reproducible result, one is advised to repeat K-means many times and accept the 'best' solution, as identified by the criterion value. For a sufficiently large number of trials and reasonable data, it is possible to find the optimum solution. This process is facilitated by the trials parameter: the program repeats the classification the specified number of times and returns the best solution found.
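The repeat-and-keep-the-best strategy can be sketched as follows. This is a toy illustration on one-dimensional data, not the program's code: the within-class sum of squared errors is assumed as the criterion to minimize, the seeds of successive trials are assumed to be rand_seed, rand_seed + 1, and so on, and the function names kmeans_1d and best_of_trials are hypothetical.

```python
import random


def kmeans_1d(values, k, seed, maxit=100):
    """One K-means trial on 1-D data; returns (sse, labels)."""
    rng = random.Random(seed)
    labels = [rng.randrange(k) for _ in values]  # random initial assignment
    for _ in range(maxit):
        centers = []
        for c in range(k):
            members = [v for v, l in zip(values, labels) if l == c]
            # An empty class is reseeded from a random value (a choice
            # made for this sketch only).
            centers.append(sum(members) / len(members) if members
                           else rng.choice(values))
        new_labels = [min(range(k), key=lambda c: (v - centers[c]) ** 2)
                      for v in values]
        if new_labels == labels:
            break
        labels = new_labels
    # Sum of squared errors around the final class averages.
    sse = 0.0
    for c in range(k):
        members = [v for v, l in zip(values, labels) if l == c]
        if members:
            m = sum(members) / len(members)
            sse += sum((v - m) ** 2 for v in members)
    return sse, labels


def best_of_trials(values, k, trials, rand_seed=1000):
    """Run `trials` independent trials and keep the one with the lowest SSE."""
    results = [kmeans_1d(values, k, rand_seed + t) for t in range(trials)]
    return min(results, key=lambda r: r[0])


values = [0.1, 0.3, 0.2, 5.0, 5.2, 4.9, 9.8, 10.1, 10.0]
best_sse, best_labels = best_of_trials(values, 3, trials=10)
```

By construction, adding trials can only lower (or leave unchanged) the best criterion value found, at the cost of proportionally more computation.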
The program calculates and returns values of the classification quality criteria: Coleman, Harabasz and Davies-Bouldin (as selected with crit). When plotted against the number of clusters, at the number of clusters that best reflects the data structure Coleman should have a local maximum, while Harabasz and Davies-Bouldin should each have a local minimum.
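The selection rule just described can be expressed as a short Python sketch that looks for values of K at which all three conditions hold simultaneously. The function names and the numbers in the example are invented for illustration; real criterion curves are often noisier, and a visual check of the plot remains advisable.

```python
def local_maxima(ks, values):
    """K values whose criterion is strictly larger than both neighbours."""
    return {ks[i] for i in range(1, len(ks) - 1)
            if values[i - 1] < values[i] > values[i + 1]}


def local_minima(ks, values):
    """K values whose criterion is strictly smaller than both neighbours."""
    return {ks[i] for i in range(1, len(ks) - 1)
            if values[i - 1] > values[i] < values[i + 1]}


def candidate_k(ks, coleman, harabasz, davies_bouldin):
    """K values satisfying all three conditions stated in the text above."""
    return sorted(local_maxima(ks, coleman)
                  & local_minima(ks, harabasz)
                  & local_minima(ks, davies_bouldin))


# Invented criterion values for K = 2..6, for illustration only.
ks = [2, 3, 4, 5, 6]
coleman = [1.0, 2.0, 5.0, 3.0, 2.0]
harabasz = [9.0, 7.0, 3.0, 6.0, 8.0]
davies_bouldin = [4.0, 3.0, 1.0, 2.0, 5.0]
candidates = candidate_k(ks, coleman, harabasz, davies_bouldin)
```

Because a local extremum needs a neighbour on each side, the end points K1 and K2 can never be selected; the scanned range should therefore extend beyond the plausible number of clusters in both directions.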
Reference
Richard O. Duda, Peter E. Hart, David G. Stork, Pattern Classification, 2nd edition
Author / Maintainer
Julien Bert
Keywords
- category 1
- APPLICATIONS
Files
statistics.py, sxk_means_groups.py
See also
Maturity
- beta
- works for author, often works for others.
Bugs
None. It is perfect.
