classify - Unsupervised classification procedure using principal component analysis

SYNOPSIS

classify  [ parameter=value ]  [ inputfile outputfile ]
classify  [ parameter=value ]  [ inputfile ... directory ]
Parameters are: variables, class_var_name, input_mask_set, input_mask_var, resolution, mask_classes, channel_minimum, channel_maximum, target_classes, target_labels, target_valuesn, target_deltasn, list_file, merge_tolerance, single_link, save_clusters, list_clusters, num_clusters, percent_signif, percent_max_siz.

DESCRIPTION

classify performs an unsupervised classification with principal component analysis to distinguish and label image features. Similar features/objects under similar environmental conditions tend to produce similar sensor responses, so the data values associated with a specific image feature cluster statistically around the mean values, in each data channel, that are most characteristic of that feature. These data clusters are utilized to facilitate classification and labeling of image features.

classify identifies and labels data clusters in a scene/image using a multichannel binning and merging classification algorithm, followed by comparison and labeling of clusters with known classes. Clusters whose centers, or mean values, lie within a threshold distance of a user-specified "target feature" mean value, are grouped and labeled so that all image elements associated with targeted clusters (classes) are correspondingly labeled and masks highlighting target features can be generated. classify delivers an output product which consists of masks highlighting targeted image features/classes.

Classification of all image features is first performed by a simple sort of image pixels into data bins. Data bins are grid-cells comprising a grid superimposed onto the multichannel spectral range being utilized. The grid cells should be significantly smaller than the expected dimensions (range of values) of the smallest data clusters sought for classification and labeling. The grid mesh dimensions (cell-counts) for each channel can be specified with the resolution parameter.
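The initial sort of pixels into grid cells can be sketched as follows. This is a minimal illustration, not the code used by classify: the function name bin_pixels is hypothetical, and the sketch assumes that each resolution value is the width of a grid cell in the data units of its channel.

```python
import math
from collections import defaultdict

def bin_pixels(pixels, resolution):
    """Sort pixels into grid cells; returns {cell index tuple: [pixel, ...]}.

    Assumption: resolution[c] is the cell width for channel c.
    """
    cells = defaultdict(list)
    for pixel in pixels:
        # One integer cell index per channel: floor(value / cell width).
        key = tuple(math.floor(v / r) for v, r in zip(pixel, resolution))
        cells[key].append(pixel)
    return dict(cells)

# Three two-channel pixels; the first two fall into the same cell.
pixels = [(7.6, -32.4), (7.7, -32.3), (19.4, -39.1)]
cells = bin_pixels(pixels, resolution=(0.5, 0.5))
print(len(cells), "populated grid cells")
```

Only populated cells are stored, so each dictionary entry corresponds to one initial data cluster.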

classify creates an output dataset with the output variable dataclasses of type byte. dataclasses contains a non-negative integer value when the corresponding image pixel belongs to a target class, otherwise the value is set to badvalue (GP_BAD_BYTE). A valid integer value in dataclasses is the integer specified with the target_labels parameter, for the classes specified on the class_var_name parameter field. Target classes are specified by mean and delta values for the data variables utilized, thereby defining neighborhoods wherein target classes are to be found.

After initially sorting image pixels into their corresponding "bins", providing an initial set of data-clusters, these data-clusters are statistically processed and merged with adjacent clusters, on the basis of covariance information accumulated for each cluster, which allows assessment of similarity between adjacent clusters. Cluster pairs being most similar are merged. Two algorithms are provided for this purpose: A single-linkage method, and an average-linkage method.

The single-linkage algorithm identifies every cluster's most appropriate (closest, most similar) neighboring cluster, thereby specifying a series of cluster pairs which should be merged. Merging of all identified cluster pairs is performed in one pass, essentially simultaneously consolidating large chains of neighboring clusters. With single-linkage, one pass may even be sufficient to consolidate all clusters into a single data cluster, depending upon the modality of the data distribution. This algorithm's performance is faster than N**2, N being the initial count of populated grid-cells. Single linkage tends to consolidate clusters quite uniformly, but also may merge outlier data with larger, well-defined clusters.
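One single-linkage pass can be illustrated with the sketch below. It is an assumption-laden illustration rather than the classify implementation: the function name single_link_pass is hypothetical, plain Euclidean distance between cluster means stands in for the similarity measure, and the brute-force neighbor search used here does not reproduce the performance stated above.

```python
def single_link_pass(means):
    """Link every cluster to its nearest neighbor, then consolidate all
    linked chains in one pass. Returns groups of cluster indices."""
    n = len(means)
    if n < 2:
        return [[i] for i in range(n)]
    parent = list(range(n))  # union-find forest over cluster indices

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    for i in range(n):
        # Nearest other cluster (stand-in for "closest, most similar").
        j = min((k for k in range(n) if k != i),
                key=lambda k: dist2(means[i], means[k]))
        parent[find(i)] = find(j)  # record the pair as merged

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())

means = [(0, 0), (1, 0), (2, 0), (10, 0), (11, 0)]
merged = single_link_pass(means)
print(merged)
```

Clusters 0, 1 and 2 form a chain and are consolidated together in the same pass, which is the chaining behavior described above.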

The average-linkage algorithm differs from single-linkage in that only the closest cluster pair is merged before all affected remaining cluster pairs are reevaluated for closeness and similarity with the merge resultant cluster. This requires additional computational overhead for each merge, so that this algorithm's performance ranges between N**2 and N**3, N being the initial count of populated grid-cells. Average-linkage tends to consolidate large, classifiable data features most rapidly, but ignore outliers.
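The merge-one-pair-then-reevaluate loop can be sketched as follows. The names merge_closest_pair and average_link, the (count, mean) cluster representation, the Euclidean distance, and the stop_count termination are all assumptions made for illustration; classify itself uses covariance-based similarity and its own termination criteria.

```python
def merge_closest_pair(clusters):
    """clusters: list of (count, mean tuple). Merge only the closest pair."""
    best = None
    for i in range(len(clusters)):
        for j in range(i + 1, len(clusters)):
            d2 = sum((a - b) ** 2
                     for a, b in zip(clusters[i][1], clusters[j][1]))
            if best is None or d2 < best[0]:
                best = (d2, i, j)
    _, i, j = best
    (ni, mi), (nj, mj) = clusters[i], clusters[j]
    n = ni + nj
    # Count-weighted mean of the merge resultant cluster.
    merged = (n, tuple((ni * a + nj * b) / n for a, b in zip(mi, mj)))
    rest = [c for k, c in enumerate(clusters) if k not in (i, j)]
    return rest + [merged]

def average_link(clusters, stop_count):
    """Repeat single-pair merges; all distances are reevaluated each time."""
    while len(clusters) > max(stop_count, 1):
        clusters = merge_closest_pair(clusters)
    return clusters

clusters = [(10, (0.0,)), (10, (1.0,)), (5, (10.0,))]
result = average_link(clusters, stop_count=2)
print(result)
```

Because every merge triggers a fresh search over the remaining pairs, the cost per merge is what pushes this approach toward the N**2 to N**3 range.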

A cluster pair is determined to be mergeable (closest, most similar) by means of principal component analysis. A cluster pair's similarity is tested on the basis of data overlap (adjacency and contiguity), which can be measured by comparing the clusters' standard deviations with the separation of their mean values, as assessed in terms of the principal component coordinates of each cluster. Separation of a cluster pair's mean values is determined in terms of the average of the standard deviations of the cluster pair under consideration. Covariance matrices permit principal component transformations to estimate directional standard deviations, providing a measure of data overlap for all relative cluster orientations. While it generally suffices to identify merely the closest neighbor for any given cluster, a difference threshold parameter merge_tolerance is provided to optionally ensure that significantly unrelated clusters are not inadvertently consolidated. Additional control parameters provided are num_clusters, which specifies the final number of clusters sought to contain percent_signif percent of the data distribution, thereby specifying a termination criterion. Also, percent_max_siz is provided to limit the largest cluster size as a percentage of the data distribution, to prevent premature consolidation of large data clusters.
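One possible reading of this overlap test is sketched below. It is an illustration under stated assumptions, not the classify implementation: the function names are hypothetical, the directional standard deviation is taken along the line joining the two means as sqrt(u^T C u) (which equals the value obtained by projecting u onto the principal components of C), and merge_tolerance is applied as a factor on twice the average standard deviation.

```python
import math

def directional_std(cov, u):
    """Standard deviation of a cluster along unit vector u: sqrt(u^T C u)."""
    n = len(u)
    var = sum(u[i] * cov[i][j] * u[j] for i in range(n) for j in range(n))
    return math.sqrt(max(var, 0.0))

def separation_ratio(mean_a, cov_a, mean_b, cov_b):
    """Distance between means, in units of twice the average directional
    standard deviation of the two clusters."""
    d = [b - a for a, b in zip(mean_a, mean_b)]
    dist = math.sqrt(sum(x * x for x in d))
    if dist == 0.0:
        return 0.0
    u = [x / dist for x in d]  # unit vector from cluster a toward cluster b
    avg_std = 0.5 * (directional_std(cov_a, u) + directional_std(cov_b, u))
    if avg_std == 0.0:
        return math.inf
    return dist / (2.0 * avg_std)

def mergeable(mean_a, cov_a, mean_b, cov_b, merge_tolerance=10.0):
    """Assumed rule: merge candidates lie closer than merge_tolerance."""
    return separation_ratio(mean_a, cov_a, mean_b, cov_b) < merge_tolerance

cov_a = [[4.0, 0.0], [0.0, 1.0]]  # elongated along the first channel
cov_b = [[1.0, 0.0], [0.0, 1.0]]
ratio = separation_ratio((0.0, 0.0), cov_a, (6.0, 0.0), cov_b)
print(ratio, mergeable((0.0, 0.0), cov_a, (6.0, 0.0), cov_b))
```

Using the covariance along the joining direction means an elongated cluster counts as closer to neighbors along its long axis than to neighbors across it.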

target_classes is the parameter to specify a sequence of target class labels and implicitly the corresponding sequence indexes for these target labels. target_valuesn specifies the expected values for classes corresponding to each target category. A neighborhood around these target_valuesn must also be defined with the target_deltasn parameter in order to provide a range of values within which feature classes are expected to cluster. Any classes falling within these ranges of values are identified and labeled as belonging to the target class:

target_values - target_deltas <= target class values <= target_values + target_deltas
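The range test above can be sketched as follows. The function name label_cluster is hypothetical, and returning the first matching target is an assumption, since the handling of a cluster that falls within several target ranges is not specified here. The target values and deltas are those of the example session in EXAMPLES; the cluster means are made up for illustration.

```python
def label_cluster(mean, target_values, target_deltas, target_labels):
    """Return the label of the first target whose value range contains the
    cluster mean in every channel, or None if no target matches.

    target_values[v][t] and target_deltas[v][t] are indexed by input
    variable v and target t, mirroring target_valuesn / target_deltasn.
    """
    for t, label in enumerate(target_labels):
        inside = all(
            target_values[v][t] - target_deltas[v][t]
            <= mean[v]
            <= target_values[v][t] + target_deltas[v][t]
            for v in range(len(mean))
        )
        if inside:
            return label
    return None

target_values = [[20, 40, 55], [-5, 5, 5], [-5, -15, -25]]
target_deltas = [[10, 10, 5], [30, 30, 30], [5, 5, 5]]
target_labels = [1, 2, 3]

hit = label_cluster((19.4, -30.0, -7.0), target_values, target_deltas, target_labels)
miss = label_cluster((7.6, -32.5, 19.5), target_values, target_deltas, target_labels)
print(hit, miss)
```

A cluster that matches no target is left unlabeled, which corresponds to the bad value in the output variable.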

Not specifying target classes and target labels suppresses generation of class data; the output dataset will then only contain data pertinent to the resultant data clusters.

Attributes are written to the output dataset which pertain particularly to the specified classes. (dataclasses)_names (default) lists the class names, and (dataclasses)_labels (default) lists the class labels, which were specified with the target_classes and target_labels parameters. These attributes allow the classification data to be reused in further processing and classification, such as when class data is eventually reused as an input mask, specified with the input_mask_var parameter.

PARAMETERS

variables
Names of input variables contributing to the classification of data. All input variable names are specified as space delimited character strings on one input line. An attribute called var_names is saved in the output dataset containing this list of variable names.
No default.
class_var_name
OPTIONAL. Name of output variable (byte).
The default is class_var_name=dataclusters
target_classes
target_classes specifies label names associated with each target category of data classes. Names are character strings, space delimited, with one name per class label. An attribute called <class_var_name>_class_name is saved in the output dataset, containing this list of target class names.
No default.
target_labels
target_labels specifies integer labels associated with each target category of data classes. Values must be non-negative integer values, one value per class. Values may be repeated. target_labels are those values used to label each class in the output variable specified by class_var_name.
No default.
 
target_valuesvarn
target_valuesvarn specifies expected target values for all target classes and for each input variable (variable index=varn). This is facilitated by a loop over all input variables. Per input variable, expected values for all targets are specified as an array of input real values.
No default.
target_deltasvarn
target_deltasvarn specifies expected target value ranges for all target classes and for each input variable (variable index=varn). This is facilitated by a loop over all input variables. Per input variable, expected value deltas for all targets are specified as an array of input real values.
No default.
 
resolution
resolution specifies the data resolution per input channel, in corresponding units, for the initial assignment of data elements to the cluster-grid. The array of real values specifies the resolution limit values below which data elements might be misclassified because of indistinguishability. Resolution limits should be significantly smaller (about 0.1 times) than the smallest relevant data ranges of the final clusters expected from the classification.
The default is resolution=[0.5 0.5 ... 0.5].
num_clusters
num_clusters specifies the desired number of clusters to contain the significant image feature data distributions. This parameter is to be used with respect to the percentage of data specified by the percent_signif parameter, which specifies the percentage of data which is to be distributed amongst the num_clusters largest clusters.
The default is num_clusters=10.
percent_signif
percent_signif specifies the desired percentage of data to be distributed amongst the num_clusters largest clusters. This parameter is to be used with respect to the number of clusters specified by the num_clusters parameter, which specifies the desired number of clusters to contain the significant image feature data distributions.
The default is percent_signif=90.
percent_max_siz
percent_max_siz specifies the maximum allowable percentage of data to be consolidated into any one cluster. This parameter prevents premature cluster merging which would prevent feature identification.
The default is percent_max_siz=100.
merge_tolerance
merge_tolerance specifies a tolerance for merging clusters as a fraction of twice their average standard deviations. Clusters separated by distances less than this fraction of twice their average standard deviations are considered for merging.
The default is merge_tolerance=10.
single_link
single_link enables the single-linkage algorithm instead of the default average-linkage method.
The default is single_link=no.
save_clusters
save_clusters enables the saving of generated cluster data to the output file. save_clusters=yes allows reuse of the results of the split-and-merge clustering algorithm for further class labeling, thereby removing the need to rerun that portion of the classification procedure. The output variables generated for this purpose are cluster_means, cluster_stddev, cluster_min, cluster_max, cluster_datacount, cluster_index, imagedataclusters. Use procedure reclassify to classify anew using these output variables.
The default is save_clusters=no.
list_clusters
list_clusters enables displaying of cluster information upon completion. Cluster sizes, mean, min, max, and covariance information are displayed.
The default is list_clusters=no.
channel_minimum
OPTIONAL. Specifies the lowest acceptable input values for each input channel. Input data values less than the channel_minimum values are ignored and handled as bad values.
The default is -10000 for each channel.
channel_maximum
OPTIONAL. Specifies the highest acceptable input values for each input channel. Input data values greater than the channel_maximum values are ignored and handled as bad values.
The default is 10000 for each channel.
input_mask_set
Specifies an optional dataset name containing a variable with a class mask for highlighting/selecting subsets of image pixels. Normally this dataset with a class mask is the output of a previous classification performed upon the same input data.
The default is input_mask_set="".
input_mask_var
Specifies an optional variable name in the dataset given by input_mask_set, if specified, otherwise in the input imagery dataset, which contains an input class mask for highlighting/selecting subsets of the image pixels.
The default is input_mask_var="".
mask_classes
Relevant only when a variable name is specified with input_mask_var. Specifies a list of class names, associated with the input class mask given by input_mask_var, which defines the desired input data subsets for classification. Only data associated with these class names will be used for classification. The excluded classes' data is ignored.
The default is mask_classes="".
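
The interplay of num_clusters, percent_signif and percent_max_siz described in the entries above can be sketched as follows. This is one possible reading, not the classify implementation: the function names are hypothetical, and the sketch assumes that merging stops once the num_clusters largest clusters hold at least percent_signif percent of the data, and that a merge is refused when the combined cluster would exceed percent_max_siz percent of the data.

```python
def merging_done(cluster_sizes, num_clusters=10, percent_signif=90.0):
    """True once the num_clusters largest clusters hold at least
    percent_signif percent of all data elements (assumed criterion)."""
    total = sum(cluster_sizes)
    if total == 0:
        return True
    largest = sorted(cluster_sizes, reverse=True)[:num_clusters]
    return 100.0 * sum(largest) / total >= percent_signif

def merge_allowed(size_a, size_b, total, percent_max_siz=100.0):
    """True if the merged cluster stays within percent_max_siz percent
    of the data (assumed rule)."""
    return 100.0 * (size_a + size_b) / total <= percent_max_siz

sizes = [50, 30, 10, 5, 5]
print(merging_done(sizes, num_clusters=2, percent_signif=90.0))
print(merging_done(sizes, num_clusters=3, percent_signif=90.0))
print(merge_allowed(50, 30, total=100, percent_max_siz=75.0))
```

With the default percent_max_siz=100 the size limit never refuses a merge, so only the num_clusters and percent_signif criterion is active.
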
EXAMPLES

This example labels image features for three class categories. Note that class merging proceeds iteratively until no classes remain to be merged.

% classify
in/out files   : char(255) ? g8.97140.1700.reg g8.97140.1700.class
variables      : char(255) ? gvar_ch1 gvar_ch2 gvar_ch5
single_link    : char(  3) ? [no] y
input_mask_set : char(255) ? []
input_mask_var : char(255) ? []
save_clusters  : char(  3) ? [no] y
list_clusters  : char(  3) ? [no] y
list_file      : char(255) ? [clusters.list]
target_classes : char(255) ? class1 class2 class3
target_labels  : int(  3)  ? [0 1 2] 1 2 3
targets for variable gvar_ch1:
target_values1  : real(  3) ? 20 40 55
target_deltas1  : real(  3) ? 10 10 5
targets for variable gvar_ch2:
target_values2  : real(  3) ? -5 5 5
target_deltas2  : real(  3) ? 30 30 30
targets for variable gvar_ch5:
target_values3  : real(  3) ? -5 -15 -25
target_deltas3  : real(  3) ? 5 5 5
resolution     : real(  3) ? [0.5 0.5 0.5] 0.3 0.6 0.4
num_clusters   : int       ? [10]  20
percent_signif : int       ? [95]  95
percent_max_size : int     ? [100]
merge_tolerance: real      ? [10.] 10.
Initializing image data
Initializing output data
Initializing output variable
Initializing clusters
Initializing grid
Looping over image elements:
******
Marking empty grid-cells
1562 clusters before merging.
loop to merge clusters:
****************
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
loop to merge clusters:
****
+++++++++++++++++++++++++++++++++++++++++++++
loop to merge clusters:
*
++++++++++++++++++++++++++
 Total clusters counted before sorting = 48
 Total clusters counted after sorting = 48
********************************
Cluster statistics:
Number  Datasize  Mean        Min        Max         Covariance | inverse Covariance
------  --------  ----        ----       ----        -------------------------------
0       79909     7.61293     1.89000    13.40000     4.51721     0.05035    -0.07509
                -32.45867   -39.29000   -26.98000    -0.02759     7.53153    -0.06313
                 19.47752     7.93000    33.90000     3.52624     4.94731    16.48094

1       36874    19.43652     3.66000    37.59000    43.85708    -0.02970     0.01643
                -39.15960   -49.58000   -28.87000     4.78384     7.76362    -0.04223
                -12.23012   -37.73000     2.94000   -10.94069     5.25922    29.57635

2       36789     3.91014     1.11000     7.25000     2.22222     0.01481     0.11266
                -29.65788   -33.37000   -23.76000    -0.81703     4.96106    -0.14394
                 12.62985    -3.53000    19.38000    -1.45396     3.18554     6.41070
 :        :          :          :             :          :         :            :
 :        :          :          :             :          :         :            :
 :        :          :          :             :          :         :            :
 :        :          :          :             :          :         :            :
 :        :          :          :             :          :         :            :
 
Total clusters counted while labeling = 48
********************************
g9.97241.1700.class: classification completed.

NOTES

When labeling classes, it may be necessary to include amongst the target-classes one or several classes which serve to identify features which are similar (within the range of defined class values) yet distinct from the sought classes. This protects from incorrect labeling when class ranges are expected to overlap.

SEE ALSO

cloudprod, cloudmask, reclassify

Last Update: $Date: 2000/12/07 19:55:11 $