The ACECLUS Procedure

PROC ACECLUS Statement

  • PROC ACECLUS PROPORTION=p | THRESHOLD=t <options>;

The PROC ACECLUS statement invokes the ACECLUS procedure. Table 2 summarizes the options available in the PROC ACECLUS statement. These options are also discussed in the following sections. Note that, if you specify the METHOD=COUNT option, you must specify either the PROPORTION= or the MPAIRS= option. Otherwise, you must specify either the PROPORTION= or THRESHOLD= option.
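The following sketch illustrates the two required-option combinations described above. It assumes the SASHELP.IRIS sample data set is available; the PROPORTION= value of 0.03 is an illustrative choice, not a recommendation.

```sas
/* METHOD=THRESHOLD (the default): PROPORTION= or THRESHOLD= is required */
proc aceclus data=sashelp.iris proportion=0.03;
   var SepalLength SepalWidth PetalLength PetalWidth;
run;

/* METHOD=COUNT: PROPORTION= or MPAIRS= is required */
proc aceclus data=sashelp.iris method=count proportion=0.03;
   var SepalLength SepalWidth PetalLength PetalWidth;
run;
```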

Table 2: Summary of PROC ACECLUS Statement Options

Option         Description

Specify Clustering Options
METHOD=        Specifies the clustering method
MPAIRS=        Specifies number of pairs for estimating within-cluster covariance (when you specify the option METHOD=COUNT)
PROPORTION=    Specifies proportion of pairs for estimating within-cluster covariance
THRESHOLD=     Specifies the threshold for including pairs in the estimation of the within-cluster covariance

Specify Input and Output Data Sets
DATA=          Specifies input data set name
OUT=           Specifies output data set name
OUTSTAT=       Specifies output data set name containing various statistics

Specify Iteration Options
ABSOLUTE       Uses absolute instead of relative threshold
CONVERGE=      Specifies convergence criterion
INITIAL=       Specifies initial estimate of within-cluster covariance matrix
MAXITER=       Specifies maximum number of iterations
METRIC=        Specifies metric in which computations are performed
SINGULAR=      Specifies singularity criterion

Specify Canonical Analysis Options
N=             Specifies number of canonical variables
PREFIX=        Specifies prefix for naming canonical variables

Control Displayed Output
NOPRINT        Suppresses the display of the output
PLOTS=         Specifies ODS Graphics details
PP             Produces a P-P plot of distances between pairs from last iteration
QQ             Produces a Q-Q plot of power transformation of distances between pairs from last iteration
SHORT          Omits all output except for iteration history and eigenvalue table


The following list provides details about the options.

ABSOLUTE

causes the THRESHOLD= value or the threshold computed from the PROPORTION= option to be treated absolutely rather than relative to the root mean square distance between observations. Use the ABSOLUTE option only when you are confident that the initial estimate of the within-cluster covariance matrix is close to the final estimate, such as when the INITIAL= option specifies a data set created by a previous execution of PROC ACECLUS by using the OUTSTAT= option.
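As a hedged illustration of that recommendation, the sketch below runs PROC ACECLUS once to save a TYPE=ACE data set with OUTSTAT=, then reuses it through INITIAL=INPUT= together with ABSOLUTE. The data set names (SASHELP.IRIS, WORK.ACE_STAT) and the PROPORTION= value are assumptions made for the example.

```sas
/* First run: save the estimated within-cluster covariance matrix */
proc aceclus data=sashelp.iris proportion=0.03
             outstat=work.ace_stat noprint;
   var SepalLength SepalWidth PetalLength PetalWidth;
run;

/* Second run: start from the saved estimate and treat the threshold absolutely */
proc aceclus data=sashelp.iris proportion=0.03
             initial=input=work.ace_stat absolute;
   var SepalLength SepalWidth PetalLength PetalWidth;
run;
```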

CONVERGE=c

specifies the convergence criterion. By default, CONVERGE=0.001. Iteration stops when the convergence measure falls below the value specified by the CONVERGE= option or when the iteration limit as specified by the MAXITER= option is exceeded, whichever happens first.

DATA=SAS-data-set

specifies the SAS data set to be analyzed. By default, PROC ACECLUS uses the most recently created SAS data set.

INITIAL=name

specifies the matrix for the initial estimate of the within-cluster covariance matrix. Valid values for name are as follows:

DIAGONAL | D

uses the diagonal matrix of sample variances as the initial estimate of the within-cluster covariance matrix.

FULL | F

uses the total-sample covariance matrix as the initial estimate of the within-cluster covariance matrix.

IDENTITY | I

uses the identity matrix as the initial estimate of the within-cluster covariance matrix.

INPUT=SAS-data-set

specifies a SAS data set from which to obtain the initial estimate of the within-cluster covariance matrix. The data set can be TYPE=CORR, COV, UCORR, UCOV, SSCP, or ACE, or it can be an ordinary SAS data set. See Appendix A, Special SAS Data Sets, for descriptions of CORR, COV, UCORR, UCOV, and SSCP data sets. See the section Output Data Sets for a description of ACE data sets.

If you do not specify the INITIAL= option, the default is the matrix specified by the METRIC= option. If neither the INITIAL= nor the METRIC= option is specified, INITIAL=FULL is used if there are enough observations to obtain a nonsingular total-sample covariance matrix; otherwise, INITIAL=DIAGONAL is used.

MAXITER=n

specifies the maximum number of iterations. By default, MAXITER=10.
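The CONVERGE= and MAXITER= options are often adjusted together when the iteration history shows that the default limit of 10 iterations is reached before convergence. The following is a minimal sketch; the specific values and the SASHELP.IRIS data set are illustrative assumptions.

```sas
/* Tighten the convergence criterion and allow more iterations */
proc aceclus data=sashelp.iris proportion=0.03
             converge=0.0001 maxiter=50;
   var SepalLength SepalWidth PetalLength PetalWidth;
run;
```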

METHOD=COUNT | C | THRESHOLD | T

specifies the clustering method. The METHOD=THRESHOLD option requests a method (also the default) that uses all pairs closer than a given cutoff value to form the estimate at each iteration. The METHOD=COUNT option requests a method that uses a number of pairs, m, with the smallest distances to form the estimate at each iteration.

METRIC=name

specifies the metric in which the computations are performed, implies the default value for the INITIAL= option, and specifies the matrix $\mathbf{Z}$ used in the formula for the convergence measure $e_i$ and for checking singularity of the $\mathbf{A}$ matrix. Valid values for name are as follows:

DIAGONAL | D

uses the diagonal matrix of sample variances $\operatorname{diag}(\mathbf{S})$ and sets $\mathbf{Z} = \operatorname{diag}(\mathbf{S})^{-1/2}$, where the superscript $-1/2$ indicates an inverse factor.

FULL | F

uses the total-sample covariance matrix $\mathbf{S}$ and sets $\mathbf{Z} = \mathbf{S}^{-1/2}$.

IDENTITY | I

uses the identity matrix $\mathbf{I}$ and sets $\mathbf{Z} = \mathbf{I}$.

If you do not specify the METRIC= option, METRIC=FULL is used if there are enough observations to obtain a nonsingular total-sample covariance matrix; otherwise, METRIC=DIAGONAL is used.

The option METRIC= is rather technical. It affects the computations in a variety of ways, but for well-conditioned data the effects are subtle. For most data sets, the METRIC= option is not needed.

MPAIRS=m

specifies the number of pairs to be included in the estimation of the within-cluster covariance matrix when METHOD=COUNT is requested. The value of m must be greater than 0 but less than or equal to $\mathit{totfq} \times (\mathit{totfq} - 1) / 2$, where totfq is the sum of nonmissing frequencies specified in the FREQ statement. If there is no FREQ statement, totfq equals the total number of nonmissing observations.
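To make the upper bound concrete, the sketch below computes it for an assumed 150 nonmissing observations with no FREQ statement (the size of the SASHELP.IRIS sample data set) and then requests a count well inside that bound. The MPAIRS= value of 300 is an arbitrary illustrative choice.

```sas
/* Upper bound on MPAIRS= for totfq = 150: 150 * 149 / 2 = 11175 */
data _null_;
   totfq = 150;                          /* assumed observation count */
   max_pairs = totfq * (totfq - 1) / 2;
   put max_pairs=;
run;

/* Use the 300 closest pairs at each iteration */
proc aceclus data=sashelp.iris method=count mpairs=300;
   var SepalLength SepalWidth PetalLength PetalWidth;
run;
```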

N=n

specifies the number of canonical variables to be computed. The default is the number of variables analyzed. N=0 suppresses the canonical analysis.

NOPRINT

suppresses the display of all output. Note that this option temporarily disables the Output Delivery System (ODS). For more information, see Chapter 23, Using the Output Delivery System.

OUT=SAS-data-set

creates an output SAS data set that contains all the original data as well as the canonical variables having an estimated within-cluster covariance matrix equal to the identity matrix. If you want to create a SAS data set in a permanent library, you must specify a two-level name. For more information about permanent libraries and SAS data sets, see SAS Programmer's Guide: Essentials.

OUTSTAT=SAS-data-set

specifies a TYPE=ACE output SAS data set that contains means, standard deviations, number of observations, covariances, estimated within-cluster covariances, eigenvalues, and canonical coefficients. If you want to create a SAS data set in a permanent library, you must specify a two-level name. For more information about permanent libraries and SAS data sets, see SAS Programmer's Guide: Essentials.

PLOTS <=plot-request>
PLOTS <=(plot-request <…plot-request>)>

specifies options that control the details of the plots. When you specify only one plot request, you can omit the parentheses.

ODS Graphics must be enabled before plots can be requested. For example:

ods graphics on;

proc aceclus proportion=0.03 plots=all;  /* PROPORTION= or THRESHOLD= is required; 0.03 is illustrative */
   var _numeric_;
run;

ods graphics off;

For more information about enabling and disabling ODS Graphics, see the section Enabling and Disabling ODS Graphics in Chapter 24, Statistical Graphics Using ODS.

The plot-requests include the following:

ALL

produces both P-P and Q-Q plots.

NONE

suppresses all plots.

PP

produces a P-P plot of distances between pairs of observations computed in the last iteration.

QQ

produces a Q-Q plot of a power transformation of the distances between pairs of observations computed in the last iteration. Caution: The Q-Q plot can require an enormous amount of computer time.

When ODS Graphics is not enabled, you can specify the PP and QQ options (but not PLOTS=(PP QQ)) to create printer plots.

By default, no plots are produced.

PROPORTION=p
PERCENT=p
P=p

specifies the proportion of pairs to be included in the estimation of the within-cluster covariance matrix. The value of p must be greater than 0. If p is greater than or equal to 1, it is interpreted as a percentage and divided by 100; PROPORTION=0.02 and PROPORTION=2 are equivalent. When you specify METHOD=THRESHOLD, a threshold value is computed from the PROPORTION= option under the assumption that the observations are sampled from a multivariate normal distribution.

When you specify METHOD=COUNT, the number of pairs, m, is computed from PROPORTION=p as

$$m = \left\lfloor \frac{p}{2} \times \mathit{totfq} \times (\mathit{totfq} - 1) \right\rfloor$$

where totfq is the number of total nonmissing observations.
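As a worked instance of this formula, the DATA step below evaluates m for assumed values totfq = 150 and p = 0.03. Both values are illustrative; the step only reproduces the arithmetic and does not call PROC ACECLUS.

```sas
/* m = floor((p / 2) * totfq * (totfq - 1)) */
data _null_;
   totfq = 150;    /* assumed number of nonmissing observations */
   p     = 0.03;   /* assumed PROPORTION= value (already a fraction) */
   m = floor((p / 2) * totfq * (totfq - 1));   /* 0.015 * 22350 = 335.25 */
   put m=;         /* writes m=335 to the log */
run;
```

Because a value of p greater than or equal to 1 is divided by 100 first, PROPORTION=3 would give the same m under these assumptions.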

PP

produces a P-P plot of distances between pairs of observations computed in the last iteration.

PREFIX=name

specifies a prefix for naming the canonical variables. By default, the names are Can1, Can2, …, Cann, where n is the number of canonical variables. If you specify PREFIX=ABC, the variables are named ABC1, ABC2, ABC3, and so on. The number of characters in the prefix plus the number of digits required to designate the variables should not exceed the name length defined by the VALIDVARNAME= system option. For more information about the VALIDVARNAME= system option, see SAS System Options: Reference.
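The following sketch shows how N=, PREFIX=, and OUT= might be combined to save two canonical variables under custom names. The output data set name WORK.ACE_OUT, the prefix Ace, and the use of SASHELP.IRIS are assumptions made for illustration.

```sas
/* Keep two canonical variables, named Ace1 and Ace2, in the output data set */
proc aceclus data=sashelp.iris out=work.ace_out
             proportion=0.03 n=2 prefix=Ace short;
   var SepalLength SepalWidth PetalLength PetalWidth;
run;

/* Inspect the first few rows: original variables plus Ace1 and Ace2 */
proc print data=work.ace_out(obs=5);
   var Species Ace1 Ace2;
run;
```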

QQ

produces a Q-Q plot of a power transformation of the distances between pairs of observations computed in the last iteration. Caution: The Q-Q plot can require an enormous amount of computer time.

SHORT

omits all items from the standard output except for the iteration history and the eigenvalue table.

SINGULAR=g
SING=g

specifies a singularity criterion 0 < g < 1 for the total-sample covariance matrix $\mathbf{S}$ and the approximate within-cluster covariance estimate $\mathbf{A}$. The default is SINGULAR=1E-4.

THRESHOLD=t
T=t

specifies the threshold for including pairs of observations in the estimation of the within-cluster covariance matrix. A pair of observations is included if the Euclidean distance between them is less than or equal to t times the root mean square distance computed over all pairs of observations.
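A minimal sketch of specifying the threshold directly rather than deriving it from PROPORTION= follows. The value 0.5 is an arbitrary illustrative choice, and SASHELP.IRIS is assumed to be available; a suitable threshold depends on the data.

```sas
/* Include pairs whose distance is at most 0.5 times the RMS distance */
proc aceclus data=sashelp.iris threshold=0.5;
   var SepalLength SepalWidth PetalLength PetalWidth;
run;
```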

Last updated: December 09, 2022