The HPSPLIT Procedure

PARTITION Statement

PARTITION <partition-options>;

The PARTITION statement specifies how observations in the predictor data set are logically partitioned into disjoint subsets for model testing, training, and validation. You can either designate a variable in the predictor data set, together with a set of formatted values of that variable, to determine the role of each observation, or specify proportions to use for random assignment of observations to each role.

Note: Attempting to use a PARTITION statement along with cross validation results in an error.

You can specify only one of the following partition-options:

FRACTION(<TEST=fraction> <VALIDATE=fraction> <SEED=number>)

requests that specified proportions of the observations in the predictor data set be randomly assigned to testing, training, and validation roles. You specify the proportions for testing and validation by using the TEST= and VALIDATE= suboptions, respectively; observations that are not selected for either role are assigned to training. The SEED= suboption sets the seed for the random number generator. Because fraction is a per-observation probability, setting fraction too low can result in an empty or nearly empty testing or validation set. If you omit both the TEST= and VALIDATE= suboptions, then the PARTITION statement is ignored.

Because the FRACTION option specifies a per-observation probability rather than an exact count, the number of observations that are selected for the validation set can vary. Different partitions can also be observed when the number of nodes or threads changes or when PROC HPSPLIT runs in alongside-the-database mode.
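The effect of a per-observation probability can be illustrated with a short DATA step. The following sketch is an illustration only: it does not reproduce the internal algorithm of PROC HPSPLIT, and the data set name WORK.SIM, the variable names, and the sizes (five trials of 1,000 observations) are assumptions chosen for the example.

```sas
/* Illustration only: mimics selecting each observation independently */
/* with probability 0.1. This is not PROC HPSPLIT's internal code.    */
data work.sim;
   call streaminit(1234);             /* fix the random number stream */
   do trial = 1 to 5;
      n_validate = 0;                 /* reset the count for this trial */
      do obs = 1 to 1000;
         /* each observation is selected with probability 0.1 */
         if rand('uniform') < 0.1 then n_validate + 1;
      end;
      output;                         /* one row per trial */
   end;
   keep trial n_validate;
run;

proc print data=work.sim noobs;
run;
```

In a run of this sketch, the values of n_validate generally fall near 100 but need not equal 100 and need not agree across trials, which is the behavior that the preceding paragraph describes.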

The following PARTITION statement shows how to use a probability of choosing a particular observation for the validation set:

partition fraction(validate=0.1 seed=1234);

In this example, any particular observation has a probability of 10% of being selected for the validation set. All observations that are not selected are in the training set. The SEED= suboption specifies the seed that is used for the random number generator. If you omit this suboption, the seed is obtained by reading the time of day from the computer’s clock.
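For context, a PARTITION statement of this form might appear inside a complete PROC HPSPLIT step as in the following sketch. The SASHELP.HEART data set, the response variable, the predictors, and the validation fraction of 0.3 are assumptions made for illustration; they are not taken from this documentation.

```sas
/* Hypothetical example: data set and variables are illustrative choices. */
proc hpsplit data=sashelp.heart;
   class Status Sex Smoking_Status;   /* categorical response and predictors */
   model Status = Sex Smoking_Status AgeAtStart Cholesterol Systolic;
   /* each observation has a 30% chance of going to validation; */
   /* all other observations are used for training               */
   partition fraction(validate=0.3 seed=1234);
run;
```

Specifying SEED= keeps the random assignment tied to a known seed, although, as noted earlier, the partition can still differ when the number of nodes or threads changes.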

ROLEVAR=variable (<TEST='value'> TRAIN='value' VALIDATE='value')

names the variable in the predictor data set whose values are used to assign roles to each observation. You specify the formatted values of this variable, which are used to assign observation roles, in the TRAIN= and VALIDATE= suboptions. You can also assign observations to the test role by specifying the TEST= suboption. You must specify the TRAIN= suboption and either the VALIDATE= or TEST= suboption.

In the following example, the ROLEVAR= option specifies _PARTIND_ as the variable in the predictor data set whose values determine the role of each observation:

partition rolevar=_partind_(TRAIN='1' VALIDATE='0');

The TEST=, TRAIN=, and VALIDATE= suboptions provide the values that indicate whether an observation is in the testing, training, or validation set, respectively. Observations in which the variable is missing or has a value that matches none of the specified values are ignored. Formatting and normalization are performed before comparison, so you should specify numeric variable values as formatted values, as in the preceding example.
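One way to prepare a role variable and use it with the ROLEVAR= option is sketched below. The SASHELP.HEART data set, the output data set WORK.HEART_ROLES, the 70/30 split, and the model variables are assumptions made for illustration. The role variable is numeric, so its values are given in the PARTITION statement as the formatted values '1' and '0'.

```sas
/* Hypothetical example: build a numeric role variable, then use it. */
data work.heart_roles;
   set sashelp.heart;
   if _n_ = 1 then call streaminit(1234);   /* seed the stream once */
   /* 1 = training, 0 = validation (roughly a 70/30 split) */
   if rand('uniform') < 0.7 then _PartInd_ = 1;
   else _PartInd_ = 0;
run;

proc hpsplit data=work.heart_roles;
   class Status Sex Smoking_Status;
   model Status = Sex Smoking_Status AgeAtStart Cholesterol Systolic;
   /* formatted values of _PartInd_ identify each role */
   partition rolevar=_PartInd_(train='1' validate='0');
run;
```

Because the role assignment is stored in the data set, the same partition can be reused across runs, which is not guaranteed when the FRACTION option is used.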

Last updated: December 09, 2022