The NPAR1WAY Procedure

Exact Tests

PROC NPAR1WAY provides exact p-values for the following nonparametric location and scale tests: Wilcoxon, median, van der Waerden (normal), Savage, Siegel-Tukey, Ansari-Bradley, Klotz, Mood, and Conover. The procedure also provides exact p-values for tests that use the raw input data as scores. These exact tests are available for two-sample and multisample data. The two-sample exact tests are based on simple linear rank statistics, and the multisample exact tests are based on one-way ANOVA statistics.

For two-sample data, PROC NPAR1WAY also provides the exact Kolmogorov-Smirnov test, the exact ratio mean deviations (RMD) test, and exact Hodges-Lehmann confidence limits for the location shift.

Exact tests can be useful in situations where the asymptotic assumptions are not met and therefore the asymptotic p-values might not be close approximations for the true p-values. Standard asymptotic methods involve the assumption that the test statistic follows a particular distribution when the sample size is sufficiently large. When the sample size is not large, asymptotic results might not be valid. Asymptotic results might also be unreliable when the distribution of the data is sparse, skewed, or heavily tied. For more information, see Agresti (2007) and Bishop, Fienberg, and Holland (1975). Exact computations are based on the statistical theory of exact conditional inference for contingency tables, which is reviewed by Agresti (1992).

In addition to the computation of exact p-values, PROC NPAR1WAY provides the option to estimate exact p-values by Monte Carlo simulation. This can be useful for large problems where exact computations require a substantial amount of time and memory but asymptotic approximations might not be sufficient.

The following sections summarize the exact computational algorithms, define the exact p-values that PROC NPAR1WAY computes, discuss the computational resource requirements, and describe the Monte Carlo estimation option.

Computational Algorithms

PROC NPAR1WAY computes exact p-values for nonparametric location and scale tests by using the network algorithm, which was developed by Mehta and Patel (1983). This algorithm provides a substantial advantage over direct enumeration, which can be very time-consuming and is feasible only for small problems. See Agresti (1992) for a review of algorithms for computation of exact p-values, and see Mehta, Patel, and Tsiatis (1984) and Mehta, Patel, and Senchaudhuri (1991) for information about the performance of the network algorithm.

To implement the network algorithm, PROC NPAR1WAY constructs a contingency table from the input data; the contingency table rows correspond to the levels of the classification variable and the table columns correspond to the levels of the response variable. The reference set for the observed contingency table includes all tables that have the same marginal row and column sums as the observed table. Corresponding to this reference set, the network algorithm forms a directed acyclic network that consists of nodes in a number of stages. A path through the network corresponds to a distinct table in the reference set. The distances between nodes are defined so that the total distance of a path through the network is the corresponding value of the test statistic. At each node, the algorithm computes the shortest and longest path distances for all paths that pass through that node. For the two-sample linear rank statistics, which can be expressed as linear combinations of cell frequencies multiplied by increasing row and column scores, PROC NPAR1WAY computes shortest and longest path distances by using the algorithm of Agresti, Mehta, and Patel (1990). For the multisample one-way test statistics, PROC NPAR1WAY computes an upper bound for the longest path and a lower bound for the shortest path by following the approach of Valz and Thompson (1994).

The longest and shortest path distances (bounds) for a node are compared to the value of the observed test statistic to determine whether all paths through the node contribute to the p-value, no paths through the node contribute to the p-value, or neither of these situations occurs. If all paths through the node contribute, the p-value is incremented accordingly, and these paths are eliminated from further analysis. If no paths contribute, these paths are eliminated from further analysis. Otherwise, the algorithm continues to process this node and the associated paths. The algorithm finishes when all nodes have been accounted for.
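The bound comparison described above can be illustrated with a small Python sketch. This is not the Mehta and Patel network algorithm and not the PROC NPAR1WAY implementation; it is a simplified branch-and-bound illustration that assumes two-sample data, a statistic equal to the sum of the sample 1 scores, and a right-sided p-value. The function name right_sided_p_pruned and its arguments are hypothetical.

```python
from math import comb


def right_sided_p_pruned(scores, n1, t_obs):
    """Right-sided exact p-value Prob(T >= t_obs), where T is the sum of the
    scores of the n1 observations assigned to sample 1.

    Hypothetical helper: a simplified illustration of pruning by shortest
    and longest path bounds, not the network algorithm itself.
    """
    s = sorted(scores)
    total = comb(len(s), n1)  # assignments are equally likely under the null
    counted = 0               # assignments with a statistic >= t_obs

    def visit(k, need, partial):
        # Node: scores s[:k] are already decided, `need` more scores must be
        # chosen from s[k:], and `partial` is the sum chosen so far.
        nonlocal counted
        rest = s[k:]
        if need > len(rest):
            return  # no completion exists
        shortest = partial + sum(rest[:need])              # smallest completion
        longest = partial + sum(rest[len(rest) - need:])   # largest completion
        if shortest >= t_obs:
            # All paths through this node contribute: count them at once.
            counted += comb(len(rest), need)
            return
        if longest < t_obs:
            return  # No path through this node contributes.
        # Neither case holds: keep processing this node.
        visit(k + 1, need - 1, partial + s[k])  # s[k] goes to sample 1
        visit(k + 1, need, partial)             # s[k] goes to sample 2

    visit(0, n1, 0)
    return counted / total
```

For example, with ranks 1 through 6 as scores and three observations in sample 1, right_sided_p_pruned(range(1, 7), 3, 14) counts the two assignments whose rank sum is at least 14 out of 20 and returns 0.1. Most branches are resolved by the bounds without being enumerated.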

PROC NPAR1WAY performs the network algorithm by using full numerical precision to represent all statistics, row and column scores, and other quantities in the computations. Although it is possible to use rounding to improve the speed and memory requirements of the algorithm, PROC NPAR1WAY does not use rounding because it might reduce the accuracy of the results.

Definition of p-Values

For two-sample linear rank tests, PROC NPAR1WAY computes exact one-sided and two-sided p-values for each test that you specify in the EXACT statement. For one-sided tests, PROC NPAR1WAY displays the right-sided p-value when the observed value of the test statistic is greater than its expected value. The right-sided p-value is the sum of probabilities for those tables with a test statistic that is greater than or equal to the observed test statistic. Otherwise, when the observed test statistic is less than or equal to its expected value, PROC NPAR1WAY displays the left-sided p-value. The left-sided p-value is the sum of probabilities for those tables with a test statistic that is less than or equal to the observed value. The one-sided p-value can be expressed as

\[
P_1(t) =
\begin{cases}
\mathrm{Prob}(\text{Test Statistic} \ge t) & \text{if } t > \mathrm{E}_0(T) \\
\mathrm{Prob}(\text{Test Statistic} \le t) & \text{if } t \le \mathrm{E}_0(T)
\end{cases}
\]

where t is the observed value of the test statistic and $\mathrm{E}_0(T)$ is the expected value of the test statistic under the null hypothesis. PROC NPAR1WAY computes the two-sided p-value as the sum of the one-sided p-value and the corresponding area in the opposite tail of the distribution of the statistic, equidistant from the expected value. The two-sided p-value can be expressed as

\[
P_2(t) = \mathrm{Prob}\left( \left| \text{Test Statistic} - \mathrm{E}_0(T) \right| \ge \left| t - \mathrm{E}_0(T) \right| \right)
\]
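As a minimal sketch of these two definitions, the following Python function computes both p-values by direct enumeration for a very small two-sample problem. It assumes that the statistic is the sum of the sample 1 scores and that every assignment of observations to sample 1 is equally likely under the null hypothesis. The function name exact_two_sample_p is hypothetical, and direct enumeration is used only for illustration; it is not the algorithm that PROC NPAR1WAY uses.

```python
from fractions import Fraction
from itertools import combinations


def exact_two_sample_p(scores, n1, t_obs):
    """Return (one_sided, two_sided) exact p-values for T = sum of the
    sample 1 scores, by direct enumeration (hypothetical helper)."""
    scores = [Fraction(s) for s in scores]  # exact arithmetic, no rounding
    t_obs = Fraction(t_obs)
    # Statistic for every possible assignment of n1 observations to sample 1.
    stats = [sum(c) for c in combinations(scores, n1)]
    total = len(stats)
    # Expected value of T under the null hypothesis: n1 times the mean score.
    e0 = Fraction(n1) * sum(scores) / len(scores)
    if t_obs > e0:
        # Right-sided p-value.
        one_sided = Fraction(sum(t >= t_obs for t in stats), total)
    else:
        # Left-sided p-value.
        one_sided = Fraction(sum(t <= t_obs for t in stats), total)
    # Two-sided: both tails, equidistant from the expected value.
    two_sided = Fraction(
        sum(abs(t - e0) >= abs(t_obs - e0) for t in stats), total
    )
    return one_sided, two_sided
```

With ranks 1 through 6, three observations in sample 1, and an observed rank sum of 14, the expected value is 10.5, so the right-sided p-value applies. The call exact_two_sample_p(range(1, 7), 3, 14) returns (Fraction(1, 10), Fraction(1, 5)).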

Tests for multisample data are based on one-way ANOVA statistics. For a test of this form, large values of the test statistic indicate a departure from the null hypothesis; the test is inherently two-sided. The exact p-value is the sum of probabilities for those tables having a test statistic greater than or equal to the value of the observed test statistic.

If you specify the POINT option in the EXACT statement, PROC NPAR1WAY provides point probabilities for the exact tests. The point probability is the exact probability that the test statistic equals the observed value. For two-sample data, PROC NPAR1WAY provides point probabilities for the one-sided tests of the linear rank statistics. For multisample data, PROC NPAR1WAY provides point probabilities for the one-way ANOVA tests.

If you specify the MIDP option in the EXACT statement, PROC NPAR1WAY provides exact mid p-values. The exact mid p-value is defined as the exact p-value minus half the exact point probability, which equals the average of $\mathrm{Prob}(\text{Test Statistic} \ge t)$ and $\mathrm{Prob}(\text{Test Statistic} > t)$ for a right-sided test. The exact mid p-value is smaller and less conservative than the nonadjusted exact p-value. For more information, see Agresti (2013, section 1.1.4) and Hirji (2006, sections 2.5 and 2.11.1). For two-sample data, PROC NPAR1WAY provides mid p-values for the one-sided tests of the linear rank statistics. For multisample data, PROC NPAR1WAY provides mid p-values for the one-way ANOVA tests.
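The equivalence of the two descriptions of the right-sided mid p-value follows in one step, writing T for the test statistic and t for its observed value:

\[
\mathrm{Prob}(T \ge t) - \tfrac{1}{2}\,\mathrm{Prob}(T = t)
= \tfrac{1}{2}\left[ \mathrm{Prob}(T \ge t) + \left( \mathrm{Prob}(T \ge t) - \mathrm{Prob}(T = t) \right) \right]
= \tfrac{1}{2}\left[ \mathrm{Prob}(T \ge t) + \mathrm{Prob}(T > t) \right]
\]

As an illustration with hypothetical data, consider a rank sum statistic for six untied observations with three in the first sample, and an observed rank sum of t = 14. Of the 20 equally likely assignments, two have a rank sum of at least 14 and one has a rank sum of exactly 14, so the exact p-value is 2/20 = 0.10, the point probability is 1/20 = 0.05, and the mid p-value is 0.10 - 0.025 = 0.075.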

Computational Resources

PROC NPAR1WAY uses relatively fast and efficient algorithms for exact computations. These algorithms, together with improvements in computing power, make it feasible to perform exact computations for data where previously only asymptotic methods could be applied. Nevertheless, depending on your available computing resources, exact computations for some very large problems might require a prohibitive amount of time and memory. For such large problems, consider whether exact methods are really needed or whether asymptotic methods might give results that are very close to the exact results while requiring much less computing time and memory. When asymptotic methods might not be sufficient for such large problems, consider using Monte Carlo estimation of exact p-values, which is described in the section Monte Carlo Estimation.

There is no formula that can predict in advance how much time and memory are needed to compute an exact p-value for a specific data set and test. The time and memory requirements depend on several factors, which include the following: the total number of observations, the number of rows (class levels) and columns (response levels), the particular arrangement of the observations, and the test to be performed. Generally, larger problems (in terms of total sample size, number of rows, and number of columns) tend to require more time and memory. For a fixed total sample size, time and memory requirements tend to increase as the number of rows and columns increase because this corresponds to an increase in the number of tables in the reference set. For a fixed total sample size, time and memory requirements also tend to increase as the marginal row and column sums become more homogeneous. For more information, see Agresti, Mehta, and Patel (1990) and Gail and Mantel (1977).
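The effect of sample size and of homogeneous margins can be seen in a rough proxy for the size of a two-sample problem: the number of ways to assign observations to the first sample. This count is only an assumption-laden stand-in for the work that direct enumeration would face. It is not the number of nodes in the network algorithm, and it does not predict the time or memory that PROC NPAR1WAY needs. The function name two_sample_assignments is hypothetical.

```python
from math import comb


def two_sample_assignments(n_total, n1):
    """Number of ways to choose which n1 of n_total observations belong to
    sample 1 (hypothetical helper; a rough proxy for problem size)."""
    return comb(n_total, n1)


# Fixed total of 20 observations: balanced samples give far more assignments.
unbalanced = two_sample_assignments(20, 2)
balanced = two_sample_assignments(20, 10)

# Doubling the total sample size with balanced samples.
larger = two_sample_assignments(40, 20)
```

For 20 observations, an 18 and 2 split gives 190 assignments, whereas a 10 and 10 split gives 184,756. Doubling the total to 40 observations with a 20 and 20 split gives 137,846,528,820.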

While PROC NPAR1WAY is computing an exact p-value, you can terminate the computation by pressing the system interrupt key sequence and choosing to stop computations. For more information, see the SAS Companion for your system. After you terminate an exact computation, PROC NPAR1WAY completes all other remaining tasks. The procedure reports missing values for any exact p-values that were not computed before termination.

To limit the amount of time that PROC NPAR1WAY uses for exact computations, you can specify the MAXTIME= option in the EXACT statement. This option sets the maximum amount of clock time (in seconds) that PROC NPAR1WAY can use to compute an exact p-value. If PROC NPAR1WAY does not finish an exact computation in the time that you specify, the procedure terminates the computation and completes the remaining tasks.

Monte Carlo Estimation

When you specify the MC option in the EXACT statement, PROC NPAR1WAY computes Monte Carlo estimates of exact p-values. Monte Carlo estimation can be useful for large problems where exact computations require a substantial amount of time and memory but asymptotic approximations might not be sufficient.

To describe the precision of a Monte Carlo estimate, PROC NPAR1WAY provides the asymptotic standard error and $100(1-\alpha)$% confidence limits. You can specify the confidence level in the ALPHA= option in the EXACT statement; by default, ALPHA=0.01, which produces 99% confidence limits.

You can specify the number of Monte Carlo samples by using the N=n option in the EXACT statement. By default, PROC NPAR1WAY uses 10,000 samples to compute a Monte Carlo estimate. To improve the precision of the Monte Carlo estimates, you can specify a larger value of n; this increases the computation time because more samples are generated. To reduce the computation time, you can specify a smaller value of n.

PROC NPAR1WAY computes a Monte Carlo estimate of an exact p-value by generating a random sample of tables from the reference set for the exact test. The reference set is based on the contingency table of the input data; the contingency table rows correspond to the levels of the classification variable and the table columns correspond to the levels of the response variable. The reference set includes all tables that have the same marginal row and column sums as the observed table.

PROC NPAR1WAY generates a random sample of tables from the reference set by using the algorithm of Agresti, Wackerly, and Boyett (1979), which generates tables in proportion to their hypergeometric probabilities conditional on the marginal frequencies. For each sample table, PROC NPAR1WAY computes the value of the test statistic and compares it to the value of the observed test statistic. To estimate a right-sided p-value, PROC NPAR1WAY counts all sample tables for which the test statistic is greater than or equal to the observed test statistic. The estimate of the p-value is the number of these tables divided by the total number of sample tables, which can be expressed as

\[
\begin{aligned}
\hat{p}_{\mathrm{mc}} &= m / n \\
m &= \text{number of samples for which } (\text{test statistic} \ge t_o) \\
n &= \text{total number of samples} \\
t_o &= \text{observed test statistic}
\end{aligned}
\]

PROC NPAR1WAY computes estimates of left-sided and two-sided exact p-values similarly. For left-sided exact p-values, PROC NPAR1WAY evaluates whether the sample test statistics are less than or equal to the observed test statistic. For two-sided exact p-values, PROC NPAR1WAY compares sample test statistics to the observed test statistic by using the definition of the two-sided p-value ($P_2$) for the test. For more information, see the section Definition of p-Values and descriptions of the individual tests.
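The right-sided estimate m / n can be sketched in Python for two-sample data. This sketch swaps in a simpler sampling technique: instead of the table-generation algorithm of Agresti, Wackerly, and Boyett (1979), it draws a random subset of observations for sample 1, which for two-sample data is assumed to sample tables in proportion to their conditional probabilities. The statistic is assumed to be the sum of the sample 1 scores. The function name mc_right_sided_p and the seed value are hypothetical.

```python
import random


def mc_right_sided_p(scores, n1, t_obs, n=10_000, seed=12345):
    """Monte Carlo estimate m / n of the right-sided exact p-value
    Prob(T >= t_obs), where T is the sum of the sample 1 scores.

    Hypothetical helper: uses random label assignment, not the
    Agresti, Wackerly, and Boyett table-generation algorithm.
    """
    rng = random.Random(seed)  # fixed seed so the estimate is reproducible
    scores = list(scores)
    m = 0  # number of samples with a statistic >= the observed statistic
    for _ in range(n):
        # Draw n1 observations without replacement to form sample 1.
        sample_1 = rng.sample(scores, n1)
        if sum(sample_1) >= t_obs:
            m += 1
    return m / n
```

For ranks 1 through 6 with three observations in sample 1 and an observed rank sum of 14, the exact right-sided p-value is 0.1, and mc_right_sided_p(range(1, 7), 3, 14) returns an estimate close to that value. A left-sided estimate would count the samples with a statistic less than or equal to the observed value instead.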

The variable m has a binomial distribution with n trials and success probability p. The asymptotic standard error of the Monte Carlo estimate is

\[
\mathrm{se}(\hat{p}_{\mathrm{mc}}) = \sqrt{ \hat{p}_{\mathrm{mc}} \left( 1 - \hat{p}_{\mathrm{mc}} \right) / (n - 1) }
\]

PROC NPAR1WAY constructs asymptotic confidence limits for the exact p-value as

\[
\hat{p}_{\mathrm{mc}} \pm \left( z_{\alpha/2} \times \mathrm{se}(\hat{p}_{\mathrm{mc}}) \right)
\]

where $z_{\alpha/2}$ is the $100(1-\alpha/2)$th percentile of the standard normal distribution, and the confidence level is determined by the ALPHA= option in the EXACT statement.

When the Monte Carlo estimate is 0, PROC NPAR1WAY computes confidence limits for the p-value as

\[
\left( 0, \; 1 - \alpha^{(1/n)} \right)
\]

When the Monte Carlo estimate is 1, PROC NPAR1WAY computes confidence limits for the p-value as

\[
\left( \alpha^{(1/n)}, \; 1 \right)
\]
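The standard error and the three forms of confidence limits can be collected in one short Python sketch. The function name mc_confidence_limits is hypothetical, and the sketch assumes only the formulas in this section. It does not truncate the limits to the interval from 0 to 1, because this section does not say whether PROC NPAR1WAY does so.

```python
from math import sqrt
from statistics import NormalDist


def mc_confidence_limits(m, n, alpha=0.01):
    """Return (p_hat, se, (lower, upper)) for a Monte Carlo estimate that is
    based on m of n samples meeting the test criterion (hypothetical helper).
    """
    p_hat = m / n
    # Asymptotic standard error of the Monte Carlo estimate.
    se = sqrt(p_hat * (1 - p_hat) / (n - 1))
    if m == 0:
        # Estimate is 0: limits are (0, 1 - alpha^(1/n)).
        return p_hat, se, (0.0, 1 - alpha ** (1 / n))
    if m == n:
        # Estimate is 1: limits are (alpha^(1/n), 1).
        return p_hat, se, (alpha ** (1 / n), 1.0)
    # z is the 100(1 - alpha/2)th percentile of the standard normal.
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return p_hat, se, (p_hat - z * se, p_hat + z * se)
```

For example, an estimate of 0.1 from the default 10,000 samples has a standard error of about 0.003, so the 99% limits extend roughly 0.0077 on either side of the estimate. Because the standard error shrinks in proportion to the square root of n - 1, quadrupling the number of samples roughly halves the width of the interval.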

Last updated: December 09, 2022