Because the cumulative distribution function (CDF) is the inverse of the quantile function, you can estimate the CDF and other CDF-based statistics by inverting the relevant quantile function estimates. The CONDDIST statement in the QUANTREG procedure performs this type of analysis by estimating the conditional and marginal probability functions for the response random variable Y. These probability functions include the following:
- F(y|x) for the conditional CDF and f(y|x) for the conditional probability density function (PDF), which can be estimated by using the model
- F_Y(y) for the observable marginal CDF and f_Y(y) for the observable marginal PDF, which can be estimated from the observed response values without using the model
- F̃_Y(y) for the counterfactual marginal CDF and f̃_Y(y) for the counterfactual marginal PDF, which can be estimated from the quantile process predictions of the response variable, with the covariates integrated out
The QUANTREG procedure estimates the conditional quantile function, Q(τ|x), by using the fast quantile process regression (FQPR) algorithm. If you specify the QUANTILE=FQPR option in the MODEL statement, the CONDDIST statement uses the FQPR parameter estimates to estimate Q(τ|x). For more information about the FQPR option, see the section Fast Quantile Process Regression. If the QUANTILE=FQPR option is not used in the MODEL statement, the first CONDDIST statement uses the FQPR algorithm to compute the estimate Q̂(τ|x) for all the CONDDIST statements. This Q̂(τ|x) is equal to the Q̂(τ|x) that is output from the QUANTILE=FQPR option when you use the default FQPR suboption values.
For purposes of conceptual and computational simplicity, the CONDDIST statement estimates the quantile functions on a quantile-level grid {τ_j = τ_L + (j−1)Δτ : j = 1, …, q}, where τ_L is the lower end of the grid and Δτ is the step length of the grid. You can specify τ_L and q by respectively using the L=value and N=n suboptions of the QUANTILE=FQPR option in the MODEL statement. You can also specify the upper end of the grid, τ_U, by using the U=value suboption of the QUANTILE=FQPR option in the MODEL statement, such that Δτ = (τ_U − τ_L)/(q − 1). By default, the grid size q is the smaller of 100 and half the number of training observations in the DATA= data set, and τ_L and τ_U take the default values of the L= and U= suboptions.
The estimated quantile function {Q̂(τ_j|x) : j = 1, …, q} is also called a CDF sample for F(y|x).
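The grid construction can be illustrated with a short Python sketch (not SAS code; the endpoint defaults `tau_l=0.01` and `tau_u=0.99` are assumptions for illustration, not the procedure's actual L= and U= defaults):

```python
import numpy as np

def quantile_grid(n_train, tau_l=0.01, tau_u=0.99, q=None):
    """Evenly spaced quantile-level grid from tau_l to tau_u.
    The endpoint defaults are illustrative assumptions only."""
    if q is None:
        q = min(100, n_train // 2)  # default grid size described above
    return np.linspace(tau_l, tau_u, q)

grid = quantile_grid(n_train=500)  # 100 evenly spaced levels
```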
The QUANTREG procedure computes the quantile function estimates on the DATA= data set in the PROC QUANTREG statement. To test the fitted model on more observations, you can specify a separate data set by using the TESTDATA= option in the CONDDIST statement. Without assuming that the TESTDATA= data set and the DATA= data set share the same conditional distribution for Y, the CONDDIST statement estimates counterfactual probability functions for the response variable Y of the TESTDATA= data set that impose the quantile function Q(τ|x) of the DATA= data set on the TESTDATA= data set.
The CONDDIST statement estimates F(y|x) by inverting the estimated quantile function on the quantile-level grid, such that the estimated CDF F̂(y|x) satisfies

F̂(Q̂(τ_j|x) | x) = τ_j for j = 1, …, q

By definition, a quantile function must be nondecreasing, so that Q(τ_i|x) ≤ Q(τ_j|x) for any τ_i ≤ τ_j. However, quantile regression estimates can result in crossed quantile predictions. In other words, it is possible that Q̂(τ_i|x) > Q̂(τ_j|x) for some τ_i < τ_j and some x. This predicament is called crossing. To avoid crossing, the CONDDIST statement defines Q̃(τ_j|x) as the jth-smallest prediction among {Q̂(τ_1|x), …, Q̂(τ_q|x)}. Therefore, it is possible that Q̃(τ_j|x) ≠ Q̂(τ_j|x) for some τ_j and x.
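The crossing fix amounts to sorting the predictions at a fixed x, a form of monotone rearrangement. A minimal sketch:

```python
import numpy as np

def fix_crossing(q_pred):
    """Assign the jth quantile level the jth-smallest prediction,
    which guarantees a nondecreasing quantile function."""
    return np.sort(np.asarray(q_pred))

# crossed predictions at increasing quantile levels
q_tilde = fix_crossing([1.2, 0.9, 1.5, 1.4])  # nondecreasing order
```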
For an individual observation, the CONDDIST statement assigns the type “Fit for Obs” to this conditional CDF sample and labels this sample by using the observation ID value (if available) or the observation index.
You can also request the conditional CDF sample at the average covariate values, x̄, by using the SHOWAVG option for the training average or the TESTDATA(SHOWAVG)= option for the test average (or both).
The CONDDIST statement assigns the type “Fit at Average” to these conditional CDF samples. For training data, the CONDDIST statement labels this CDF sample as “TrainAvg” and assigns the average of the training response values as its response value. For test data, the CONDDIST statement labels this CDF sample as “TestAvg” and assigns the average of the test response values as its response value.
The CONDDIST statement estimates the observable marginal CDF F_Y(y) by inverting the empirical quantile function Q̂_Y(τ) of the response variable. Let y_(1) ≤ ⋯ ≤ y_(n) denote the sorted response values for either the training data that you can specify by using the DATA= data set in the PROC QUANTREG statement or the test data that you can specify by using the TESTDATA= data set in the CONDDIST statement. If you assign the quantile level τ_i = i/n to the response value y_(i), then Q̂_Y(τ) is defined as

Q̂_Y(τ) = y_(i) for τ_{i−1} < τ ≤ τ_i, i = 1, …, n, with τ_0 = 0

To be consistent with the quantile-level grid that is used for estimating the conditional probability functions, the CONDDIST statement estimates F_Y(y) by using the CDF sample {Q̂_Y(τ_j) : j = 1, …, q}.
The CONDDIST statement assigns the type “Observed” to these observed marginal CDF samples. For training data, the CONDDIST statement labels this CDF sample as “TrainObs” and assigns the average of the training response values as its response value. For test data, the CONDDIST statement labels this CDF sample as “TestObs” and assigns the average of the test response values as its response value.
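A hedged sketch of inverting the empirical quantile function onto a quantile-level grid; the exact indexing convention that the procedure uses may differ:

```python
import numpy as np

def observed_cdf_sample(y, grid):
    """Evaluate the empirical quantile function of y on a grid of
    quantile levels: level tau maps to the ceil(tau*n)th-smallest value."""
    y_sorted = np.sort(np.asarray(y, dtype=float))
    n = len(y_sorted)
    idx = np.ceil(np.asarray(grid) * n).astype(int) - 1
    return y_sorted[np.clip(idx, 0, n - 1)]

observed_cdf_sample([3.0, 1.0, 2.0, 4.0], [0.25, 0.5, 0.75, 1.0])
```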
Let F(y|x) denote the conditional distribution of Y given x for the DATA= data set, and let G(x) denote the marginal distribution of the explanatory covariates vector x for the TESTDATA= data set. Then the counterfactual marginal distribution of Y for the TESTDATA= data set is defined as

F̃_Y(y) = ∫ F(y|x) dG(x)
The CONDDIST statement estimates F̃_Y(y) by inverting the quantile function of the quantile process predictions for the response variable. Let {Q̂(τ_j|x_i) : j = 1, …, q; i = 1, …, m} denote the quantile predictions for all the m observations of the TESTDATA= data set, and let v_(k) denote the kth-smallest value in this set of mq predictions. The CONDDIST statement estimates F̃_Y(y) by defining the CDF sample

Q̃_Y(τ_j) = v_(k) with k = ⌈τ_j · mq⌉, j = 1, …, q
The CONDDIST statement assigns the type "Fit and Pooled" and the label "TestFit" to this fitted marginal CDF sample, and assigns the average of the test response values as its response value.
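Conceptually, the TestFit sample pools all quantile predictions and reads each grid level off as an order statistic of the pool. A sketch under the order-statistic convention assumed above:

```python
import numpy as np

def pooled_cdf_sample(pred_matrix, grid):
    """Counterfactual marginal CDF sample: pool all m*q quantile
    predictions and take the ceil(tau*m*q)th-smallest value per level."""
    pool = np.sort(np.asarray(pred_matrix, dtype=float).ravel())
    n = len(pool)
    idx = np.ceil(np.asarray(grid) * n).astype(int) - 1
    return pool[np.clip(idx, 0, n - 1)]
```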
Recall that the observable marginal CDF is also available for the TESTDATA= data set. For clarity, let F_Y^test(y) denote the observable marginal CDF for the TESTDATA= data set. Then, by comparing the fitted marginal CDF sample for F̃_Y(y) with the observed marginal CDF sample for F_Y^test(y), you can test whether the TESTDATA= data set follows the same model that is built on the DATA= data set. The CONDDIST statement supports the Mann-Whitney U test for this purpose when you use the MWU option.
The Mann-Whitney U test (also called the Wilcoxon rank-sum test) is a nonparametric two-sample test. Let n_1 denote the size of the first sample {u_1, …, u_{n_1}} and n_2 denote the size of the second sample. By merging the two samples into one ordered set, the statistic of the Mann-Whitney U test is defined as

U = Σ_{i=1}^{n_1} R(u_i) − n_1(n_1 + 1)/2

where R(u_i) is the rank of u_i in the merged set. Under some regularity conditions, U asymptotically follows a normal distribution whose expectation equals n_1 n_2/2 and whose variance equals n_1 n_2 (n_1 + n_2 + 1)/12.
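The statistic and its normal approximation can be sketched in a few lines of self-contained Python (no tie correction here, which the procedure may apply):

```python
import math

def mwu_test(sample1, sample2):
    """Mann-Whitney U statistic with the asymptotic normal
    approximation given above; assumes no ties."""
    n1, n2 = len(sample1), len(sample2)
    merged = sorted([(v, 0) for v in sample1] + [(v, 1) for v in sample2])
    # rank sum of the first sample (ranks start at 1)
    r1 = sum(rank + 1 for rank, (_, src) in enumerate(merged) if src == 0)
    u = r1 - n1 * (n1 + 1) / 2
    z = (u - n1 * n2 / 2) / math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    p = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return u, z, p
```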
You can perform the Mann-Whitney U test against the observed marginal CDF sample of the response variable for the following data sets:
The null and alternative hypotheses of the Mann-Whitney U test for the CONDDIST statement are, respectively,

H_0: Pr(Y* > Y_0) = 1/2 and H_1: Pr(Y* > Y_0) ≠ 1/2

where Y* denotes the response random variable for a CDF under the hypothesis test, such as F(y|x) or F̃_Y(y), and Y_0 denotes the random variable for the observable marginal CDF for either the DATA= data set or the TESTDATA= data set.
You can specify the size of the CDF samples by using the MWU(SAMPSIZE=NOBS | NQ) option or the TESTDATA(MWU(SAMPSIZE=NOBS | NQ)) option in the CONDDIST statement. The TESTDATA(MWU(SAMPSIZE=NOBS | NQ)) option overrides the MWU(SAMPSIZE=NOBS | NQ) option.
The SAMPSIZE=NOBS suboption requests that the size of the TrainObs CDF sample, the TestFit CDF sample, and all the predicted observationwise CDF samples be equal to the number of training observations in the DATA= data set in the PROC QUANTREG statement; and it requests that the size of the TestObs CDF sample be equal to the number of testing observations in the TESTDATA= data set in the CONDDIST statement. The SAMPSIZE=NOBS suboption is appropriate if the size of the quantile-level grid is larger than the number of training observations in the DATA= data set in the PROC QUANTREG statement. Otherwise, the Mann-Whitney U test could output an imprecise p-value because the fitted CDF samples might not sufficiently represent the CDF information that the original training data provide.
The SAMPSIZE=NQ suboption requests that the size of the CDF samples for the MWU tests be equal to the size of the quantile-level grid. The SAMPSIZE=NQ suboption is appropriate if the size of the quantile-level grid is smaller than both the number of training observations in the DATA= data set in the PROC QUANTREG statement and the number of testing observations in the TESTDATA= data set in the CONDDIST statement. Otherwise, the Mann-Whitney U test could output a smaller p-value and incorrectly reject the null hypothesis.
You can specify the MCDF option in the CONDDIST statement to perform the marginal distribution analysis by using a weighted bootstrap resampling method (Praestgaard and Wellner 1993). The bootstrap method iterates the following steps:
1. Assign a probability weight w_i to each of the n training observations in the DATA= data set in the PROC QUANTREG statement, where n is the total number of training observations. If you specify weights ω_i by using the WEIGHT statement, w_i is generated from the Gamma(ω_i, 1) distribution and normalized by using the factor n/Σ_{i=1}^n w_i, such that the finally assigned w_i satisfy Σ_{i=1}^n w_i = n. Otherwise, w_i is generated from the Exponential(1) distribution and normalized in the same way, such that Σ_{i=1}^n w_i = n.
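A minimal sketch of this weight-generation step, assuming Gamma(ω_i, 1) draws for WEIGHT-statement weights ω_i and Exponential(1) draws otherwise (the Gamma parameterization is an assumption):

```python
import numpy as np

def bootstrap_weights(n, user_weights=None, rng=None):
    """Draw weighted-bootstrap probability weights and normalize
    them so that they sum to n."""
    if rng is None:
        rng = np.random.default_rng(0)
    if user_weights is not None:
        # assumed parameterization: shape = user weight, scale = 1
        w = rng.gamma(shape=np.asarray(user_weights, dtype=float), scale=1.0)
    else:
        w = rng.exponential(scale=1.0, size=n)
    return w * (n / w.sum())  # normalization factor n / sum(w)
```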
2. Compute the following statistics by using the w-weighted training observations:

   a. Quantile process regression model: Q̂(τ|x), by using the fast quantile process regression algorithm. You can use the FQPR(N=n) option in the MODEL statement to specify the quantile-level grid for your quantile process model.

   b. The TrainObs marginal CDF sample for the w-weighted response variable. Let y_(1) ≤ ⋯ ≤ y_(n) denote the sorted response values for the training data; let w_(1), …, w_(n) denote their associated normalized bootstrap weights; and define W_i = Σ_{k=1}^i w_(k). Then the following formula computes a size-m TrainObs marginal CDF sample:

      Q̂_Y(j/m) = y_(i) for W_{i−1} < jn/m ≤ W_i, j = 1, …, m
For the estimates in the marginal distribution comparison table, the size of the TrainObs marginal CDF sample equals the number of training observations. For the marginal CDF plot, the size of the TrainObs marginal CDF sample equals the number of quantile levels in the fast quantile process regression algorithm.
3. Cumulate the sum and the sum of squares of each statistic that is computed in Step 2 across the bootstrap iterations.
If you specify the TESTDATA= data set in the CONDDIST statement, the weighted bootstrap method also iterates the following steps:
4. Assign a probability weight w*_i to each of the testing observations by using the same method that is described in Step 1, applied to the TESTDATA= data set.

5. Compute quantile predictions for all the testing observations by using the quantile process regression model that is fitted in Step 2a.
6. Compute the following statistics:

   a. The TestObs marginal CDF sample for the w*-weighted response variable. This marginal CDF sample is computed by using the same formula as in Step 2b but on the sorted testing response values.

   b. The TestFit marginal CDF sample for the pooled quantile predictions from Step 5. This marginal CDF sample is computed by using the same formula as in Step 2b but on the sorted pooled quantile predictions, where the weight of each quantile prediction equals the weight of its associated testing observation. You can use the QRNG= suboption of the MCDF option to limit the range of the quantile predictions. By default (QRNG=BOTH), the quantile predictions are limited to the range of the observed response values in the combined training and testing data: a quantile prediction that is smaller than the minimum of all the training and testing response values is reset to that minimum, and a quantile prediction that is larger than the maximum is reset to that maximum.

   c. The CDF comparison scores, including the MWU test, mean difference, and median difference.
For the estimates in the marginal distribution comparison table, the size of the TestObs and TestFit marginal CDF samples equals the number of testing observations. For the marginal CDF plot, the size of the TestObs and TestFit marginal CDF samples equals the number of quantile levels in the fast quantile process regression algorithm.
7. Cumulate the sum and the sum of squares of each statistic that is computed in Step 6 across the bootstrap iterations.
The NREP=n suboption in the MCDF option specifies the total number of bootstrap iterations, n_rep. Let θ̂_s denote the statistic that is computed in the sth bootstrap iteration for a parameter θ. This θ could be a point in an empirical quantile function, an MWU test score, or any other statistic from Step 2 or Step 6. Accordingly, S_1 = Σ_{s=1}^{n_rep} θ̂_s and S_2 = Σ_{s=1}^{n_rep} θ̂_s², respectively, represent the total cumulative sum and the total cumulative sum of squares of the relevant statistic for all the bootstrap iterations, which are cumulated in Step 3 and Step 7. Then the formulas in the following table calculate the final bootstrap estimate and its standard error, Z score, p-value, and confidence limits for each of the relevant statistics.
| Statistic | Formula |
|---|---|
| Bootstrap estimate | θ̄ = S_1/n_rep |
| Standard error | SE = sqrt((S_2 − S_1²/n_rep)/(n_rep − 1)) |
| Z score | Z = θ̄/SE |
| Two-sided p-value | p = 2(1 − Φ(\|Z\|)) |
| Confidence limits | θ̄ ± z_(1−α/2) SE |

Here Φ denotes the standard normal CDF, and z_(1−α/2) is its (1 − α/2) quantile.
In the marginal CDF plot, the confidence limits of a CDF sample are the inverse functions of the confidence limits of its associated empirical quantile function.
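The per-statistic aggregation in the table can be sketched from the cumulative sums S_1 and S_2 (here recomputed from stored per-iteration values rather than accumulated on the fly):

```python
import math

def bootstrap_summary(stats):
    """Bootstrap estimate, standard error, Z score, and two-sided
    p-value from per-iteration statistics, per the table above."""
    n_rep = len(stats)
    s1 = sum(stats)                 # cumulative sum
    s2 = sum(t * t for t in stats)  # cumulative sum of squares
    est = s1 / n_rep
    se = math.sqrt((s2 - s1 * s1 / n_rep) / (n_rep - 1))
    z = est / se
    p = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return est, se, z, p
```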
Given a response value y and a CDF sample {(τ_j, y_j) : j = 1, …, q} for a CDF F, the CONDDIST statement estimates the quantile level τ(y) of y on F by using the following linear interpolation method:

τ(y) = τ_j + (τ_{j+1} − τ_j)(y − y_j)/(y_{j+1} − y_j) for y_j ≤ y < y_{j+1}

Here τ(y) is defined as a regression quantile level if the CDF sample is conditional for F(y|x), or a sample quantile level if the CDF sample is observed and marginal for F_Y(y).
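This interpolation is ordinary piecewise-linear interpolation on the CDF sample; in Python it reduces to `np.interp` with the roles of levels and values swapped:

```python
import numpy as np

def quantile_level(y, levels, values):
    """Linearly interpolate the quantile level of y on a CDF sample
    (levels[j], values[j]); y outside the sample range is clamped."""
    return float(np.interp(y, values, levels))

quantile_level(3.0, [0.25, 0.5, 0.75], [1.0, 2.0, 4.0])  # 0.625
```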
The PLOT=PPPLOT option in the CONDDIST statement creates the scatter plot for the regression quantile levels versus the sample quantile levels.
The CONDDIST statement estimates the PDF by applying the kernel density estimator to the estimated CDF on the quantile-level grid. An estimated CDF for the CONDDIST statement is in the form of {(τ_j, y_j) : j = 1, …, q}. The lower end of the estimated PDF is limited by the quantile level τ_1, and the upper end of the estimated PDF is limited by the quantile level τ_q.
The general form of the kernel density estimator is

f̂(y) = (1/(qλ)) Σ_{j=1}^q K((y − y_j)/λ)

where K(·) is the kernel function, λ is the bandwidth, q is the sample size, and y_j is the jth point in the CDF sample.
The KDE option provides three kernel functions (K): normal, quadratic, and triangular. You can specify the function by using the K=kernel-option in parentheses after the KDE option. The values of the K= option are NORMAL, QUADRATIC, and TRIANGULAR. By default, a normal kernel is used. The formulas for the kernel functions are as follows:

Normal: K(t) = (1/√(2π)) exp(−t²/2)
Quadratic: K(t) = (3/4)(1 − t²) for |t| ≤ 1, and 0 otherwise
Triangular: K(t) = 1 − |t| for |t| ≤ 1, and 0 otherwise
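A direct transcription of the general estimator with the normal kernel (illustrative only):

```python
import math

def kde_normal(points, y, bandwidth):
    """Kernel density estimate at y with a normal kernel, following
    the general estimator form given above."""
    q = len(points)
    total = sum(
        math.exp(-0.5 * ((y - yj) / bandwidth) ** 2) / math.sqrt(2 * math.pi)
        for yj in points
    )
    return total / (q * bandwidth)
```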
The value of λ, referred to as the bandwidth parameter, determines the degree of smoothness in the estimated probability density function. You specify λ indirectly when you specify a standardized bandwidth c by using the C=value suboption. Let Q denote the interquartile range and q denote the sample size; then c is related to λ by the formula

λ = cQq^(−1/5)
For a specific kernel function, the discrepancy between the density estimator f̂(y) and the true density f(y) is measured by the mean integrated squared error (MISE):

MISE(λ) = E ∫ (f̂(y) − f(y))² dy

MISE is the sum of the integrated squared bias and the integrated variance. An approximate mean integrated squared error (AMISE) is defined as

AMISE(λ) = (λ⁴/4) (∫ t²K(t) dt)² ∫ (f″(y))² dy + (1/(qλ)) ∫ K(t)² dt
You can derive a bandwidth that minimizes AMISE by treating f(y) as the normal density whose parameters μ and σ are estimated by the sample mean and standard deviation, respectively. If you do not specify a bandwidth parameter or if you specify C=MISE, the bandwidth that minimizes AMISE is used. The value of AMISE can be used to compare different density estimates. You can also specify C=SJPI to select the bandwidth by using the Sheather-Jones plug-in method (Jones, Marron, and Sheather 1996).
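For a normal kernel, this normal-reference AMISE minimizer reduces to the familiar (4/(3q))^(1/5)·σ̂ rule. A sketch (the procedure's exact computation, which also involves the standardized bandwidth c and the interquartile range, may differ):

```python
import math

def normal_reference_bandwidth(y):
    """AMISE-minimizing bandwidth for a normal kernel when the true
    density is treated as normal with the sample standard deviation."""
    n = len(y)
    mean = sum(y) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in y) / (n - 1))
    return (4.0 / (3.0 * n)) ** 0.2 * sd
```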
The general kernel density estimates assume that the domain of the density to be estimated can take all values on a real line. However, sometimes the domain of a density is an interval that is bounded on one or both sides. For example, if a variable Y is a measurement of only positive values, then the kernel density curve should be bounded so that it is zero for negative Y values. You can use the LOWER= and UPPER= kde-options in the PDF= option in the CONDDIST statement to specify the bounds.