Because the cumulative distribution function (CDF) is the inverse of the quantile function, you can estimate the CDF and other CDF-based statistics by inverting the relevant quantile function estimates. The CONDDIST statement in the QUANTREG procedure performs this type of analysis by estimating the conditional and marginal probability functions for the response random variable Y. These probability functions include the following:
- F(y|x) for the conditional CDF and f(y|x) for the conditional probability density function (PDF), which can be estimated by using the model
- F_Y(y) for the observable marginal CDF and f_Y(y) for the observable marginal PDF, which can be estimated from the observed response values without using the model
- F̃_Y(y) for the counterfactual marginal CDF and f̃_Y(y) for the counterfactual marginal PDF, which can be estimated from the quantile process predictions of the response variable, with the covariates integrated out
The QUANTREG procedure estimates the conditional quantile function, Q(τ|x), by using the fast quantile process regression (FQPR) algorithm. If you specify the QUANTILE=FQPR option in the MODEL statement, the CONDDIST statement uses the FQPR parameter estimates to estimate Q(τ|x). For more information about the FQPR option, see the section Fast Quantile Process Regression. If the QUANTILE=FQPR option is not used in the MODEL statement, the first CONDDIST statement uses the FQPR algorithm to compute the estimate Q̂(τ|x) for all the CONDDIST statements. This Q̂(τ|x) is equal to the Q̂(τ|x) that is output from the QUANTILE=FQPR option when you use the default FQPR suboption values.
For purposes of conceptual and computational simplicity, the CONDDIST statement estimates the quantile functions on a quantile-level grid {τ_j = τ_L + (j−1)Δτ : j = 1, …, q}, where τ_L is the lower end of the grid and Δτ is the step length of the grid. You can specify τ_L and q by respectively using the L=value and N=n suboptions of the QUANTILE=FQPR option in the MODEL statement. You can also specify the upper end of the grid, τ_U, by using the U=value suboption of the QUANTILE=FQPR option in the MODEL statement, such that Δτ = (τ_U − τ_L)/(q − 1). By default, the grid size q is the smaller of 100 and half the number of training observations in the DATA= data set, and τ_L and τ_U take the default values of the L= and U= suboptions.
The estimated quantile function {Q̂(τ_j|x) : j = 1, …, q} is also called a CDF sample for F(y|x).
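The grid construction can be illustrated with a short Python sketch (not SAS code; the endpoint defaults `tau_l=0.01` and `tau_u=0.99` are assumptions for illustration, not the procedure's actual L= and U= defaults):

```python
import numpy as np

def quantile_grid(n_train, tau_l=0.01, tau_u=0.99, q=None):
    """Evenly spaced quantile-level grid from tau_l to tau_u.
    The endpoint defaults are illustrative assumptions only."""
    if q is None:
        q = min(100, n_train // 2)  # default grid size described above
    return np.linspace(tau_l, tau_u, q)

grid = quantile_grid(n_train=500)  # 100 evenly spaced levels
```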
The QUANTREG procedure computes the quantile function estimates on the DATA= data set in the PROC QUANTREG statement. To test the fitted model on more observations, you can specify a separate data set by using the TESTDATA= option in the CONDDIST statement. Without assuming that the TESTDATA= data set and the DATA= data set share the same conditional distribution for Y, the CONDDIST statement estimates counterfactual probability functions for the response variable Y of the TESTDATA= data set that impose the quantile function Q(τ|x) of the DATA= data set on the TESTDATA= data set.
The CONDDIST statement estimates F(y|x) by inverting the estimated quantile function on the quantile-level grid, such that the estimated CDF F̂(y|x) satisfies

F̂(Q̂(τ_j|x) | x) = τ_j for j = 1, …, q

By definition, a quantile function must be nondecreasing, so that Q(τ_i|x) ≤ Q(τ_j|x) for any τ_i ≤ τ_j. However, quantile regression estimates can result in crossed quantile predictions. In other words, it is possible that Q̂(τ_i|x) > Q̂(τ_j|x) for some τ_i < τ_j and some x. This predicament is called crossing. To avoid crossing, the CONDDIST statement defines Q̃(τ_j|x) as the jth-smallest prediction among {Q̂(τ_1|x), …, Q̂(τ_q|x)}. Therefore, it is possible that Q̃(τ_j|x) ≠ Q̂(τ_j|x) for some τ_j and x.
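The crossing fix amounts to sorting the predictions at a fixed x, a form of monotone rearrangement. A minimal sketch:

```python
import numpy as np

def fix_crossing(q_pred):
    """Assign the jth quantile level the jth-smallest prediction,
    which guarantees a nondecreasing quantile function."""
    return np.sort(np.asarray(q_pred))

# crossed predictions at increasing quantile levels
q_tilde = fix_crossing([1.2, 0.9, 1.5, 1.4])  # nondecreasing order
```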
For an individual observation, the CONDDIST statement assigns the type “Fit for Obs” to this conditional CDF sample and labels this sample by using the observation ID value (if available) or the observation index.
You can also request the conditional CDF sample at the average covariate values, x̄, by using the SHOWAVG option for the training average or the TESTDATA(SHOWAVG)= option for the test average (or both).
The CONDDIST statement assigns the type “Fit at Average” to these conditional CDF samples. For training data, the CONDDIST statement labels this CDF sample as “TrainAvg” and assigns the average of the training response values as its response value. For test data, the CONDDIST statement labels this CDF sample as “TestAvg” and assigns the average of the test response values as its response value.
The CONDDIST statement estimates the observable marginal CDF F_Y(y) by inverting the empirical quantile function Q̂_Y(τ) of the response variable. Let y_(1) ≤ ⋯ ≤ y_(n) denote the sorted response values for either the training data that you can specify by using the DATA= data set in the PROC QUANTREG statement or the test data that you can specify by using the TESTDATA= data set in the CONDDIST statement. If you assign the quantile level τ_i = i/n to the response value y_(i), then Q̂_Y(τ) is defined as

Q̂_Y(τ) = y_(i) for τ_{i−1} < τ ≤ τ_i, i = 1, …, n, with τ_0 = 0

To be consistent with the quantile-level grid that is used for estimating the conditional probability functions, the CONDDIST statement estimates F_Y(y) by using the CDF sample {Q̂_Y(τ_j) : j = 1, …, q}.
The CONDDIST statement assigns the type “Observed” to these observed marginal CDF samples. For training data, the CONDDIST statement labels this CDF sample as “TrainObs” and assigns the average of the training response values as its response value. For test data, the CONDDIST statement labels this CDF sample as “TestObs” and assigns the average of the test response values as its response value.
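A hedged sketch of inverting the empirical quantile function onto a quantile-level grid; the exact indexing convention that the procedure uses may differ:

```python
import numpy as np

def observed_cdf_sample(y, grid):
    """Evaluate the empirical quantile function of y on a grid of
    quantile levels: level tau maps to the ceil(tau*n)th-smallest value."""
    y_sorted = np.sort(np.asarray(y, dtype=float))
    n = len(y_sorted)
    idx = np.ceil(np.asarray(grid) * n).astype(int) - 1
    return y_sorted[np.clip(idx, 0, n - 1)]

observed_cdf_sample([3.0, 1.0, 2.0, 4.0], [0.25, 0.5, 0.75, 1.0])
```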
Let F(y|x) denote the conditional distribution of Y given x for the DATA= data set, and let G(x) denote the marginal distribution of the explanatory covariates vector x for the TESTDATA= data set. Then the counterfactual marginal distribution of Y for the TESTDATA= data set is defined as

F̃_Y(y) = ∫ F(y|x) dG(x)
The CONDDIST statement estimates F̃_Y(y) by inverting the quantile function of the quantile process predictions for the response variable. Let {Q̂(τ_j|x_i) : j = 1, …, q; i = 1, …, m} denote the quantile predictions for all the m observations of the TESTDATA= data set, and let v_(k) denote the kth-smallest value in this set of mq predictions. The CONDDIST statement estimates F̃_Y(y) by defining the CDF sample

Q̃_Y(τ_j) = v_(k) with k = ⌈τ_j · mq⌉, j = 1, …, q
The CONDDIST statement assigns the type "Fit and Pooled" and the label "TestFit" to this fitted marginal CDF sample, and assigns the average of the test response values as its response value.
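Conceptually, the TestFit sample pools all quantile predictions and reads each grid level off as an order statistic of the pool. A sketch under the order-statistic convention assumed above:

```python
import numpy as np

def pooled_cdf_sample(pred_matrix, grid):
    """Counterfactual marginal CDF sample: pool all m*q quantile
    predictions and take the ceil(tau*m*q)th-smallest value per level."""
    pool = np.sort(np.asarray(pred_matrix, dtype=float).ravel())
    n = len(pool)
    idx = np.ceil(np.asarray(grid) * n).astype(int) - 1
    return pool[np.clip(idx, 0, n - 1)]
```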
Recall that the observable marginal CDF is also available for the TESTDATA= data set. For clarity, let F_Y^test(y) denote the observable marginal CDF for the TESTDATA= data set. Then, by comparing the fitted marginal CDF sample for F̃_Y(y) with the observed marginal CDF sample for F_Y^test(y), you can test whether the TESTDATA= data set follows the same model that is built on the DATA= data set. The CONDDIST statement supports the Mann-Whitney U test for this purpose when you use the MWU option.
The Mann-Whitney U test (also called the Wilcoxon rank-sum test) is a nonparametric two-sample test. Let n_1 denote the size of the first sample {u_1, …, u_{n_1}} and n_2 denote the size of the second sample. By merging the two samples into one ordered set, the statistic of the Mann-Whitney U test is defined as

U = Σ_{i=1}^{n_1} R(u_i) − n_1(n_1 + 1)/2

where R(u_i) is the rank of u_i in the merged set. Under some regularity conditions, U asymptotically follows a normal distribution whose expectation equals n_1 n_2/2 and whose variance equals n_1 n_2 (n_1 + n_2 + 1)/12.
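The statistic and its normal approximation can be sketched in a few lines of self-contained Python (no tie correction here, which the procedure may apply):

```python
import math

def mwu_test(sample1, sample2):
    """Mann-Whitney U statistic with the asymptotic normal
    approximation given above; assumes no ties."""
    n1, n2 = len(sample1), len(sample2)
    merged = sorted([(v, 0) for v in sample1] + [(v, 1) for v in sample2])
    # rank sum of the first sample (ranks start at 1)
    r1 = sum(rank + 1 for rank, (_, src) in enumerate(merged) if src == 0)
    u = r1 - n1 * (n1 + 1) / 2
    z = (u - n1 * n2 / 2) / math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    p = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return u, z, p
```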
You can perform the Mann-Whitney U test against the observed marginal CDF sample of the response variable for the following data sets:
The null and alternative hypotheses of the Mann-Whitney U test for the CONDDIST statement are, respectively,

H_0: Pr(Y* > Y_0) = 1/2 and H_1: Pr(Y* > Y_0) ≠ 1/2

where Y* denotes the response random variable for a CDF under the hypothesis test, such as F(y|x) or F̃_Y(y), and Y_0 denotes the random variable for the observable marginal CDF for either the DATA= data set or the TESTDATA= data set.
You can specify the size of the CDF samples by using the MWU(SAMPSIZE=NOBS | NQ) option or the TESTDATA(MWU(SAMPSIZE=NOBS | NQ)) option in the CONDDIST statement. The TESTDATA(MWU(SAMPSIZE=NOBS | NQ)) option overrides the MWU(SAMPSIZE=NOBS | NQ) option.
The SAMPSIZE=NOBS suboption requests that the size of the TrainObs CDF sample, the TestFit CDF sample, and all the predicted observationwise CDF samples be equal to the number of training observations in the DATA= data set in the PROC QUANTREG statement; and it requests that the size of the TestObs CDF sample be equal to the number of testing observations in the TESTDATA= data set in the CONDDIST statement. The SAMPSIZE=NOBS suboption is appropriate if the size of the quantile-level grid is larger than the number of training observations in the DATA= data set in the PROC QUANTREG statement. Otherwise, the Mann-Whitney U test could output an imprecise p-value because the fitted CDF samples might not sufficiently represent the CDF information that the original training data provide.
The SAMPSIZE=NQ suboption requests that the size of the CDF samples for the MWU tests be equal to the size of the quantile-level grid. The SAMPSIZE=NQ suboption is appropriate if the size of the quantile-level grid is smaller than both the number of training observations in the DATA= data set in the PROC QUANTREG statement and the number of testing observations in the TESTDATA= data set in the CONDDIST statement. Otherwise, the Mann-Whitney U test could output a smaller p-value and incorrectly reject the null hypothesis.
You can specify the MCDF option in the CONDDIST statement to perform the marginal distribution analysis by using a weighted bootstrap resampling method (Praestgaard and Wellner 1993). The bootstrap method iterates the following steps:
1. Assign a probability weight w_i to each of the n training observations in the DATA= data set in the PROC QUANTREG statement, where n is the total number of training observations. If you specify weights ω_i by using the WEIGHT statement, w_i is generated from the Gamma(ω_i, 1) distribution and normalized by using the factor n/Σ_{i=1}^n w_i, such that the finally assigned w_i satisfy Σ_{i=1}^n w_i = n. Otherwise, w_i is generated from the Exponential(1) distribution and normalized in the same way, such that Σ_{i=1}^n w_i = n.
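A minimal sketch of this weight-generation step, assuming Gamma(ω_i, 1) draws for WEIGHT-statement weights ω_i and Exponential(1) draws otherwise (the Gamma parameterization is an assumption):

```python
import numpy as np

def bootstrap_weights(n, user_weights=None, rng=None):
    """Draw weighted-bootstrap probability weights and normalize
    them so that they sum to n."""
    if rng is None:
        rng = np.random.default_rng(0)
    if user_weights is not None:
        # assumed parameterization: shape = user weight, scale = 1
        w = rng.gamma(shape=np.asarray(user_weights, dtype=float), scale=1.0)
    else:
        w = rng.exponential(scale=1.0, size=n)
    return w * (n / w.sum())  # normalization factor n / sum(w)
```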
2. Compute the following statistics by using the w-weighted training observations:

   a. Quantile process regression model: Q̂(τ|x), by using the fast quantile process regression algorithm. You can use the FQPR(N=n) option in the MODEL statement to specify the quantile-level grid for your quantile process model.

   b. The TrainObs marginal CDF sample for the w-weighted response variable. Let y_(1) ≤ ⋯ ≤ y_(n) denote the sorted response values for the training data; let w_(1), …, w_(n) denote their associated normalized bootstrap weights; and define W_i = Σ_{k=1}^i w_(k). Then the following formula computes a size-m TrainObs marginal CDF sample:

      Q̂_Y(j/m) = y_(i) for W_{i−1} < jn/m ≤ W_i, j = 1, …, m
For the estimates in the marginal distribution comparison table, the size of the TrainObs marginal CDF sample equals the number of training observations. For the marginal CDF plot, the size of the TrainObs marginal CDF sample equals the number of quantile levels in the fast quantile process regression algorithm.
3. Cumulate the sum and the sum of squares of each statistic that is computed in Step 2 across the bootstrap iterations.
If you specify the TESTDATA= data set in the CONDDIST statement, the weighted bootstrap method also iterates the following steps:
4. Assign a probability weight w*_i to each of the testing observations by using the same method that is described in Step 1, applied to the TESTDATA= data set.

5. Compute quantile predictions for all the testing observations by using the quantile process regression model that is fitted in Step 2a.
6. Compute the following statistics:

   a. The TestObs marginal CDF sample for the w*-weighted response variable. This marginal CDF sample is computed by using the same formula as in Step 2b but on the sorted testing response values.

   b. The TestFit marginal CDF sample for the pooled quantile predictions from Step 5. This marginal CDF sample is computed by using the same formula as in Step 2b but on the sorted pooled quantile predictions, where the weight of each quantile prediction equals the weight of its associated testing observation. You can use the QRNG= suboption of the MCDF option to limit the range of the quantile predictions. By default (QRNG=BOTH), the quantile predictions are limited to the range of the observed response values in the combined training and testing data: a quantile prediction that is smaller than the minimum of all the training and testing response values is reset to that minimum, and a quantile prediction that is larger than the maximum is reset to that maximum.

   c. The CDF comparison scores, including the MWU test, mean difference, and median difference.
For the estimates in the marginal distribution comparison table, the size of the TestObs and TestFit marginal CDF samples equals the number of testing observations. For the marginal CDF plot, the size of the TestObs and TestFit marginal CDF samples equals the number of quantile levels in the fast quantile process regression algorithm.
7. Cumulate the sum and the sum of squares of each statistic that is computed in Step 6 across the bootstrap iterations.
The NREP=n suboption in the MCDF option specifies the total number of bootstrap iterations, n_rep. Let θ̂_s denote the statistic that is computed in the sth bootstrap iteration for a parameter θ. This θ could be a point in an empirical quantile function, an MWU test score, or any other statistic from Step 2 or Step 6. Accordingly, S_1 = Σ_{s=1}^{n_rep} θ̂_s and S_2 = Σ_{s=1}^{n_rep} θ̂_s², respectively, represent the total cumulative sum and the total cumulative sum of squares of the relevant statistic for all the bootstrap iterations, which are cumulated in Step 3 and Step 7. Then the formulas in the following table calculate the final bootstrap estimate and its standard error, Z score, p-value, and confidence limits for each of the relevant statistics.
| Statistic | Formula |
|---|---|
| Bootstrap estimate | θ̄ = S_1/n_rep |
| Standard error | SE = sqrt((S_2 − S_1²/n_rep)/(n_rep − 1)) |
| Z score | Z = θ̄/SE |
| Two-sided p-value | p = 2(1 − Φ(\|Z\|)) |
| Confidence limits | θ̄ ± z_(1−α/2) SE |

Here Φ denotes the standard normal CDF, and z_(1−α/2) is its (1 − α/2) quantile.
In the marginal CDF plot, the confidence limits of a CDF sample are the inverse functions of the confidence limits of its associated empirical quantile function.
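The per-statistic aggregation in the table can be sketched from the cumulative sums S_1 and S_2 (here recomputed from stored per-iteration values rather than accumulated on the fly):

```python
import math

def bootstrap_summary(stats):
    """Bootstrap estimate, standard error, Z score, and two-sided
    p-value from per-iteration statistics, per the table above."""
    n_rep = len(stats)
    s1 = sum(stats)                 # cumulative sum
    s2 = sum(t * t for t in stats)  # cumulative sum of squares
    est = s1 / n_rep
    se = math.sqrt((s2 - s1 * s1 / n_rep) / (n_rep - 1))
    z = est / se
    p = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return est, se, z, p
```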
Given a response value y and a CDF sample {(τ_j, y_j) : j = 1, …, q} for a CDF F, the CONDDIST statement estimates the quantile level τ(y) of y on F by using the following linear interpolation method:

τ(y) = τ_j + (τ_{j+1} − τ_j)(y − y_j)/(y_{j+1} − y_j) for y_j ≤ y < y_{j+1}

Here τ(y) is defined as a regression quantile level if the CDF sample is conditional for F(y|x), or a sample quantile level if the CDF sample is observed and marginal for F_Y(y).
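This interpolation is ordinary piecewise-linear interpolation on the CDF sample; in Python it reduces to `np.interp` with the roles of levels and values swapped:

```python
import numpy as np

def quantile_level(y, levels, values):
    """Linearly interpolate the quantile level of y on a CDF sample
    (levels[j], values[j]); y outside the sample range is clamped."""
    return float(np.interp(y, values, levels))

quantile_level(3.0, [0.25, 0.5, 0.75], [1.0, 2.0, 4.0])  # 0.625
```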
The PLOT=PPPLOT option in the CONDDIST statement creates the scatter plot for the regression quantile levels versus the sample quantile levels.
The CONDDIST statement estimates the PDF by applying the kernel density estimator to the estimated CDF on the quantile-level grid. An estimated CDF for the CONDDIST statement is in the form of {(τ_j, y_j) : j = 1, …, q}. The lower end of the estimated PDF is limited by the quantile level τ_1, and the upper end of the estimated PDF is limited by the quantile level τ_q.
The general form of the kernel density estimator is

f̂(y) = (1/(qλ)) Σ_{j=1}^q K((y − y_j)/λ)

where K(·) is the kernel function, λ is the bandwidth, q is the sample size, and y_j is the jth point in the CDF sample.
The KDE option provides three kernel functions (K): normal, quadratic, and triangular. You can specify the function by using the K=kernel-option in parentheses after the KDE option. The values of the K= option are NORMAL, QUADRATIC, and TRIANGULAR. By default, a normal kernel is used. The formulas for the kernel functions are as follows:

Normal: K(t) = (1/√(2π)) exp(−t²/2)
Quadratic: K(t) = (3/4)(1 − t²) for |t| ≤ 1, and 0 otherwise
Triangular: K(t) = 1 − |t| for |t| ≤ 1, and 0 otherwise
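A direct transcription of the general estimator with the normal kernel (illustrative only):

```python
import math

def kde_normal(points, y, bandwidth):
    """Kernel density estimate at y with a normal kernel, following
    the general estimator form given above."""
    q = len(points)
    total = sum(
        math.exp(-0.5 * ((y - yj) / bandwidth) ** 2) / math.sqrt(2 * math.pi)
        for yj in points
    )
    return total / (q * bandwidth)
```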
The value of λ, referred to as the bandwidth parameter, determines the degree of smoothness in the estimated probability density function. You specify λ indirectly when you specify a standardized bandwidth c by using the C=value suboption. Let Q denote the interquartile range and q denote the sample size; then c is related to λ by the formula

λ = cQq^(−1/5)
For a specific kernel function, the discrepancy between the density estimator f̂(y) and the true density f(y) is measured by the mean integrated squared error (MISE):

MISE(λ) = E ∫ (f̂(y) − f(y))² dy

MISE is the sum of the integrated squared bias and the integrated variance. An approximate mean integrated squared error (AMISE) is defined as

AMISE(λ) = (λ⁴/4) (∫ t²K(t) dt)² ∫ (f″(y))² dy + (1/(qλ)) ∫ K(t)² dt
You can derive a bandwidth that minimizes AMISE by treating f(y) as the normal density whose parameters μ and σ are estimated by the sample mean and standard deviation, respectively. If you do not specify a bandwidth parameter or if you specify C=MISE, the bandwidth that minimizes AMISE is used. The value of AMISE can be used to compare different density estimates. You can also specify C=SJPI to select the bandwidth by using the Sheather-Jones plug-in method (Jones, Marron, and Sheather 1996).
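For a normal kernel, this normal-reference AMISE minimizer reduces to the familiar (4/(3q))^(1/5)·σ̂ rule. A sketch (the procedure's exact computation, which also involves the standardized bandwidth c and the interquartile range, may differ):

```python
import math

def normal_reference_bandwidth(y):
    """AMISE-minimizing bandwidth for a normal kernel when the true
    density is treated as normal with the sample standard deviation."""
    n = len(y)
    mean = sum(y) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in y) / (n - 1))
    return (4.0 / (3.0 * n)) ** 0.2 * sd
```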
The general kernel density estimates assume that the domain of the density to be estimated can take all values on a real line. However, sometimes the domain of a density is an interval that is bounded on one or both sides. For example, if a variable Y is a measurement of only positive values, then the kernel density curve should be bounded so that it is zero for negative Y values. You can use the LOWER= and UPPER= kde-options in the PDF= option in the CONDDIST statement to specify the bounds.