The QUANTREG Procedure

Estimating Probability Functions by Using the CONDDIST Statement

Because the cumulative distribution function (CDF) is the inverse of the quantile function, you can estimate the CDF and other CDF-based statistics by inverting the relevant quantile function estimates. The CONDDIST statement in the QUANTREG procedure performs this type of analysis by estimating the conditional and marginal probability functions for the response random variable $\mathbf{Y}$. These probability functions include the following:

  • $F_{\mathbf{Y}|\mathbf{x}}(t)$ for the conditional CDF and $P_{\mathbf{Y}|\mathbf{x}}(t)$ for the conditional probability density function (PDF), which can be estimated by using the model

  • $F_{\mathbf{Y}}(t)$ for the observable marginal CDF and $P_{\mathbf{Y}}(t)$ for the observable marginal PDF, which can be estimated from the observed response values without using the model

  • $F_{\mathbf{P}}(t)$ for the predictable marginal CDF and $P_{\mathbf{P}}(t)$ for the counterfactual marginal PDF, which can be estimated from the quantile process predictions of the response variable with the covariates $\mathbf{x}$ integrated out

The QUANTREG procedure estimates the conditional quantile function, $Q_{\mathbf{Y}|\mathbf{x}}(\tau) = \mathbf{x}'\boldsymbol{\beta}(\tau)$, by using the fast quantile process regression (FQPR) algorithm. If you specify the QUANTILE=FQPR option in the MODEL statement, the CONDDIST statement uses the FQPR parameter estimates $\hat{\boldsymbol{\beta}}(\tau)$ to estimate $\boldsymbol{\beta}(\tau)$. For more information about the FQPR option, see the section Fast Quantile Process Regression. If the QUANTILE=FQPR option is not used in the MODEL statement, the first CONDDIST statement uses the FQPR algorithm to compute $\hat{\boldsymbol{\beta}}(\tau)$ for all the CONDDIST statements. This $\hat{\boldsymbol{\beta}}(\tau)$ is equal to the $\hat{\boldsymbol{\beta}}(\tau)$ that is output from the QUANTILE=FQPR option when you use the default FQPR suboption values.

For purposes of conceptual and computational simplicity, the CONDDIST statement estimates the quantile functions $Q(\tau)$ on a quantile-level grid $\{\tau_j = \tau_1 + (j-1)\tau_s : j = 1, \ldots, q\}$, where $\tau_1$ is the lower end of the grid and $\tau_s$ is the step length of the grid. You can specify $\tau_1$ and $q$ by using the L=value and N=n suboptions, respectively, of the QUANTILE=FQPR option in the MODEL statement. You can also specify the upper end of the grid, $\tau_q$, by using the U=value suboption of the QUANTILE=FQPR option in the MODEL statement, such that

$$\tau_s = \frac{\tau_q - \tau_1}{q - 1}$$

By default, the size of the grid $q$ is the smaller of 100 and half the number of training observations in the DATA= data set. Also by default, $\tau_1 = 0.5/q$, $\tau_q = 1 - 0.5/q$, and $\tau_s = 1/q$.
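
The grid arithmetic above can be checked with a small Python sketch. This is only an illustration of the formulas, not the procedure's implementation; the function name `quantile_grid` and its parameters are hypothetical, and the grid size q is passed in explicitly rather than derived from a data set.

```python
def quantile_grid(q, lower=None, upper=None):
    """Build the grid tau_j = tau_1 + (j - 1) * tau_s for j = 1, ..., q.

    `lower` and `upper` play the role of the L= and U= values; when they are
    omitted, the documented defaults 0.5/q and 1 - 0.5/q are used.
    Requires q >= 2 so that the step length is defined.
    """
    tau_1 = 0.5 / q if lower is None else lower
    tau_q = 1.0 - 0.5 / q if upper is None else upper
    tau_s = (tau_q - tau_1) / (q - 1)          # step length of the grid
    return [tau_1 + j * tau_s for j in range(q)], tau_s


grid, step = quantile_grid(4)
print(grid, step)   # [0.125, 0.375, 0.625, 0.875] 0.25
```

With the default end points, the step length reduces to 1/q, which matches the default stated above.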

The estimated quantile function $\{\hat{Q}(\tau_1), \ldots, \hat{Q}(\tau_q)\}$ is also called a CDF sample for $F(t)$.

Counterfactual Distributions for the TESTDATA= Data Set

The QUANTREG procedure computes the quantile function estimates $\hat{Q}_{\mathbf{Y}|\mathbf{x}}(\tau) = \mathbf{x}'\hat{\boldsymbol{\beta}}(\tau)$ on the DATA= data set in the PROC QUANTREG statement. To test this $\hat{Q}_{\mathbf{Y}|\mathbf{x}}(\tau)$ for more observations, you can specify a separate data set by using the TESTDATA= data set option in the CONDDIST statement. Without assuming that the TESTDATA= data set and the DATA= data set share the same conditional distribution for $\mathbf{Y}|\mathbf{X}$, the CONDDIST statement estimates counterfactual probability functions for the response variable $\mathbf{Y}$ of the TESTDATA= data set that impose the quantile function $Q_{\mathbf{Y}|\mathbf{x}}(\tau)$ of the DATA= data set on the TESTDATA= data set.

Conditional Cumulative Distribution Functions

The CONDDIST statement estimates $F_{\mathbf{Y}|\mathbf{x}}(t)$ by inverting the estimated quantile function on the quantile-level grid $\{\hat{Q}_{\mathbf{Y}|\mathbf{x}}(\tau_1), \ldots, \hat{Q}_{\mathbf{Y}|\mathbf{x}}(\tau_q)\}$, such that the estimated $F_{\mathbf{Y}|\mathbf{x}}(t)$ satisfies

$$\hat{F}_{\mathbf{Y}|\mathbf{x}}\left(\hat{Q}_{\mathbf{Y}|\mathbf{x}}(\tau_j)\right) = \tau_j \quad \text{for } j = 1, \ldots, q$$

By definition, a quantile function $Q(\tau)$ must be nondecreasing, so that $Q(\tau_1) \le Q(\tau_2)$ for any $0 \le \tau_1 < \tau_2 \le 1$. However, quantile regression estimates can produce crossed quantile predictions. In other words, it is possible that $\mathbf{x}'\hat{\boldsymbol{\beta}}(\tau_1) > \mathbf{x}'\hat{\boldsymbol{\beta}}(\tau_2)$ for some $\mathbf{x}$ and some $0 \le \tau_1 < \tau_2 \le 1$. This predicament is called crossing. To avoid crossing, the CONDDIST statement defines $\hat{Q}_{\mathbf{Y}|\mathbf{x}}(\tau_j)$ as the $j$th-smallest prediction among $\{\mathbf{x}'\hat{\boldsymbol{\beta}}(\tau_1), \ldots, \mathbf{x}'\hat{\boldsymbol{\beta}}(\tau_q)\}$. Therefore, it is possible that $\hat{Q}_{\mathbf{Y}|\mathbf{x}}(\tau_j) \ne \mathbf{x}'\hat{\boldsymbol{\beta}}(\tau_j)$ for some $\mathbf{x}$ and $\tau_j$.
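
The rearrangement described above amounts to sorting the q predictions for one observation. The following Python sketch illustrates the idea under simple assumptions: the covariate vector and the coefficient vectors are made-up numbers, and the function name `monotone_quantiles` is hypothetical.

```python
def monotone_quantiles(x, beta_hat):
    """Return the raw predictions x'beta(tau_j) and their sorted version.

    x        : covariate vector (here including an intercept term)
    beta_hat : one coefficient vector per quantile level tau_1, ..., tau_q
    The j-th entry of the sorted list is the j-th-smallest prediction, which
    is the value used as Q_hat(tau_j) so that the result is nondecreasing.
    """
    raw = [sum(xi * bi for xi, bi in zip(x, beta)) for beta in beta_hat]
    return raw, sorted(raw)


# Illustrative values only: the predictions cross between tau_1 and tau_2.
x = [1.0, 2.0]
beta_hat = [[0.0, 1.0], [3.0, -1.0], [1.0, 1.5]]
raw, fixed = monotone_quantiles(x, beta_hat)
print(raw)    # [2.0, 1.0, 4.0]
print(fixed)  # [1.0, 2.0, 4.0]
```

In this example the value assigned to the first grid level differs from the raw prediction at that level, which is the situation the last sentence above describes.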

For an individual observation, the CONDDIST statement assigns the type “Fit for Obs” to this conditional CDF sample and labels this sample by using the observation ID value (if available) or the observation index.

Conditional Cumulative Distribution Functions at Average

You can also request the conditional CDF sample at the average $\bar{\mathbf{x}} = \frac{1}{n}\sum_{i=1}^{n} \mathbf{x}_i$ for $F_{\mathbf{Y}|\bar{\mathbf{x}}}(t)$ by using the SHOWAVG option for the training average or the TESTDATA(SHOWAVG)= option for the test average (or both).

The CONDDIST statement assigns the type “Fit at Average” to these conditional CDF samples. For training data, the CONDDIST statement labels this CDF sample as “TrainAvg” and assigns the average of the training response values $\bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i$ as its response value. For test data, the CONDDIST statement labels this CDF sample as “TestAvg” and assigns the average of the test response values as its response value.

Observed Marginal Cumulative Distribution Functions

The CONDDIST statement estimates the observable marginal $F_{\mathbf{Y}}(t)$ by inverting the empirical quantile function $\hat{Q}_{\mathbf{Y}}(\tau)$ of the response variable.

Let $y_{(1)} \le y_{(2)} \le \cdots \le y_{(n)}$ denote the sorted response values for either the training data that you can specify by using the DATA= data set in the PROC QUANTREG statement or the test data that you can specify by using the TESTDATA= data set in the CONDDIST statement. If you assign the quantile level $(i - 0.5)/n$ to the response value $y_{(i)}$, then $\hat{Q}_{\mathbf{Y}}(\tau)$ is defined as

$$\hat{Q}_{\mathbf{Y}}(\tau) = \begin{cases} y_{(1)} & \text{if } \tau \in [0,\ 0.5/n) \\ \lambda y_{(i+1)} + (1 - \lambda) y_{(i)} & \text{if } \tau \in [(i - 0.5)/n,\ (i + 0.5)/n) \text{ for } \lambda = n\tau - (i - 0.5) \\ y_{(n)} & \text{if } \tau \in [(n - 0.5)/n,\ 1] \end{cases}$$
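
As a rough illustration of this piecewise definition, the Python sketch below evaluates the empirical quantile function for a sorted list of response values. The function name `empirical_quantile` and the sample values are hypothetical, and the index clamp is a floating-point safeguard that is not part of the formula.

```python
import math


def empirical_quantile(y_sorted, tau):
    """Evaluate Q_hat_Y(tau), where y_(i) is assigned the level (i - 0.5)/n."""
    n = len(y_sorted)
    if tau < 0.5 / n:                      # below the first assigned level
        return y_sorted[0]
    if tau >= (n - 0.5) / n:               # at or above the last assigned level
        return y_sorted[-1]
    # 1-based index i with (i - 0.5)/n <= tau < (i + 0.5)/n
    i = min(math.floor(n * tau + 0.5), n - 1)
    lam = n * tau - (i - 0.5)              # interpolation weight in [0, 1)
    return lam * y_sorted[i] + (1 - lam) * y_sorted[i - 1]


y = [1.0, 2.0, 4.0, 8.0]
print(empirical_quantile(y, 0.125))  # 1.0  (the level assigned to y_(1))
print(empirical_quantile(y, 0.25))   # 1.5  (halfway between y_(1) and y_(2))
print(empirical_quantile(y, 0.5))    # 3.0  (halfway between y_(2) and y_(3))
```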

To be consistent with the quantile-level grid that is used for estimating the conditional probability functions, the CONDDIST statement estimates $Q_{\mathbf{Y}}(\tau)$ by using the CDF sample $\{\hat{Q}_{\mathbf{Y}}(\tau_1), \ldots, \hat{Q}_{\mathbf{Y}}(\tau_q)\}$.

The CONDDIST statement assigns the type “Observed” to these observed marginal CDF samples. For training data, the CONDDIST statement labels this CDF sample as “TrainObs” and assigns the average of the training response values as its response value. For test data, the CONDDIST statement labels this CDF sample as “TestObs” and assigns the average of the test response values as its response value.

Fitted Marginal Cumulative Distribution Functions

Let $F_{\mathbf{Y}|\mathbf{X}}$ denote the distribution of $\mathbf{Y}|\mathbf{X}$ for the DATA= data set, and let $F_{\mathbf{X}}^{*}$ denote the marginal distribution of the explanatory covariates vector $\mathbf{X}$ for the TESTDATA= data set. Then the counterfactual marginal distribution of $\mathbf{Y}$ for the TESTDATA= data set is defined as

$$F_{\mathbf{P}}^{*}(t) = \int F_{\mathbf{Y}|\mathbf{X} = \mathbf{x}}(t)\, dF_{\mathbf{X}}^{*}(\mathbf{x})$$

The CONDDIST statement estimates $F_{\mathbf{P}}^{*}(t)$ by inverting the quantile function of the quantile process predictions for the response variable.

Let $\{p_{ij} = \mathbf{x}_i'\hat{\boldsymbol{\beta}}(\tau_j) : i = 1, \ldots, n \text{ and } j = 1, \ldots, q\}$ denote the quantile predictions for all the observations of the TESTDATA= data set, and let $p_{(k)}$ denote the $k$th-smallest value in $\{p_{ij}\}$. The CONDDIST statement estimates $Q_{\mathbf{P}}^{*}(\tau)$ by defining

$$\hat{Q}_{\mathbf{P}}^{*}(\tau_j) = \begin{cases} p_{(jn - 0.5n + 0.5)} & \text{if } n \text{ is an odd number} \\ 0.5\, p_{(jn - 0.5n)} + 0.5\, p_{(jn - 0.5n + 1)} & \text{if } n \text{ is an even number} \end{cases}$$

where $\tau_j = \tau_1 + (j - 1)\tau_s$ for $j = 1, \ldots, q$.
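
One way to read this definition is that the nq pooled predictions are sorted and the median of each consecutive block of n values is taken. The Python sketch below follows the indexing in the formula; the function name `pooled_marginal_quantiles` and the toy prediction matrices are assumptions made for illustration.

```python
def pooled_marginal_quantiles(pred):
    """Compute Q*_P(tau_j) for j = 1, ..., q from pooled predictions.

    pred[i][j] holds the prediction for test observation i at grid level j
    (0-based in the code, 1-based in the formula).
    """
    n, q = len(pred), len(pred[0])
    pooled = sorted(v for row in pred for v in row)   # p_(1) <= ... <= p_(nq)
    out = []
    for j in range(1, q + 1):
        if n % 2 == 1:
            k = j * n - (n - 1) // 2        # equals jn - 0.5n + 0.5
            out.append(pooled[k - 1])       # convert 1-based k to 0-based
        else:
            k = j * n - n // 2              # equals jn - 0.5n
            out.append(0.5 * pooled[k - 1] + 0.5 * pooled[k])
    return out


odd = pooled_marginal_quantiles([[1.0, 4.0], [2.0, 5.0], [3.0, 6.0]])   # n = 3
even = pooled_marginal_quantiles([[1.0, 3.0], [2.0, 4.0]])              # n = 2
print(odd)   # [2.0, 5.0]
print(even)  # [1.5, 3.5]
```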

The CONDDIST statement assigns the type "Fit and Pooled" and the label "TestFit" to this fitted marginal CDF sample, and assigns the average of the test response values as its response value.

Recall that the observed marginal CDF is also available for the TESTDATA= data set. For clarity, let $F_{\mathbf{Y}}^{*}(t)$ denote the observable marginal CDF for the TESTDATA= data set. Then, by comparing the fitted marginal CDF sample $\hat{F}_{\mathbf{P}}^{*}(t)$ with the observed marginal CDF sample $\hat{F}_{\mathbf{Y}}^{*}(t)$, you can test whether the TESTDATA= data set follows the same model that is built on the DATA= data set. The CONDDIST statement supports the Mann-Whitney U test for this purpose when you use the MWU option.

Mann-Whitney U Test

The Mann-Whitney U test (also called the Wilcoxon rank-sum test) is a nonparametric two-sample test. Let $n_1$ denote the size of the first sample and $n_2$ denote the size of the second sample. By merging the two samples into one ordered set $\{a_j : j = 1, \ldots, n = n_1 + n_2\}$, the statistic of the Mann-Whitney U test is defined as

$$U = \sum_{j=1}^{n} c_j R_j \quad \text{with} \quad c_j = \begin{cases} 1 & \text{if } a_j \text{ belongs to the first sample} \\ 0 & \text{if } a_j \text{ belongs to the second sample} \end{cases}$$

where $R_j$ is the rank of $a_j$. Under some regularity conditions, $U$ asymptotically follows a normal distribution whose expectation equals

$$E(U) = \frac{n_1}{n} \sum_{j=1}^{n} R_j = \frac{n_1 (n + 1)}{2}$$

and whose variance equals

$$\mathrm{Var}(U) = \frac{n_1 n_2 (n + 1)}{12}$$
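
The statistic and its asymptotic moments can be illustrated with a short Python sketch. It assumes that the merged sample has no ties (tied values would need midranks, which are not handled here), and the standardized score `z` is simply the usual normal approximation built from the two moments above. The function name `mwu_statistic` and the sample values are made up.

```python
import math


def mwu_statistic(sample1, sample2):
    """Rank-sum statistic U with its asymptotic mean, variance, and z score.

    Assumes no ties in the merged sample.
    """
    merged = sorted([(v, 1) for v in sample1] + [(v, 2) for v in sample2])
    # U is the sum of the ranks R_j of the values from the first sample.
    u = sum(rank for rank, (_, grp) in enumerate(merged, start=1) if grp == 1)
    n1, n2 = len(sample1), len(sample2)
    n = n1 + n2
    mean = n1 * (n + 1) / 2
    var = n1 * n2 * (n + 1) / 12
    z = (u - mean) / math.sqrt(var)
    return u, mean, var, z


u, mean, var, z = mwu_statistic([1.1, 2.3, 4.0], [0.5, 3.2, 5.1, 6.7])
print(u, mean, var)  # 10 12.0 8.0
```

Here the first sample occupies ranks 2, 3, and 5 in the merged ordering, so U = 10, compared with an expected value of 12 under the null hypothesis.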

You can perform the Mann-Whitney U test against the observed marginal CDF sample of the response variable for the following data sets:

  • the DATA= data set in the PROC QUANTREG statement (by using the MWU option in the CONDDIST statement)

  • the TESTDATA= data set in the CONDDIST statement (by using the TESTDATA(MWU)= data set option in the CONDDIST statement)

The null and alternative hypotheses of the Mann-Whitney U test for the CONDDIST statement are, respectively,

$$\begin{aligned} H_0 &: P(\mathbf{Y}_1 > \mathbf{Y}_2) = P(\mathbf{Y}_1 < \mathbf{Y}_2) \\ H_1 &: P(\mathbf{Y}_1 > \mathbf{Y}_2) > P(\mathbf{Y}_1 < \mathbf{Y}_2) \ \text{ or } \ P(\mathbf{Y}_1 > \mathbf{Y}_2) < P(\mathbf{Y}_1 < \mathbf{Y}_2) \end{aligned}$$

where $\mathbf{Y}_1$ denotes the response random variable for a CDF under the hypothesis test, such as $F_{\mathbf{Y}}$ or $F_{\mathbf{Y}|\mathbf{x}}$, and $\mathbf{Y}_2$ denotes the random variable for the observable marginal CDF for either the DATA= data set or the TESTDATA= data set.

You can specify the size of the CDF samples by using the MWU(SAMPSIZE=NOBS | NQ) option or the TESTDATA(MWU(SAMPSIZE=NOBS | NQ)) option in the CONDDIST statement. The TESTDATA(MWU(SAMPSIZE=NOBS | NQ)) option overrides the MWU(SAMPSIZE=NOBS | NQ) option.

The SAMPSIZE=NOBS suboption requests that the size of the TrainObs CDF sample, the TestFit CDF sample, and all the predicted observationwise CDF samples be equal to the number of training observations in the DATA= data set in the PROC QUANTREG statement; and it requests that the size of the TestObs CDF sample be equal to the number of testing observations in the TESTDATA= data set in the CONDDIST statement. The SAMPSIZE=NOBS suboption is appropriate if the size of the quantile-level grid is larger than the number of training observations in the DATA= data set in the PROC QUANTREG statement. Otherwise, the Mann-Whitney U test could output an imprecise p-value because the fitted CDF samples might not sufficiently represent the CDF information that the original training data provide.

The SAMPSIZE=NQ suboption requests that the size of the CDF samples for the MWU tests be equal to the size of the quantile-level grid. The SAMPSIZE=NQ suboption is appropriate if the size of the quantile-level grid is smaller than both the number of training observations in the DATA= data set in the PROC QUANTREG statement and the number of testing observations in the TESTDATA= data set in the CONDDIST statement. Otherwise, the Mann-Whitney U test could output a smaller p-value and incorrectly reject $H_0$.

Marginal Distribution Analysis Using the Bootstrap Resampling Method

You can specify the MCDF option in the CONDDIST statement to perform the marginal distributions analysis by using a weighted bootstrap resampling method (Praestgaard and Wellner 1993). The bootstrap method iterates the following steps:

  1. Assign a probability weight $p_i$ to each of the training observations $(y_i, \mathbf{x}_i)$ in the DATA= data set in the PROC QUANTREG statement. If you specify weights $w_i$ by using the WEIGHT statement, $p_i$ is generated from the Gamma$(w_i, 1)$ distribution and normalized by using the factor $\frac{\sum w_i}{\sum p_i}$, such that the finally assigned $p_i$ satisfies $\sum p_i = \sum w_i$. Otherwise, $p_i$ is generated from the Exponential$(1)$ distribution and normalized by using $\frac{n}{\sum p_i}$, such that $\sum p_i = n$, where $n$ is the total number of training observations.

  2. Compute the following statistics by using the $p_i$-weighted training observations:

    a. Quantile process regression model: $\hat{\boldsymbol{\beta}}(\tau)$ by using the fast quantile process regression algorithm. You can use the FQPR(N=n) option in the MODEL statement to specify the quantile-level grid for your quantile process model.

    b. The TrainObs marginal CDF sample for the $p_i$-weighted response variable. Let $y_{(1)} \le y_{(2)} \le \cdots \le y_{(n)}$ denote the sorted response values for the training data; let $p_{(1)}, p_{(2)}, \ldots, p_{(n)}$ denote their associated normalized bootstrap weights; and define $p_{(0)} = 0$. Then the following formula computes a size-$m$ TrainObs marginal CDF sample:

      $$\left\{ \hat{Q}_{\mathbf{Y}}(\tau) = y_{(j)} \ \text{ if } \ \sum_{k=0}^{j-1} p_{(k)} < n\tau \le \sum_{k=0}^{j} p_{(k)} \ : \ \tau = \frac{i - 0.5}{m} \text{ for } i = 1, \ldots, m \right\}$$

      For the estimates in the marginal distribution comparison table, the size of the TrainObs marginal CDF sample equals the number of training observations. For the marginal CDF plot, the size of the TrainObs marginal CDF sample equals the number of quantile levels in the fast quantile process regression algorithm.

If you specify the TESTDATA= data set in the CONDDIST statement, the weighted bootstrap method also iterates the following steps:

  4. Assign a probability weight $p_j$ to each of the testing observations $(y_j, \mathbf{x}_j)$ by using the same method that is described in Step 1 for the TESTDATA= data set.

  5. Compute quantile predictions $\mathbf{x}_j'\hat{\boldsymbol{\beta}}(\tau)$ for all the testing observations by using the quantile process regression model fitted in Step 2a.

  6. Compute the following statistics:

    a. The TestObs marginal CDF sample for the $p_j$-weighted response variable. This marginal CDF sample is computed by using the same formula as in Step 2b but on the sorted testing response values.

    b. The TestFit marginal CDF sample for the pooled quantile predictions from Step 5. This marginal CDF sample is computed by using the same formula as in Step 2b but on the sorted pooled quantile predictions, where the weight of each quantile prediction equals the weight of its associated testing observation. You can use the QRNG= suboption in the MCDF option to specify the limits of the quantile predictions. By default, QRNG=BOTH limits the quantile predictions to be within the range of observed response values that combines the training data and the testing data. If a quantile prediction is smaller than the minimum value of all the training and testing response values, then the quantile prediction is reset to the minimum value. Similarly, if a quantile prediction is larger than the maximum value of all the training and testing response values, then the quantile prediction is reset to the maximum value.

    c. The CDF comparison scores, including the MWU test, mean difference, and median difference.

    For the estimates in the marginal distribution comparison table, the size of the TestObs and TestFit marginal CDF samples equals the number of testing observations. For the marginal CDF plot, the size of the TestObs and TestFit marginal CDF samples equals the number of quantile levels in the fast quantile process regression algorithm.
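
The weight generation in Step 1 and the weighted CDF sample formula in Step 2b can be sketched in Python as follows. This is an illustration of the formulas only, not the procedure's code: the function names are hypothetical, the random generator and seed are arbitrary, and `weighted_cdf_sample` assumes that the weights sum to n (the case without a WEIGHT statement).

```python
import random


def bootstrap_weights(n, rng, w=None):
    """Draw normalized bootstrap weights p_1, ..., p_n.

    Without w: Exponential(1) draws rescaled to sum to n.
    With w:    Gamma(w_i, 1) draws rescaled to sum to sum(w).
    """
    if w is None:
        p = [rng.expovariate(1.0) for _ in range(n)]
        total = n
    else:
        p = [rng.gammavariate(wi, 1.0) for wi in w]
        total = sum(w)
    scale = total / sum(p)
    return [pi * scale for pi in p]


def weighted_cdf_sample(y, p, m):
    """Size-m sample: pick y_(j) with cum_{j-1} < n*tau <= cum_j, tau = (i-0.5)/m."""
    pairs = sorted(zip(y, p))               # sort responses, keep their weights
    n = len(y)
    out = []
    for i in range(1, m + 1):
        target = n * (i - 0.5) / m
        cum = 0.0
        for yj, pj in pairs:
            cum += pj
            if target <= cum:
                break
        out.append(yj)                      # falls back to y_(n) on round-off
    return out


rng = random.Random(2024)                   # arbitrary seed for reproducibility
p = bootstrap_weights(5, rng)
print(round(sum(p), 9))                                        # 5.0
print(weighted_cdf_sample([3.0, 1.0, 2.0], [1.0, 1.0, 1.0], 3))  # [1.0, 2.0, 3.0]
```

With equal weights the CDF sample reduces to the sorted responses; a heavily weighted observation is repeated, as in `weighted_cdf_sample([1.0, 2.0, 3.0], [2.5, 0.25, 0.25], 3)`, which yields three copies of 1.0.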

The NREP=n suboption in the MCDF option specifies the total number of bootstrap iterations. Let $\hat{\theta}_s$ denote the statistic that is computed in the $s$th bootstrap iteration for a parameter $\theta$. This $\hat{\theta}_s$ could be a point in an empirical quantile function $\hat{Q}_{\mathbf{Y}}(t)$, an MWU test score, or any other statistic from Step 2 or Step 6. Accordingly, $\sum \hat{\theta}_s$ and $\sum \hat{\theta}_s^2$, respectively, represent the total cumulative sum and the total cumulative sum of squares of the relevant statistic for all the bootstrap iterations that are cumulated in Step 3 and Step 7. Then the formulas in the following table, in which $n$ is the NREP= value, calculate the final bootstrap estimate and its standard error, Z score, p-value, and confidence limits for each of the relevant statistics.

Statistic: Formula
Bootstrap estimate: $\hat{\theta} = \sum \hat{\theta}_s / n$
Standard error: $\mathrm{SE} = \sqrt{\left(\sum \hat{\theta}_s^2 / n\right) - \hat{\theta}^2}$
Z score: $Z = \hat{\theta} / \mathrm{SE}$
Two-sided p-value: $\begin{cases} 2\Phi(Z) & \text{if } Z < 0 \\ 2\Phi(-Z) & \text{if } Z \ge 0 \end{cases}$
Confidence limits: $\left(\hat{\theta} + \mathrm{SE}\,\Phi^{-1}(\alpha/2),\ \hat{\theta} - \mathrm{SE}\,\Phi^{-1}(\alpha/2)\right)$
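
The table formulas can be traced with a small Python sketch that starts from a list of per-iteration statistics. The function name `bootstrap_summary`, the default `alpha`, and the three input values are assumptions for illustration; the standard normal CDF and its inverse come from Python's `statistics.NormalDist`.

```python
import math
from statistics import NormalDist


def bootstrap_summary(stats, alpha=0.05):
    """Apply the table formulas to the per-iteration statistics theta_hat_s."""
    n = len(stats)                               # number of bootstrap iterations
    s1 = sum(stats)                              # cumulative sum
    s2 = sum(v * v for v in stats)               # cumulative sum of squares
    est = s1 / n
    se = math.sqrt(s2 / n - est ** 2)
    z = est / se
    phi = NormalDist().cdf
    p_value = 2 * phi(z) if z < 0 else 2 * phi(-z)
    zq = NormalDist().inv_cdf(alpha / 2)         # negative for alpha < 1
    limits = (est + se * zq, est - se * zq)      # (lower, upper)
    return est, se, z, p_value, limits


est, se, z, p_value, limits = bootstrap_summary([1.0, 2.0, 3.0])
print(est)  # 2.0
```

Because $\Phi^{-1}(\alpha/2)$ is negative, the first confidence limit is the lower one. A sample in which all iterations return the same value gives a zero standard error, so the Z score is undefined in that case.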

In the marginal CDF plot, the confidence limits of a CDF sample are the inverse functions of the confidence limits of its associated empirical quantile function.

Regression Quantile Level and Sample Quantile Level

Given a response value $y$ and a CDF sample for $F(t)$, $\{\hat{Q}(\tau_1), \ldots, \hat{Q}(\tau_q)\}$, the CONDDIST statement estimates the quantile level of $y$ on $F(t)$ by using the following linear interpolation method:

$$\hat{\tau}_y \begin{cases} < \tau_1 & \text{if } y < \hat{Q}(\tau_1) \\ = \lambda \tau_{i+1} + (1 - \lambda)\tau_i & \text{if } \hat{Q}(\tau_i) \le y \le \hat{Q}(\tau_{i+1}) \text{ for } \lambda = \dfrac{y - \hat{Q}(\tau_i)}{\hat{Q}(\tau_{i+1}) - \hat{Q}(\tau_i)} \\ > \tau_q & \text{if } y > \hat{Q}(\tau_q) \end{cases}$$

Here $\hat{\tau}_y$ is defined as a regression quantile level if the CDF sample is conditional for $F_{\mathbf{Y}|\mathbf{x}}(t)$, or a sample quantile level if the CDF sample is observed and marginal for $F_{\mathbf{Y}}(t)$.
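
A minimal Python sketch of this interpolation is shown below. The function name `quantile_level`, the tuple return convention, and the handling of two equal adjacent CDF sample values (the lower level is returned) are assumptions made for illustration.

```python
def quantile_level(y, q_hat, taus):
    """Estimate the quantile level of y on a CDF sample.

    q_hat : nondecreasing CDF sample Q_hat(tau_1), ..., Q_hat(tau_q)
    taus  : matching quantile levels tau_1, ..., tau_q
    Returns (relation, level), where relation is '<', '=', or '>'.
    """
    if y < q_hat[0]:
        return "<", taus[0]                 # level is below tau_1
    if y > q_hat[-1]:
        return ">", taus[-1]                # level is above tau_q
    for i in range(len(taus) - 1):
        lo, hi = q_hat[i], q_hat[i + 1]
        if lo <= y <= hi:
            # Assumption: for a flat segment, return the lower level.
            lam = 0.0 if hi == lo else (y - lo) / (hi - lo)
            return "=", lam * taus[i + 1] + (1 - lam) * taus[i]
    return "=", taus[-1]                    # only reached for a one-point grid


q_hat = [1.0, 2.0, 4.0, 8.0]
taus = [0.125, 0.375, 0.625, 0.875]
print(quantile_level(3.0, q_hat, taus))   # ('=', 0.5)
print(quantile_level(0.5, q_hat, taus))   # ('<', 0.125)
print(quantile_level(9.0, q_hat, taus))   # ('>', 0.875)
```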

The PLOT=PPPLOT option in the CONDDIST statement creates the scatter plot for the regression quantile levels versus the sample quantile levels.

Probability Density Functions

The CONDDIST statement estimates the PDF by applying the kernel density estimator to the estimated CDF in the quantile-level grid. An estimated CDF for the CONDDIST statement is in the form $\{\hat{F}(\hat{Q}(\tau_j)) = \tau_j : \tau_j = \tau_1 + (j - 1)\tau_s,\ j = 1, \ldots, q\}$. The lower end of the estimated PDF is limited by the quantile level $\max(0, \tau_1 - 0.5\tau_s)$, and the upper end of the estimated PDF is limited by the quantile level $\min(\tau_q + 0.5\tau_s, 1)$.

The general form of the kernel density estimator is

$$\hat{f}_{\lambda}(t) = \frac{1}{q \lambda l} \sum_{j=1}^{q} K_0\left(\frac{t - t_j}{\lambda}\right)$$

where

  • $K_0(\cdot)$ is the kernel function

  • $\lambda$ is the bandwidth

  • $q$ is the sample size

  • $t_j$ is the $j$th value $\hat{F}(\hat{Q}(\tau_j))$

  • $l = q\tau_s$ is the length of the range of the quantile-level grid
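The estimator can be written out directly from the formula and the definitions above. The following Python sketch is a literal transcription under stated assumptions, not the procedure's implementation: the function names are hypothetical, the normal kernel is used as the kernel function, and the input values are whatever points the caller supplies.

```python
import math


def normal_kernel(u):
    """Standard normal density, used here as the kernel function K0."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)


def kde(t, points, lam, tau_s, kernel=normal_kernel):
    """Evaluate 1 / (q * lam * l) * sum_j K0((t - t_j) / lam).

    points : the q values t_j
    lam    : the bandwidth lambda
    tau_s  : the step of the quantile-level grid, so that l = q * tau_s
    """
    q = len(points)
    l = q * tau_s
    total = sum(kernel((t - t_j) / lam) for t_j in points)
    return total / (q * lam * l)
```

When the grid spans the whole unit interval, so that q times the step equals 1, the factor l drops out and the expression reduces to the usual kernel density estimator.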

The KDE option provides three kernel functions ($K_0$): normal, quadratic, and triangular. You can specify the function by using the K= kernel-option in parentheses after the KDE option. The values of the K= option are NORMAL, QUADRATIC, and TRIANGULAR. By default, a normal kernel is used. The formulas for the kernel functions are as follows:

$$
\begin{array}{lll}
\text{Normal} & K_0(t) = \dfrac{1}{\sqrt{2\pi}} \exp\left(-\tfrac{1}{2} t^2\right) & \text{for } -\infty < t < \infty \\
\text{Quadratic} & K_0(t) = \tfrac{3}{4}\left(1 - t^2\right) & \text{for } |t| \le 1 \\
\text{Triangular} & K_0(t) = 1 - |t| & \text{for } |t| \le 1
\end{array}
$$
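As a quick check on these formulas, the following Python sketch defines the three kernels and approximates the integral of each one numerically. It assumes the usual convention that the quadratic and triangular kernels are zero outside the stated range; the function names and the `integrate` helper are hypothetical.

```python
import math


def normal_kernel(t):
    """K0(t) = exp(-t^2 / 2) / sqrt(2 * pi), for all real t."""
    return math.exp(-0.5 * t * t) / math.sqrt(2.0 * math.pi)


def quadratic_kernel(t):
    """K0(t) = (3/4) * (1 - t^2) for |t| <= 1, assumed zero elsewhere."""
    return 0.75 * (1.0 - t * t) if abs(t) <= 1.0 else 0.0


def triangular_kernel(t):
    """K0(t) = 1 - |t| for |t| <= 1, assumed zero elsewhere."""
    return 1.0 - abs(t) if abs(t) <= 1.0 else 0.0


# Keys mirror the values of the K= option named in the text.
KERNELS = {
    "NORMAL": normal_kernel,
    "QUADRATIC": quadratic_kernel,
    "TRIANGULAR": triangular_kernel,
}


def integrate(func, lo, hi, n=20000):
    """Midpoint-rule approximation of the integral of func over [lo, hi]."""
    h = (hi - lo) / n
    return h * sum(func(lo + (i + 0.5) * h) for i in range(n))
```

Each kernel integrates to approximately 1 over the interval from -10 to 10, which is what makes the resulting estimate a density.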

The value of $\lambda$, referred to as the bandwidth parameter, determines the degree of smoothness in the estimated probability density function. You specify $\lambda$ indirectly when you specify a standardized bandwidth $c$ by using the C=value suboption. Let $Q$ denote the interquartile range and $q$ denote the sample size; then $c$ is related to $\lambda$ by the formula

$$\lambda = c \, Q \, q^{-1/5}$$
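The relationship between the standardized bandwidth and the bandwidth is simple arithmetic, as the following Python sketch shows. The function names are hypothetical, and the interquartile range is passed in as a number because this section does not describe how the procedure computes it.

```python
def bandwidth(c, iqr, q):
    """Convert a standardized bandwidth c to lambda = c * Q * q ** (-1/5)."""
    return c * iqr * q ** (-0.2)


def standardized_bandwidth(lam, iqr, q):
    """Invert the relation: c = lambda / (Q * q ** (-1/5))."""
    return lam / (iqr * q ** (-0.2))


# Hypothetical values: c = 1, Q = 2, q = 32, so q ** (-1/5) is 0.5.
lam = bandwidth(1.0, 2.0, 32)
```

With these values the bandwidth is approximately 1. Because of the exponent of -1/5, the bandwidth shrinks slowly as the sample size grows: multiplying q by 32 only halves it.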

For a specific kernel function, the discrepancy between the density estimator $\hat{f}_{\lambda}(x)$ and the true density $f(x)$ is measured by the mean integrated square error (MISE):

$$\mathrm{MISE}(\lambda) = \int_x \left\{ E\left(\hat{f}_{\lambda}(x)\right) - f(x) \right\}^2 dx + \int_x \mathrm{Var}\left(\hat{f}_{\lambda}(x)\right) dx$$

MISE is the sum of the integrated squared bias and the integrated variance. An approximate mean integrated square error (AMISE) is defined as

$$\mathrm{AMISE}(\lambda) = \frac{1}{4} \lambda^4 \left( \int_t t^2 K(t) \, dt \right)^2 \int_x \left( f''(x) \right)^2 dx + \frac{1}{q \lambda} \int_t K(t)^2 \, dt$$

You can derive a bandwidth that minimizes AMISE by treating $f(x)$ as the normal density whose parameters $\mu$ and $\sigma$ are estimated by the sample mean and standard deviation, respectively. If you do not specify a bandwidth parameter or if you specify C=MISE, the bandwidth that minimizes AMISE is used. The value of AMISE can be used to compare different density estimates. You can also specify C=SJPI to select the bandwidth by using the Sheather-Jones plug-in method (Jones, Marron, and Sheather 1996).
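The normal-reference derivation can be sketched numerically. The following Python example evaluates the AMISE expression as written above for the normal kernel and a normal reference density, and computes the bandwidth obtained by setting the derivative with respect to the bandwidth to zero. It is a textbook calculation under stated assumptions, not a description of the procedure's internal computation: the function names are hypothetical, and the constants are the standard values for the normal kernel and for the integrated squared second derivative of a normal density.

```python
import math

# Standard constants for the normal kernel.
MU2_NORMAL = 1.0                               # integral of t^2 * K(t) dt
RK_NORMAL = 1.0 / (2.0 * math.sqrt(math.pi))   # integral of K(t)^2 dt


def roughness_normal(sigma):
    """Integral of f''(x)^2 dx when f is a normal density with std dev sigma."""
    return 3.0 / (8.0 * math.sqrt(math.pi) * sigma ** 5)


def amise(lam, q, sigma, mu2=MU2_NORMAL, rk=RK_NORMAL):
    """AMISE(lambda) = (1/4) lam^4 mu2^2 R(f'') + rk / (q * lam)."""
    bias_term = 0.25 * lam ** 4 * mu2 ** 2 * roughness_normal(sigma)
    variance_term = rk / (q * lam)
    return bias_term + variance_term


def amise_bandwidth(q, sigma, mu2=MU2_NORMAL, rk=RK_NORMAL):
    """Bandwidth at which the derivative of AMISE with respect to lam is zero."""
    return (rk / (mu2 ** 2 * roughness_normal(sigma) * q)) ** 0.2
```

For the normal kernel this reduces to (4/3)^(1/5) times sigma times q^(-1/5), roughly 1.06 sigma q^(-1/5). The mean of the reference density does not enter the result; only the standard deviation does.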

The general kernel density estimates assume that the domain of the density to be estimated is the entire real line. However, sometimes the domain of a density is an interval that is bounded on one or both sides. For example, if a variable Y is a measurement that takes only positive values, then the kernel density curve should be bounded so that it is zero for negative Y values. You can use the LOWER= and UPPER= kde-options in the PDF= option in the CONDDIST statement to specify the bounds.

Last updated: December 09, 2022