The KDE Procedure

UNIVAR Statement

  • UNIVAR variable <(v-options)> <…variable <(v-options)>> </ options>;

The UNIVAR statement computes univariate kernel density estimates. You can specify various v-options for each variable by enclosing them in parentheses after the variable name. You can also specify global options following a slash (/). Global options apply to all the variables specified in the UNIVAR statement. However, individual variable v-options override the global options.

Table 2 summarizes the options available in the UNIVAR statement.

Table 2: UNIVAR Statement Options

Option         Description
BW=            Specifies a bandwidth
BWM=           Specifies a bandwidth multiplier
CDF            Produces the distribution function
GRIDL=         Specifies a lower grid limit
GRIDU=         Specifies an upper grid limit
LEVELS         Produces a table of levels for contours of the univariate density
METHOD=        Specifies which method to use to compute the bandwidth
NGRID=         Specifies a number of grid points
NOPRINT        Suppresses output tables
OUT=           Specifies the output SAS data set to contain the kernel density estimate
PERCENTILES    Produces a table of percentiles
PLOTS=         Produces plots of the univariate kernel density estimate
SJPIMAX=       Specifies the maximum grid value in determining the Sheather-Jones plug-in bandwidth
SJPIMIN=       Specifies the minimum grid value in determining the Sheather-Jones plug-in bandwidth
SJPINUM=       Specifies the number of grid values to be used in determining the Sheather-Jones plug-in bandwidth
SJPITOL=       Specifies the tolerance for termination of the bisection algorithm
TRUNCATE       Restricts the lower and upper grid limits to the minimum and maximum observed values, respectively, for each variable
UNISTATS       Produces, for each variable, a table that contains standard univariate statistics and the bandwidth


You can specify the following options in the UNIVAR statement. Some options can be used as v-options, as indicated in the description of the option.

BW=number

specifies the bandwidth to apply to each variable in each kernel density estimate. Larger bandwidths produce a smoother estimate, whereas smaller bandwidths produce a rougher estimate. To specify different bandwidths for different variables, specify BW=number as a v-option. By default, the bandwidth is set automatically by the Sheather-Jones plug-in method (see the section Bandwidth Selection).

BWM=number

specifies the bandwidth multiplier to apply to the corresponding bandwidth for each variable. Values of number greater than 1 increase the effective bandwidth and produce a smoother estimate. Values less than 1 decrease the effective bandwidth and produce a rougher estimate. To specify different bandwidth multipliers for different variables, specify BWM=number as a v-option. By default, BWM=1.
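As a minimal sketch of the BW= and BWM= options, assume the data set MyData and the variables x1 and x2 from the Examples section. The numeric values are arbitrary and chosen only for illustration. The first step fixes the bandwidths directly; the second step keeps the automatically selected bandwidths and scales them:

/* Fixed bandwidths: x1 uses its v-option BW=0.25, x2 uses the global BW=0.5 */
proc kde data=MyData;
   univar x1 (bw=0.25) x2 / bw=0.5;
run;

/* Scaled bandwidths: the automatically selected bandwidth is */
/* doubled for x1 (smoother) and halved for x2 (rougher)      */
proc kde data=MyData;
   univar x1 (bwm=2) x2 (bwm=0.5);
run;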

CDF

computes the distribution function in addition to the density function for each variable. The distribution function is obtained by a seminumerical technique as described in the section Kernel Distribution Estimates.

GRIDL=number

specifies a lower grid limit for each kernel density estimate. To specify different lower grid limits for different variables, specify GRIDL=number as a v-option. The default value for a particular variable is a function of both the kernel bandwidth and the minimum observed value for that variable.

GRIDU=number

specifies an upper grid limit for each kernel density estimate. To specify different upper grid limits for different variables, specify GRIDU=number as a v-option. The default value for a particular variable is a function of both the kernel bandwidth and the maximum observed value for that variable.
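The following sketch shows one way to combine global and per-variable grid limits. It assumes the data set MyData and the variables x1 and x2 from the Examples section; the limits themselves are illustrative values, not recommendations:

proc kde data=MyData;
   /* x1 is evaluated on [0, 10] because its v-options override the global limits */
   /* x2 is evaluated on the global range [0, 100]                                */
   univar x1 (gridl=0 gridu=10) x2 / gridl=0 gridu=100;
run;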

LEVELS
LEVELS=(numlist)

computes a table of levels (called "Levels") for contours of the univariate density. The number of contours is equal to the number of values in numlist, where each value in numlist specifies a percentage to be used in calculating the density area that is enclosed by the contour. The contours are defined such that the density has a constant level along each contour, and the area enclosed by each contour corresponds to the total density area minus the specified percentage of the total area. In other words, the contours correspond to slices or levels of the density surface that are taken along the density axis. The "Levels" table also provides the minimum and maximum values for each contour along the direction of the data variable. By default, LEVELS=(1, 5, 10, 50, 90, 95, 99, 100).
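As an illustration, the following statements request the "Levels" table for three percentages instead of the default list. The data set MyData and the variable x1 are taken from the Examples section, and the percentages are arbitrary:

proc kde data=MyData;
   /* One contour level is reported for each of the three percentages */
   univar x1 / levels=(10, 50, 90);
run;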

METHOD=SJPI | SNR | SNRQ | SROT | OS

specifies the method for computing the bandwidth. You can specify the following values:

SJPI

Sheather-Jones plug-in method

SNR

simple normal reference method

SNRQ

simple normal reference method using interquartile range

SROT

Silverman’s rule of thumb method

OS

oversmoothed method

For a description of these methods, see the section Bandwidth Selection and Jones, Marron, and Sheather (1996). By default, METHOD=SJPI.

Note: The BW= option takes precedence over the METHOD= option. If both are specified, the METHOD= option is ignored.
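To see the effect of the bandwidth selection method, you might run the same variable under two methods and compare the reported bandwidths. The following sketch assumes the data set MyData and the variable x1 from the Examples section, and it uses the UNISTATS option so that the bandwidth appears in the output:

/* Default method: Sheather-Jones plug-in */
proc kde data=MyData;
   univar x1 / method=sjpi unistats;
run;

/* Simple normal reference method, for comparison */
proc kde data=MyData;
   univar x1 / method=snr unistats;
run;

Adding BW= to either statement would make the METHOD= option irrelevant, as the preceding note describes.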

NGRID=number
NG=number

specifies the number of grid points to use for each kernel density estimate. To specify different numbers of grid points for different variables, specify NGRID=number as a v-option. By default, NGRID=401.

NOPRINT

suppresses output tables. You can use this option when you want to produce only graphical output.

OUT=SAS-data-set

names the output SAS data set to contain the kernel density estimate. This output data set contains the following variables:

  • var, whose value is the name of the variable in the kernel density estimate

  • value, whose value corresponds to grid coordinates for the variable

  • density, whose values are equal to the kernel density estimate at the associated grid point

  • count, whose values indicate the number of original observations contained in the bin that corresponds to a grid point

  • distribution, whose values are equal to the distribution estimate at the associated grid point (appears only when the CDF global option is specified)
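The following sketch writes the estimate to a data set and prints its first few rows. The output data set name DensOut is a hypothetical name chosen for this illustration; MyData and x1 come from the Examples section:

proc kde data=MyData;
   /* CDF adds the distribution variable to the output data set */
   /* NOPRINT suppresses the displayed tables                   */
   univar x1 / out=DensOut cdf noprint;
run;

/* Inspect the first 10 grid points of the saved estimate */
proc print data=DensOut(obs=10);
run;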

PERCENTILES
PERCENTILES=numlist

produces a table of percentiles for each UNIVAR variable. You can specify a list of percentiles to be computed in numlist. The default percentiles are 0.5, 1, 2.5, 5, 10, 25, 50, 75, 90, 95, 97.5, 99, and 99.5.

PLOTS=(plot-request<(options)> <…plot-request <(options)>>)

specifies which plots of the univariate kernel density estimate to produce. When you specify only one plot-request, you can omit the parentheses around the plot-request.

ODS Graphics must be enabled before plots can be requested. For example:

ods graphics on;

proc kde data=channel;
   univar length / plots=histdensity;
run;

ods graphics off;

For more information about enabling and disabling ODS Graphics, see the section Enabling and Disabling ODS Graphics in Chapter 24, Statistical Graphics Using ODS.

You can specify the following plot-requests, each of which (except for DENSITYOVERLAY) produces a separate plot for every variable listed in the UNIVAR statement:

ALL

produces all plots.

DENSITY

produces the univariate kernel density estimate curve.

DENSITYOVERLAY

produces the overlaid univariate kernel density estimate curves. If you specify more than one variable in the UNIVAR statement, PROC KDE overlays the density curves for all the variables on a single plot.

HISTDENSITY

produces the univariate histogram of the data overlaid with the kernel density estimate curve.

HISTOGRAM

produces the univariate histogram of data.

NONE

suppresses all plots.

By default, if ODS Graphics is enabled and you do not specify the PLOTS= option, then the UNIVAR statement creates a histogram overlaid with a kernel density estimate. If you specify the PLOTS= option, only the requested plots are created.
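When you request more than one plot, enclose the plot-requests in parentheses. The following sketch assumes the data set MyData and the variables x1 and x2 from the Examples section:

ods graphics on;

proc kde data=MyData;
   /* DENSITY and HISTOGRAM produce one plot per variable;       */
   /* DENSITYOVERLAY draws the x1 and x2 curves on a single plot */
   univar x1 x2 / plots=(density histogram densityoverlay);
run;

ods graphics off;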

SJPIMAX=number

specifies the maximum grid value in determining the Sheather-Jones plug-in bandwidth. The default value is two times the oversmoothed estimate.

SJPIMIN=number

specifies the minimum grid value in determining the Sheather-Jones plug-in bandwidth. The default value is the maximum grid value (see the SJPIMAX= option) divided by 18.

SJPINUM=number

specifies the number of grid values to use in determining the Sheather-Jones plug-in bandwidth. By default, SJPINUM=21.

SJPITOL=number

specifies the tolerance for terminating the bisection algorithm that is used in computing the Sheather-Jones plug-in bandwidth. By default, SJPITOL=0.001.
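The SJPI options matter only when the Sheather-Jones plug-in method selects the bandwidth. As a sketch, the following statements search a finer grid with a tighter tolerance than the defaults. The values 41 and 0.0001 are illustrative assumptions, and MyData and x1 come from the Examples section:

proc kde data=MyData;
   /* Use 41 grid values instead of 21 and tighten the */
   /* bisection tolerance from 0.001 to 0.0001         */
   univar x1 / method=sjpi sjpinum=41 sjpitol=0.0001;
run;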

TRUNCATE

sets the lower grid limit for each variable to the minimum observed value for that variable, and sets the upper grid limit for each variable to the maximum observed value for that variable.

Note: The GRIDL= and GRIDU= options take precedence over the TRUNCATE option. If one or both are specified, the corresponding lower and upper grid limits are set to the specified values.
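The following sketch illustrates how TRUNCATE interacts with an explicit grid limit. It assumes the data set MyData and the variables x1 and x2 from the Examples section; the value GRIDL=0 is arbitrary:

proc kde data=MyData;
   /* x1: both grid limits are set to the observed minimum and maximum */
   /* x2: GRIDL=0 takes precedence for the lower limit, so TRUNCATE    */
   /*     affects only the upper limit                                 */
   univar x1 x2 (gridl=0) / truncate;
run;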

UNISTATS

produces, for each variable, a table that contains standard univariate statistics and the bandwidth that is used to compute its kernel density estimate. The statistics listed are the mean, variance, standard deviation, range, and interquartile range.
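For a quick numerical summary, the table-producing options can be requested together. The following sketch assumes the data set MyData and the variables x1 and x2 from the Examples section, and it relies on the default percentile list:

proc kde data=MyData;
   /* For each variable, request the summary statistics table   */
   /* (including the bandwidth) and the default percentile table */
   univar x1 x2 / unistats percentiles;
run;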

Examples

Suppose the data set MyData contains the variables x1, x2, x3, and x4. The following statements request a univariate kernel density estimate for each of these variables:

proc kde data=MyData;
   univar x1 x2 x3 x4;
run;

You can also specify different bandwidths and other options for each variable. For example, the following statements request kernel density estimates that use Silverman’s rule of thumb (SROT) method for all variables:

proc kde data=MyData;
   univar x1 (bwm=2)
          x2 (bwm=0.5 ngrid=100)
          x3 x4 / ngrid=200 method=srot;
run;

The NGRID=200 option applies to the variables x1, x3, and x4, but the v-option NGRID=100 applies only to x2. Bandwidth multipliers of 2 and 0.5 are specified for the variables x1 and x2, respectively.

Last updated: December 09, 2022