The KDE Procedure

Bandwidth Selection

Several different bandwidth selection methods are available in PROC KDE in the univariate case. Following the recommendations of Jones, Marron, and Sheather (1996), the default method follows a plug-in formula of Sheather and Jones.

This method solves the fixed-point equation

\[
h = \left[ \frac{R(\phi)}{n \, R\!\left(\hat{f}''_{g(h)}\right) \left( \int x^2 \phi(x) \, dx \right)^2} \right]^{1/5}
\]

where $R(\phi) = \int \phi^2(x) \, dx$.

PROC KDE solves this equation by first evaluating it on a grid of values spaced equally on a log scale. The largest two values from this grid that bound a solution are then used as starting values for a bisection algorithm.
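The grid-then-bisection search described above can be sketched in Python. This is only an illustration under stated assumptions, not the PROC KDE implementation: the function name `solve_fixed_point`, the grid size, the tolerance, and the search interval are all hypothetical, and the actual right-hand side of the fixed-point equation is replaced by a toy function with two known solutions (0.5 and 2.0), so that the "largest bracketing pair" behavior is visible.

```python
def solve_fixed_point(rhs, lo, hi, n_grid=50, tol=1e-10, max_iter=200):
    """Solve h = rhs(h) on [lo, hi] with a log-spaced grid plus bisection.

    All parameter names and defaults are illustrative assumptions.
    """
    # Grid of candidate bandwidths, equally spaced on a log scale.
    grid = [lo * (hi / lo) ** (i / (n_grid - 1)) for i in range(n_grid)]
    resid = [h - rhs(h) for h in grid]

    # Scan from the top so that the largest adjacent pair that brackets
    # a sign change of the residual is chosen.
    for i in range(n_grid - 1, 0, -1):
        if resid[i] == 0.0:
            return grid[i]
        if resid[i - 1] == 0.0:
            return grid[i - 1]
        if resid[i - 1] * resid[i] < 0.0:
            a, b, fa = grid[i - 1], grid[i], resid[i - 1]
            break
    else:
        raise ValueError("no grid interval brackets a solution")

    # Plain bisection on the bracketing interval [a, b].
    for _ in range(max_iter):
        mid = 0.5 * (a + b)
        fm = mid - rhs(mid)
        if fm == 0.0 or (b - a) < tol:
            return mid
        if fa * fm < 0.0:
            b = mid
        else:
            a, fa = mid, fm
    return 0.5 * (a + b)


def toy_rhs(h):
    # Toy stand-in: h - toy_rhs(h) = (h - 0.5) * (h - 2.0), roots 0.5 and 2.0.
    return h - (h - 0.5) * (h - 2.0)
```

Calling `solve_fixed_point(toy_rhs, 0.01, 10.0)` converges to the larger of the two solutions, because the scan starts at the top of the grid.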

The simple normal reference rule works by assuming that $\hat{f}$ is Gaussian in the preceding fixed-point equation. This results in

\[
\begin{aligned}
h &= \hat{\sigma} \left[ 4 / (3n) \right]^{1/5} \\
  &\approx 1.06 \, \hat{\sigma} \, n^{-1/5}
\end{aligned}
\]

where $\hat{\sigma}$ is the sample standard deviation.

Alternatively, the bandwidth can be computed using the interquartile range, Q:

\[
\begin{aligned}
h &= 1.06 \, \hat{\sigma} \, n^{-1/5} \\
  &\approx 1.06 \, (Q / 1.34) \, n^{-1/5} \\
  &\approx 0.785 \, Q \, n^{-1/5}
\end{aligned}
\]

Silverman’s rule of thumb (Silverman 1986, Section 3.4.2) is computed as

\[
h = 0.9 \, \min\left[ \hat{\sigma}, \; Q / 1.34 \right] n^{-1/5}
\]

The oversmoothed bandwidth is computed as

\[
h = 3 \hat{\sigma} \left[ \frac{1}{70 \sqrt{\pi} \, n} \right]^{1/5}
\]
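The closed-form rules above (normal reference, interquartile-range variant, Silverman's rule of thumb, and the oversmoothed bandwidth) are simple enough to evaluate by hand. The following Python sketch does so for unweighted data. The function name and dictionary keys are hypothetical, and two details are assumptions rather than documented PROC KDE behavior: the standard deviation uses the n - 1 divisor, and the quartiles use the "inclusive" interpolation of Python's `statistics.quantiles`, which may differ from the percentile definition that PROC KDE applies.

```python
import math
import statistics


def rule_of_thumb_bandwidths(data):
    """Evaluate the closed-form univariate bandwidth rules for a list of numbers."""
    n = len(data)
    sigma = statistics.stdev(data)  # sample SD, n - 1 divisor (assumption)
    # Quartile definition is an assumption; other definitions give other Q.
    q1, _, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    scale = n ** (-1 / 5)
    return {
        # h = sigma * [4 / (3n)]^(1/5), roughly 1.06 * sigma * n^(-1/5)
        "normal_reference": sigma * (4 / (3 * n)) ** (1 / 5),
        # h = 0.785 * Q * n^(-1/5)
        "normal_reference_iqr": 0.785 * iqr * scale,
        # h = 0.9 * min(sigma, Q / 1.34) * n^(-1/5)
        "silverman": 0.9 * min(sigma, iqr / 1.34) * scale,
        # h = 3 * sigma * [1 / (70 * sqrt(pi) * n)]^(1/5)
        "oversmoothed": 3 * sigma * (1 / (70 * math.sqrt(math.pi) * n)) ** (1 / 5),
    }
```

For any data set, these formulas imply that the oversmoothed bandwidth exceeds the normal reference bandwidth, which in turn exceeds Silverman's rule of thumb, because the leading constants are roughly 1.14, 1.06, and at most 0.9 times $\hat{\sigma} n^{-1/5}$.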

When you specify a WEIGHT variable, PROC KDE uses weighted versions of $Q_3$, $Q_1$, and $\hat{\sigma}$ in the preceding expressions. The weighted quartiles are computed as weighted order statistics, and the weighted variance takes the form

\[
\hat{\sigma}^2 = \frac{\sum_{i=1}^{n} W_i \left( X_i - \bar{X} \right)^2}{\sum_{i=1}^{n} W_i}
\]

where $\bar{X} = \left( \sum_{i=1}^{n} W_i X_i \right) / \left( \sum_{i=1}^{n} W_i \right)$ is the weighted sample mean.
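As a small worked illustration of the weighted mean and weighted variance formulas, the following Python sketch evaluates both directly. The function name is hypothetical, and the sketch assumes the weights are nonnegative with a positive sum. Note that, as the formula is written, the divisor is the sum of the weights, so unit weights give the variance with divisor n rather than n - 1.

```python
def weighted_mean_and_variance(x, w):
    """Return (weighted mean, weighted variance) using the sum of weights as divisor."""
    total_w = sum(w)
    if total_w <= 0:
        raise ValueError("weights must have a positive sum")
    # Weighted sample mean: sum(W_i * X_i) / sum(W_i)
    mean = sum(wi * xi for wi, xi in zip(w, x)) / total_w
    # Weighted variance: sum(W_i * (X_i - mean)^2) / sum(W_i)
    var = sum(wi * (xi - mean) ** 2 for wi, xi in zip(w, x)) / total_w
    return mean, var
```

For example, with observations 1, 2, 3, 4 and weights 1, 0, 0, 1, only the two extreme points contribute, so the mean stays at 2.5 while the variance rises from 1.25 (unit weights) to 2.25.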

For the bivariate case, Wand and Jones (1993) note that automatic bandwidth selection is both difficult and computationally expensive. Their study of various ways of specifying a bandwidth matrix also shows that using two bandwidths, one in each coordinate’s direction, is often adequate. PROC KDE enables you to adjust the two bandwidths by specifying a multiplier for the default bandwidths recommended by Bowman and Foster (1993):

\[
\begin{aligned}
h_X &= \hat{\sigma}_X \, n^{-1/6} \\
h_Y &= \hat{\sigma}_Y \, n^{-1/6}
\end{aligned}
\]

Here $\hat{\sigma}_X$ and $\hat{\sigma}_Y$ are the sample standard deviations of X and Y, respectively. These are the optimal bandwidths for two independent normal variables that have the same variances as X and Y. They are, therefore, conservative in the sense that they tend to oversmooth the surface.

You can specify the BWM= option to scale these default bandwidths so that they provide the amount of smoothing appropriate for your application.
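The effect of a bandwidth multiplier on the default bivariate bandwidths can be sketched in Python as follows. This is an illustration of the arithmetic only: the function name and the `bwm_x` and `bwm_y` parameters are hypothetical stand-ins for the multiplier, not PROC KDE syntax, and the sample standard deviation is assumed to use the n - 1 divisor.

```python
import statistics


def bivariate_bandwidths(x, y, bwm_x=1.0, bwm_y=1.0):
    """Return (h_X, h_Y): default bandwidths sigma * n^(-1/6), each scaled by a multiplier."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    n = len(x)
    scale = n ** (-1 / 6)
    # Default bandwidths: one per coordinate direction.
    h_x = statistics.stdev(x) * scale
    h_y = statistics.stdev(y) * scale
    # Multipliers below 1 reduce smoothing; multipliers above 1 increase it.
    return bwm_x * h_x, bwm_y * h_y
```

Because the default bandwidths tend to oversmooth, a multiplier below 1 (for example, `bwm_x=0.5`) is a natural first adjustment to try when the estimated surface hides structure that you expect to see.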

Last updated: December 09, 2022