The KDE Procedure

Binning

Binning, or assigning data to discrete categories, is an effective and fast method for large data sets (Fan and Marron 1994). When the sample size n is large, direct evaluation of the kernel estimate ModifyingAbove f With caret at any point would involve n kernel evaluations, as shown in the preceding formulas. To evaluate the estimate at each point of a grid of size g would thus require n g kernel evaluations. When you use g = 401 in the univariate case or g equals 60 times 60 equals 3600 in the bivariate case and n greater-than-or-equal-to 1000, the amount of computation can be prohibitively large. With binning, however, the computational order is reduced to g, resulting in a much quicker algorithm that is nearly as accurate as direct evaluation.

To bin a set of weighted univariate data upper X 1 comma upper X 2 comma ellipsis comma upper X Subscript n Baseline to a grid x 1 comma x 2 comma ellipsis comma x Subscript g Baseline, simply assign each sample upper X Subscript i, together with its weight upper W Subscript i, to the nearest grid point x Subscript j (also called the bin center). When binning is completed, each grid point x Subscript i has an associated number c Subscript i, which is the sum total of all the weights that correspond to sample points that have been assigned to x Subscript i. These c Subscript is are known as the bin counts.

This procedure replaces the data left-parenthesis upper X Subscript i Baseline comma upper W Subscript i Baseline right-parenthesis comma i equals 1 comma 2 comma ellipsis comma n, with the smaller set left-parenthesis x Subscript i Baseline comma c Subscript i Baseline right-parenthesis comma i equals 1 comma 2 comma ellipsis comma g, and the estimation is carried out with these new data. This is so-called simple binning, versus the finer linear binning described in Wand (1994). PROC KDE uses simple binning for the sake of faster and easier implementation. Also, it is assumed that the bin centers x 1 comma x 2 comma ellipsis comma x Subscript g Baseline are equally spaced and in increasing order. In addition, assume for notational convenience that sigma-summation Underscript i equals 1 Overscript n Endscripts upper W Subscript i Baseline equals n and, therefore, sigma-summation Underscript i equals 1 Overscript g Endscripts c Subscript i Baseline equals n.

If you replace the data left-parenthesis upper X Subscript i Baseline comma upper W Subscript i Baseline right-parenthesis comma i equals 1 comma 2 comma ellipsis comma n, with left-parenthesis x Subscript i Baseline comma c Subscript i Baseline right-parenthesis comma i equals 1 comma 2 comma ellipsis comma g, the weighted estimator ModifyingAbove f With caret then becomes

ModifyingAbove f With caret left-parenthesis x right-parenthesis equals StartFraction 1 Over n EndFraction sigma-summation Underscript i equals 1 Overscript g Endscripts c Subscript i Baseline phi Subscript h Baseline left-parenthesis x minus x Subscript i Baseline right-parenthesis

with the same notation as used previously. To evaluate this estimator at the g points of the same grid vector g r i d equals left-parenthesis x 1 comma x 2 comma ellipsis comma x Subscript g Baseline right-parenthesis prime is to calculate

ModifyingAbove f With caret left-parenthesis x Subscript i Baseline right-parenthesis equals StartFraction 1 Over n EndFraction sigma-summation Underscript j equals 1 Overscript g Endscripts c Subscript j Baseline phi Subscript h Baseline left-parenthesis x Subscript i Baseline minus x Subscript j Baseline right-parenthesis

for i equals 1 comma 2 comma ellipsis comma g. This can be rewritten as

ModifyingAbove f With caret left-parenthesis x Subscript i Baseline right-parenthesis equals StartFraction 1 Over n EndFraction sigma-summation Underscript j equals 1 Overscript g Endscripts c Subscript j Baseline phi Subscript h Baseline left-parenthesis StartAbsoluteValue i minus j EndAbsoluteValue delta right-parenthesis

where delta equals x 2 minus x 1 is the increment of the grid.

The same idea of binning works similarly with bivariate data, where you estimate ModifyingAbove f With caret over the grid matrix g r i d equals g r i d Subscript upper X Baseline times g r i d Subscript upper Y as follows:

g r i d equals Start 4 By 4 Matrix 1st Row 1st Column left-parenthesis x 1 comma y 1 right-parenthesis 2nd Column left-parenthesis x 1 comma y 2 right-parenthesis 3rd Column ellipsis 4th Column left-parenthesis x 1 comma y Subscript g Sub Subscript upper Y Subscript Baseline right-parenthesis 2nd Row 1st Column left-parenthesis x 2 comma y 1 right-parenthesis 2nd Column left-parenthesis x 2 comma y 2 right-parenthesis 3rd Column ellipsis 4th Column left-parenthesis x 2 comma y Subscript g Sub Subscript upper Y Subscript Baseline right-parenthesis 3rd Row 1st Column vertical-ellipsis 4th Row 1st Column left-parenthesis x Subscript g Sub Subscript upper X Subscript Baseline comma y 1 right-parenthesis 2nd Column left-parenthesis x Subscript g Sub Subscript upper X Subscript Baseline comma y 2 right-parenthesis 3rd Column ellipsis 4th Column left-parenthesis x Subscript g Sub Subscript upper X Subscript Baseline comma y Subscript g Sub Subscript upper Y Subscript Baseline right-parenthesis EndMatrix

The density estimates are then

ModifyingAbove f With caret left-parenthesis x Subscript i Baseline comma y Subscript j Baseline right-parenthesis equals StartFraction 1 Over n EndFraction sigma-summation Underscript k equals 1 Overscript g Subscript upper X Baseline Endscripts sigma-summation Underscript l equals 1 Overscript g Subscript upper Y Baseline Endscripts c Subscript k comma l Baseline phi Subscript bold h Baseline left-parenthesis StartAbsoluteValue i minus k EndAbsoluteValue delta Subscript upper X Baseline comma StartAbsoluteValue j minus l EndAbsoluteValue delta Subscript upper Y Baseline right-parenthesis

where delta Subscript upper X Baseline equals x 2 minus x 1 and delta Subscript upper Y Baseline equals y 2 minus y 1 are the increments of the grid.

Last updated: December 09, 2022