The SURVEYPHREG Procedure

Notation and Estimation

Let $U = \{1, 2, \ldots, N\}$ be the set of indices and let $\mathcal{F}_N$ be the set of values for a finite population of size $N$. The survival time of each member of the finite population is assumed to follow its own hazard function, $\lambda_i(t)$, expressed as

$$\lambda_i(t) = \lambda(t; \mathbf{Z}_i(t)) = \lambda_0(t) \exp\left(\mathbf{Z}_i'(t)\boldsymbol{\beta}\right)$$

where $\lambda_0(t)$ is an arbitrary and unspecified baseline hazard function, $\mathbf{Z}_i(t)$ is the vector of explanatory variables for the $i$th unit at time $t$, and $\boldsymbol{\beta}$ is the vector of unknown regression parameters that are associated with the explanatory variables. The vector $\boldsymbol{\beta}$ is assumed to be the same for all individuals.
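As an illustration of the hazard model above, the following minimal Python sketch evaluates $\lambda_0(t) \exp(\mathbf{Z}_i'(t)\boldsymbol{\beta})$ at a single fixed time. The function name `hazard`, the parameter values, and the covariate values are hypothetical and are not part of PROC SURVEYPHREG.

```python
import math

def hazard(baseline_hazard, z, beta):
    """Return lambda_0(t) * exp(z' beta) for one unit at one fixed time t."""
    linear_predictor = sum(zk * bk for zk, bk in zip(z, beta))
    return baseline_hazard * math.exp(linear_predictor)

beta = [0.5, -0.2]   # hypothetical regression parameters
z_a = [1.0, 3.0]     # hypothetical covariates for unit a at time t
z_b = [0.0, 3.0]     # unit b differs from unit a only in the first covariate

# The baseline hazard cancels in the ratio, so any positive value works here.
hazard_ratio = hazard(0.01, z_a, beta) / hazard(0.01, z_b, beta)
```

Because the baseline hazard cancels, the ratio depends only on the covariate difference and $\boldsymbol{\beta}$, which is why the baseline hazard can remain unspecified when $\boldsymbol{\beta}$ is estimated.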

The partial likelihood function introduced by Cox (1972, 1975) eliminates the unknown baseline hazard $\lambda_0(t)$ and accounts for censored survival times. If the entire population is observed, then this partial likelihood can be used to estimate $\boldsymbol{\beta}$. Let $\boldsymbol{\beta}_N$ be the desired estimator. Assuming a working model with uncorrelated responses, $\boldsymbol{\beta}_N$ is obtained by maximizing the partial log likelihood,

$$l(\boldsymbol{\beta}) = \sum_{i \in U} \log\left\{ L(\boldsymbol{\beta}; \mathbf{Z}_i(t), t_i) \right\}$$

with respect to $\boldsymbol{\beta}$, where $L(\boldsymbol{\beta}; \mathbf{Z}_i(t), t_i)$ is Cox’s partial likelihood function.

Assume that a probability sample $A$ is selected from the finite population $U$ and that $\pi_i$ is the selection probability for unit $i$. Further assume that the covariates $\mathbf{Z}_i(t)$ and the survival time $t_i$ are available for every unit in the sample $A$. An estimator of the finite population log likelihood is

$$l_\pi(\boldsymbol{\beta}) = \sum_{i \in A} \pi_i^{-1} \log\left\{ L(\boldsymbol{\beta}; \mathbf{Z}_i(t), t_i) \right\}$$

See Partial Likelihood Function for the Cox Model for more details.
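The weighted estimator above can be sketched numerically. This section does not spell out the form of $L$, so the Python sketch below assumes a Breslow-type partial likelihood with time-fixed covariates and no tied event times, in which the risk-set sums are also weighted by $\pi_j^{-1}$. The function names and the data are hypothetical; this is not the procedure's implementation.

```python
import math

def linear_predictor(z, beta):
    """Return z' beta for one unit."""
    return sum(zk * bk for zk, bk in zip(z, beta))

def pseudo_log_partial_likelihood(beta, weights, times, events, covariates):
    """Weighted Breslow-type partial log likelihood (assumed form, no ties)."""
    total = 0.0
    for i, t_i in enumerate(times):
        if not events[i]:
            continue  # censored units contribute only through the risk sets
        # Weighted sum over the sampled units that are still at risk at t_i.
        risk_sum = sum(
            w_j * math.exp(linear_predictor(z_j, beta))
            for w_j, t_j, z_j in zip(weights, times, covariates)
            if t_j >= t_i
        )
        total += weights[i] * (
            linear_predictor(covariates[i], beta) - math.log(risk_sum)
        )
    return total

# Hypothetical sample: the weights are inverse selection probabilities 1 / pi_i.
selection_probs = [0.5, 1.0, 1.0]
weights = [1.0 / p for p in selection_probs]
times = [1.0, 2.0, 3.0]
events = [1, 1, 0]                   # 1 = event observed, 0 = censored
covariates = [[0.0], [1.0], [0.0]]   # one time-fixed covariate per unit

value_at_zero = pseudo_log_partial_likelihood(
    [0.0], weights, times, events, covariates
)
```

Setting every weight to 1 reduces the function to the ordinary unweighted partial log likelihood for the same data.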

A sample-based estimator $\hat{\boldsymbol{\beta}}$ for the finite population quantity $\boldsymbol{\beta}_N$ can be obtained by maximizing the partial pseudo-log-likelihood $l_\pi(\boldsymbol{\beta}; \mathbf{Z}_i(t), t_i)$ with respect to $\boldsymbol{\beta}$. The design-based variance for $\hat{\boldsymbol{\beta}}$ is obtained by treating the set of finite population values $\mathcal{F}_N$ as fixed. For more information about maximum pseudo-likelihood estimators and other inferential approaches for survey data, see Kish and Frankel (1974); Godambe and Thompson (1986); Pfeffermann (1993); Korn and Graubard (1999, chapter 3); Chambers and Skinner (2003, chapter 2); and Fuller (2009, section 6.5). Maximum pseudo-likelihood estimators and their properties for Cox’s proportional hazards model for survey data are discussed in Binder (1990, 1992); Lin and Wei (1989); Lin (2000); and Boudreau and Lawless (2006).

Without loss of generality, the rest of this section uses indices for stratified clustered designs. For a stratified clustered sample design, observations are represented by a matrix

$$(\mathbf{w}, \mathbf{t}, \boldsymbol{\Delta}, \mathbf{Z}) = (w_{hij}, t_{hij}, \Delta_{hij}, \mathbf{z}_{hij})$$

where

  • $\mathbf{w}$ denotes the vector of sampling weights

  • $\mathbf{t}$ denotes the event time variable

  • $\boldsymbol{\Delta}$ denotes the event indicator

  • $\mathbf{Z}$ denotes the $n \times p$ matrix of auxiliary information

  • $h = 1, 2, \ldots, H$ is the stratum index

  • $i = 1, 2, \ldots, n_h$ is the cluster index within stratum $h$

  • $j = 1, 2, \ldots, m_{hi}$ is the unit index within cluster $i$ of stratum $h$

  • $p$ is the total number of parameters

  • $n = \sum_{h=1}^{H} \sum_{i=1}^{n_h} m_{hi}$ is the total number of observations in the sample

  • $y_{hij}(t) = I(t_{hij} \ge t)$, where $I(\cdot)$ is an indicator function

  • $n_{hij}(t) = I(t_{hij} \le t)$, where $I(\cdot)$ is an indicator function
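The indexing and the two indicator functions in the list above can be sketched in Python as follows. The nested dictionary layout, the function names, and the event times are hypothetical and serve only to make the $(h, i, j)$ indexing concrete.

```python
# Hypothetical sample: {stratum h: {cluster i: [times t_hij for units j]}}
sample = {
    1: {1: [2.0, 5.0], 2: [3.5]},
    2: {1: [1.0, 4.0, 6.0]},
}

def total_observations(sample):
    """n = sum over strata h and clusters i of the cluster sizes m_hi."""
    return sum(
        len(units) for clusters in sample.values() for units in clusters.values()
    )

def at_risk(t_hij, t):
    """y_hij(t) = I(t_hij >= t)."""
    return 1 if t_hij >= t else 0

def counting_indicator(t_hij, t):
    """n_hij(t) = I(t_hij <= t)."""
    return 1 if t_hij <= t else 0

# Flatten the (h, i, j) structure into a single list of times.
all_times = [
    t_hij
    for clusters in sample.values()
    for units in clusters.values()
    for t_hij in units
]

n = total_observations(sample)
risk_set_size = sum(at_risk(t_hij, 4.0) for t_hij in all_times)
count_by_time = sum(counting_indicator(t_hij, 4.0) for t_hij in all_times)
```

A unit whose time equals $t$ exactly satisfies both inequalities, so it is counted by both indicators at that time.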

Let $\sum_B = \sum_{(h,i,j) \in B}$ denote the summation over the set of indices such that the observation unit $j$ in PSU $i$ and stratum $h$ belongs to the index set $B$. Typically, $B$ is the set of all population indices that are in the sample, the risk set, or the set of all units with a failure.

The first-stage sampling rate (the fraction of PSUs selected for the sample) is denoted by $f_h$. The first-stage sampling rate is used in Taylor series or bootstrap variance estimation. You can specify the stratum sampling rates with the RATE= option. Alternatively, if you specify population totals with the TOTAL= option, PROC SURVEYPHREG computes $f_h$ as the ratio of the stratum sample size to the stratum total, in terms of PSUs. See the section Population Totals and Sampling Rates for details. If you specify neither the RATE= option nor the TOTAL= option, then the procedure assumes that the stratum sampling rates $f_h$ are negligible and does not use a finite population correction when computing variances.
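The three cases described above can be sketched in Python. The function names and arguments are hypothetical and do not mirror the procedure's internals. The factor $1 - f_h$ is the usual form of a finite population correction and is stated here as an assumption, because this section does not give the variance formula.

```python
def sampling_rate(n_psus_sampled, n_psus_total=None, rate=None):
    """Return the first-stage sampling rate f_h for one stratum.

    rate         -- f_h given directly (the role played by RATE=)
    n_psus_total -- stratum total in PSUs (the role played by TOTAL=)
    With neither value supplied, f_h is treated as negligible (0).
    """
    if rate is not None:
        return rate
    if n_psus_total is not None:
        return n_psus_sampled / n_psus_total
    return 0.0

def finite_population_correction(f_h):
    """Assumed correction factor 1 - f_h; equals 1 when f_h is negligible."""
    return 1.0 - f_h

# Hypothetical stratum: 10 PSUs sampled out of 200 in the population.
f_from_total = sampling_rate(10, n_psus_total=200)
f_default = sampling_rate(10)
fpc_from_total = finite_population_correction(f_from_total)
fpc_default = finite_population_correction(f_default)
```

With no rate or total supplied, the correction factor is 1, which matches the statement that no finite population correction is applied in that case.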

Last updated: December 09, 2022