The ICPHREG Procedure

EM Algorithm and Extensions

The expectation-maximization (EM) algorithm, as described in Wang et al. (2016) and Zeng, Mao, and Lin (2016), can be used to fit certain types of proportional hazards models to interval-censored data.

Suppose that the observations to be analyzed consist of interval-censored outcomes $(L_i, R_i]$, $i = 1, \ldots, n$, where $n$ is the number of subjects. Let $\mathbf{Z}_i$ denote a $p$-dimensional vector of covariates for the $i$th subject.

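Assuming that there is no exact observation ($L_i < R_i$ for every subject), the full likelihood function is

\[ L(\boldsymbol{\theta}) = \prod_{i=1}^{n} \left\{ 1 - \exp\left[ -\Lambda_0(R_i) \exp(\mathbf{Z}_i'\boldsymbol{\beta}) \right] \right\}^{\Delta_{i1}} \left\{ \exp\left[ -\Lambda_0(L_i) \exp(\mathbf{Z}_i'\boldsymbol{\beta}) \right] - \exp\left[ -\Lambda_0(R_i) \exp(\mathbf{Z}_i'\boldsymbol{\beta}) \right] \right\}^{\Delta_{i2}} \left\{ \exp\left[ -\Lambda_0(L_i) \exp(\mathbf{Z}_i'\boldsymbol{\beta}) \right] \right\}^{\Delta_{i3}} \]

where $\Lambda_0(\cdot)$ is the baseline cumulative hazard function, $\Delta_{i1}$ indicates whether the $i$th subject is left-censored ($L_i = 0$), $\Delta_{i2}$ indicates whether the $i$th subject is interval-censored ($0 < L_i < R_i < \infty$), and $\Delta_{i3}$ indicates whether the $i$th subject is right-censored ($R_i = \infty$).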

Assume that the baseline cumulative hazard function is of the following form,

\[ \Lambda_0(t) = \sum_{k=1}^{K} \gamma_k b_k(t) \]

where the $b_k(\cdot)$ are known functions that are nondecreasing and nonnegative, and the $\gamma_k$ are nonnegative baseline parameters.

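Let $W_{ik}$ ($i = 1, \ldots, n$; $k = 1, \ldots, K$) be a set of latent variables that follow Poisson distributions with means $\gamma_k b_k(R_i) \exp(\mathbf{Z}_i'\boldsymbol{\beta})$. Let $U_{ik}$ be a set of latent variables that follow Poisson distributions with means $\gamma_k [b_k(R_i) - b_k(L_i)] \exp(\mathbf{Z}_i'\boldsymbol{\beta})$. Define $W_i = \sum_{k=1}^{K} W_{ik}$ and $U_i = \sum_{k=1}^{K} U_{ik}$.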

The full likelihood can be rewritten as

\[ L(\boldsymbol{\theta}) = \prod_{i=1}^{n} P(W_i > 0)^{\Delta_{i1}} \, P(W_i = 0, U_i > 0)^{\Delta_{i2}} \, P(W_i = 0, U_i = 0)^{\Delta_{i3}} \]

The complete-data likelihood is

\[ L_c(\boldsymbol{\theta}) = \prod_{i=1}^{n} \prod_{k=1}^{K} f_{W_{ik}}(W_{ik}) \, f_{U_{ik}}(U_{ik})^{\Delta_{i2} + \Delta_{i3}} \]

where $f_V(\cdot)$ denotes the Poisson probability mass function for the variable $V$. It is straightforward to verify that integrating $L_c(\boldsymbol{\theta})$ with respect to the latent variables leads to the full likelihood $L(\boldsymbol{\theta})$.

The EM algorithm proceeds as follows. Let the current parameter estimates be $\boldsymbol{\theta}^{(d)}$, and let $\mathcal{D}$ denote the observed data. Define

\[ \begin{aligned} w_{ik} &= E\left(W_{ik} \mid \mathcal{D}, \boldsymbol{\theta}^{(d)}\right) \\ u_{ik} &= E\left(U_{ik} \mid \mathcal{D}, \boldsymbol{\theta}^{(d)}\right) \\ w_{i} &= E\left(W_{i} \mid \mathcal{D}, \boldsymbol{\theta}^{(d)}\right) \\ u_{i} &= E\left(U_{i} \mid \mathcal{D}, \boldsymbol{\theta}^{(d)}\right) \end{aligned} \]

The expected complete-data log likelihood is computed as

where is a constant.
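As a cross-check, the following sketch writes out one form of the expected complete-data log likelihood that is consistent with the update for $\gamma_k$ and the score for $\boldsymbol{\beta}$ given later in this section. It is a reconstruction under the assumption that only terms involving $\boldsymbol{\theta}$ matter; the symbol $C$ is a hypothetical name for the term that does not depend on $\boldsymbol{\theta}$.

```latex
\[
Q(\boldsymbol{\theta}, \boldsymbol{\theta}^{(d)})
= \sum_{i=1}^{n} \sum_{k=1}^{K} \Big\{
    \big[ w_{ik} + (\Delta_{i2} + \Delta_{i3})\, u_{ik} \big]
    \big[ \log \gamma_k + \mathbf{Z}_i'\boldsymbol{\beta} \big]
    - \gamma_k \big[ (\Delta_{i1} + \Delta_{i2})\, b_k(R_i) + \Delta_{i3}\, b_k(L_i) \big]
      \exp(\mathbf{Z}_i'\boldsymbol{\beta})
  \Big\} + C
\]
```

Setting the derivative with respect to $\gamma_k$ to zero yields a closed-form update for $\gamma_k$, and differentiating with respect to $\boldsymbol{\beta}$ (using $\sum_k w_{ik} = w_i$ and $\sum_k u_{ik} = u_i$) yields a score that is a sum of per-subject terms multiplied by $\mathbf{Z}_i$.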

The quantities $w_{ik}$ and $u_{ik}$ are computed as follows,

\[ w_{ik} = \frac{\gamma_k^{(d)} b_k(R_i) \, w_i}{\Lambda_0^{(d)}(R_i)} \]

\[ u_{ik} = \frac{\gamma_k^{(d)} \left[ b_k(R_i) - b_k(L_i) \right] u_i}{\Lambda_0^{(d)}(R_i) - \Lambda_0^{(d)}(L_i)} \]

where

\[ w_i = \frac{\Lambda_0^{(d)}(R_i) \exp(\mathbf{Z}_i'\boldsymbol{\beta}^{(d)}) \, \Delta_{i1}}{1 - \exp\left[ -\Lambda_0^{(d)}(R_i) \exp(\mathbf{Z}_i'\boldsymbol{\beta}^{(d)}) \right]} \]

\[ u_i = \frac{\left[ \Lambda_0^{(d)}(R_i) - \Lambda_0^{(d)}(L_i) \right] \exp(\mathbf{Z}_i'\boldsymbol{\beta}^{(d)}) \, \Delta_{i2}}{1 - \exp\left\{ -\left[ \Lambda_0^{(d)}(R_i) - \Lambda_0^{(d)}(L_i) \right] \exp(\mathbf{Z}_i'\boldsymbol{\beta}^{(d)}) \right\}} \]
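The E-step quantities can be illustrated for a single subject with a few lines of Python. This is a minimal sketch, not the procedure's implementation: the function name `e_step`, the coding of left censoring as `L == 0` and right censoring as `R == math.inf`, and the argument layout are all assumptions made for illustration. The sketch treats $w_i$ and $u_i$ as means of zero-truncated Poisson variables, so the exponent in each denominator is negative.

```python
import math

def e_step(L, R, gamma, b_L, b_R, eta):
    """Return (w_i, u_i, w_ik, u_ik) for one subject.

    L, R  : interval endpoints; L == 0 means left-censored and
            R == math.inf means right-censored (assumed coding).
    gamma : current baseline parameters gamma_k^(d), assumed positive.
    b_L   : basis values b_k(L); b_R: basis values b_k(R) (unused if R is infinite).
    eta   : current linear predictor Z_i' beta^(d).
    """
    K = len(gamma)
    w_i, u_i = 0.0, 0.0
    w_ik, u_ik = [0.0] * K, [0.0] * K
    if L == 0 and math.isfinite(R):           # left-censored: Delta_i1 = 1
        parts = [g * b for g, b in zip(gamma, b_R)]
        lam = sum(parts)                      # Lambda_0(R_i)
        mu = lam * math.exp(eta)
        w_i = mu / -math.expm1(-mu)           # mu / (1 - exp(-mu))
        w_ik = [p * w_i / lam for p in parts]
    elif L > 0 and math.isfinite(R):          # interval-censored: Delta_i2 = 1
        parts = [g * (r - l) for g, l, r in zip(gamma, b_L, b_R)]
        lam = sum(parts)                      # Lambda_0(R_i) - Lambda_0(L_i)
        mu = lam * math.exp(eta)
        u_i = mu / -math.expm1(-mu)
        u_ik = [p * u_i / lam for p in parts]
    return w_i, u_i, w_ik, u_ik               # right-censored: all zeros
```

By construction the componentwise expectations sum to the subject-level ones, which is a convenient sanity check when you experiment with different basis functions.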

Solve $\partial Q(\boldsymbol{\theta}, \boldsymbol{\theta}^{(d)}) / \partial \gamma_k = 0$ for $\gamma_k$. It follows that the $\gamma_k$ can be updated as

\[ \gamma_k^{(d+1)} = \frac{\sum_{i=1}^{n} \left( w_{ik} + \Delta_{i2} u_{ik} \right)}{\sum_{i=1}^{n} \left[ (\Delta_{i1} + \Delta_{i2}) \, b_k(R_i) + \Delta_{i3} \, b_k(L_i) \right] \exp(\mathbf{Z}_i'\boldsymbol{\beta})} \]
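The closed-form update can be sketched for a single index $k$ as follows. The function name `update_gamma_k` and the list-based inputs are hypothetical, and the sketch assumes that the numerator sums the conditional expectations of the two latent variables, $w_{ik} + \Delta_{i2} u_{ik}$, and that exactly one of the three censoring indicators equals 1 for each subject.

```python
import math

def update_gamma_k(delta, b_L_k, b_R_k, w_k, u_k, eta):
    """One closed-form update of gamma_k.

    delta : list of (d1, d2, d3) censoring indicators, one tuple per subject.
    b_L_k : b_k(L_i) per subject; b_R_k : b_k(R_i) per subject
            (b_R_k is not read for right-censored subjects).
    w_k   : w_ik per subject; u_k : u_ik per subject.
    eta   : linear predictor Z_i' beta per subject.
    """
    num = 0.0
    den = 0.0
    for (d1, d2, d3), bl, br, w, u, e in zip(delta, b_L_k, b_R_k, w_k, u_k, eta):
        num += w + d2 * u
        # (d1 + d2) * b_k(R_i) + d3 * b_k(L_i), using that the indicators are exclusive
        basis = br if (d1 or d2) else bl
        den += basis * math.exp(e)
    return num / den
```

For example, one left-censored subject with $w_{ik} = 1.5$ and one right-censored subject, both with unit basis values and a zero linear predictor, give an updated value of 0.75.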

The partial derivative of $Q(\boldsymbol{\theta}, \boldsymbol{\theta}^{(d)})$ with respect to $\boldsymbol{\beta}$ is

\[ \frac{\partial Q(\boldsymbol{\theta}, \boldsymbol{\theta}^{(d)})}{\partial \boldsymbol{\beta}} = \sum_{i=1}^{n} \left\{ \left( w_i + (\Delta_{i2} + \Delta_{i3}) \, u_i \right) - \left[ (\Delta_{i1} + \Delta_{i2}) \, \Lambda_0^{(d)}(R_i) + \Delta_{i3} \, \Lambda_0^{(d)}(L_i) \right] \exp(\mathbf{Z}_i'\boldsymbol{\beta}) \right\} \mathbf{Z}_i \]

After plugging in $\gamma_k^{(d+1)}$, you can update the regression parameters $\boldsymbol{\beta}$ by using the one-step Newton-Raphson method (Zeng, Mao, and Lin 2016).

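The EM algorithm alternates between updating the conditional expectations (the E-step) and updating the parameter estimates $\boldsymbol{\theta}$ (the M-step) until convergence.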

You can use the EM algorithm to fit the semiparametric model and the piecewise constant hazard model in PROC ICPHREG. To request it, specify the NLOPTIONS(TECH=EM) option in the PROC ICPHREG statement.

Semiparametric Model and Time-Dependent Covariates

A typical way that interval-censored data are generated is through a process of repeated assessments. Suppose that are a random sequence of assessment times. Denote and , where .

For the ith subject, , let , , , , and be the number of assessments, event time, assessment time vector, the indicator vector, and time-dependent covariate process, respectively. Suppose that is interval-censored between two assessment times, and , where and . Let be the sorted right boundaries of the Turnbull intervals for .

Suppose that the time-dependent covariate process changes value only at assessment times. Let be the observed covariate vectors at times .

Under the semiparametric model, the baseline cumulative hazard function is

\[ \Lambda_0(t) = \sum_{j: s_j < t} \gamma_j, \quad j = 1, \ldots, J \]

For the ith subject, the cumulative hazard function is computed as

\[ \Lambda_i(t) = \sum_{j: s_j < t} \gamma_j \exp(\mathbf{Z}_{ij}'\boldsymbol{\beta}) \]
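The sum that defines the subject-specific cumulative hazard can be sketched directly in Python. The function name `cumulative_hazard` and its argument layout are assumptions for illustration; `z_rows[j]` stands for the covariate vector that is in effect for the subject at the support point `s[j]`.

```python
import math

def cumulative_hazard(t, s, gamma, z_rows, beta):
    """Lambda_i(t): sum of gamma_j * exp(Z_ij' beta) over support points s_j < t.

    s      : sorted support points s_1 < ... < s_J.
    gamma  : jump sizes gamma_1, ..., gamma_J.
    z_rows : covariate vectors Z_i1, ..., Z_iJ for one subject.
    beta   : regression coefficients.
    """
    total = 0.0
    for s_j, g_j, z_ij in zip(s, gamma, z_rows):
        if s_j < t:  # strict inequality, matching the definition above
            linear_predictor = sum(z * b for z, b in zip(z_ij, beta))
            total += g_j * math.exp(linear_predictor)
    return total
```

When all covariate vectors are equal, the result reduces to the baseline cumulative hazard multiplied by a single exponential factor, which is the model without time-dependent covariates.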

The full likelihood function is

\[ L(\boldsymbol{\theta}) = \prod_{i=1}^{n} \left\{ 1 - \exp\left[ -\Lambda_i(R_i) \right] \right\}^{\Delta_{i1}} \left\{ \exp\left[ -\Lambda_i(L_i) \right] - \exp\left[ -\Lambda_i(R_i) \right] \right\}^{\Delta_{i2}} \left\{ \exp\left[ -\Lambda_i(L_i) \right] \right\}^{\Delta_{i3}} \]

where $\Delta_{i1}$ indicates whether the $i$th subject is left-censored ($L_i = 0$), $\Delta_{i2}$ indicates whether the $i$th subject is interval-censored ($0 < L_i < R_i < \infty$), and $\Delta_{i3}$ indicates whether the $i$th subject is right-censored ($R_i = \infty$).
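The three kinds of likelihood contribution can be evaluated on the log scale as in the following sketch. The names `log_likelihood` and `cum_haz` are hypothetical; `cum_haz(i, t)` is assumed to return $\Lambda_i(t)$ for subject $i$, left censoring is assumed to be coded as `L == 0`, and right censoring as `R == math.inf`.

```python
import math

def log_likelihood(intervals, cum_haz):
    """Sum of log-likelihood contributions over subjects.

    intervals : list of (L, R) pairs, one per subject.
    cum_haz   : callable (i, t) -> Lambda_i(t), the subject's cumulative hazard.
    """
    total = 0.0
    for i, (L, R) in enumerate(intervals):
        if math.isinf(R):                      # right-censored: exp(-Lambda_i(L))
            total += -cum_haz(i, L)
        elif L == 0:                           # left-censored: 1 - exp(-Lambda_i(R))
            total += math.log(-math.expm1(-cum_haz(i, R)))
        else:                                  # interval-censored
            total += math.log(math.exp(-cum_haz(i, L)) - math.exp(-cum_haz(i, R)))
    return total
```

A quick consistency check is that the left-censored contribution for $(0, c]$ and the right-censored contribution for $(c, \infty)$ are probabilities that sum to 1 for the same subject.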

As the following derivation shows, the EM algorithm can be adapted straightforwardly to fit the semiparametric model that contains time-dependent covariates.

Let $b_j(t) = I(s_j < t)$, and redefine the latent Poisson variables $W_{ij}$ and $U_{ij}$ so that their means are

\[ E(W_{ij}) = \gamma_j \exp(\mathbf{Z}_{ij}'\boldsymbol{\beta}) \, b_j(R_i) = \gamma_j \exp(\mathbf{Z}_{ij}'\boldsymbol{\beta}) \, I(s_j < R_i) \]

\[ E(U_{ij}) = \gamma_j \exp(\mathbf{Z}_{ij}'\boldsymbol{\beta}) \left[ b_j(R_i) - b_j(L_i) \right] = \gamma_j \exp(\mathbf{Z}_{ij}'\boldsymbol{\beta}) \, I(L_i \le s_j < R_i) \]

The expected complete-data log likelihood becomes

where is a constant and and are computed as follows:

\[ w_{ij} = \frac{\gamma_j^{(d)} b_j(R_i) \exp(\mathbf{Z}_{ij}'\boldsymbol{\beta}^{(d)}) \, w_i}{\Lambda_i^{(d)}(R_i)} \]

\[ w_i = \frac{\Lambda_i^{(d)}(R_i) \, \Delta_{i1}}{1 - \exp\left[ -\Lambda_i^{(d)}(R_i) \right]} \]

\[ u_i = \frac{\left[ \Lambda_i^{(d)}(R_i) - \Lambda_i^{(d)}(L_i) \right] \Delta_{i2}}{1 - \exp\left\{ -\left[ \Lambda_i^{(d)}(R_i) - \Lambda_i^{(d)}(L_i) \right] \right\}} \]

You use the ID statement to fit the semiparametric model that contains time-dependent covariates. The levels of the ID variable identify the subjects to be analyzed.

Variance Estimation

Louis’s Method

Let $\hat{\boldsymbol{\theta}}$ be the maximum likelihood estimates as found by the EM algorithm. Under suitable conditions, you can apply Louis’s method (Louis 1982) to obtain the covariance matrix of $\hat{\boldsymbol{\theta}}$.

The observed information matrix is computed as

\[ I(\hat{\boldsymbol{\theta}}) = -\frac{\partial^2 Q(\boldsymbol{\theta}, \hat{\boldsymbol{\theta}})}{\partial \boldsymbol{\theta} \, \partial \boldsymbol{\theta}'} - \mathrm{cov}\left\{ \left. \frac{\partial \log L_c(\boldsymbol{\theta})}{\partial \boldsymbol{\theta}} \right|_{\boldsymbol{\theta} = \hat{\boldsymbol{\theta}}} \right\} \]

and its inverse is the estimated covariance of $\hat{\boldsymbol{\theta}}$.

Louis’s method is the default method of calculating standard errors for the semiparametric model.

Profile Likelihood Method

You can use the profile likelihood method of Murphy and Van Der Vaart (2000) to estimate the covariance matrix of $\hat{\boldsymbol{\beta}}$. The profile log-likelihood function is defined as

\[ \mathrm{PL}(\boldsymbol{\beta}) = \max_{\boldsymbol{\gamma} \in \mathcal{D}} \log L(\boldsymbol{\beta}, \boldsymbol{\gamma}) \]

where $\mathcal{D}$ is the parameter space of $\boldsymbol{\gamma}$.

The Hessian matrix of $\mathrm{PL}(\boldsymbol{\beta})$ can be computed using numerical differentiation. Let $\mathbf{e}_j$ be the unit vector whose $j$th element is 1 and whose other elements are 0, and let $l$ be a small perturbation. The $(j,k)$th element of the Hessian matrix can be approximated by

\[ H_{jk} = \frac{\mathrm{PL}(\boldsymbol{\beta}) - \mathrm{PL}(\boldsymbol{\beta} + l \cdot \mathbf{e}_j) - \mathrm{PL}(\boldsymbol{\beta} + l \cdot \mathbf{e}_k) + \mathrm{PL}(\boldsymbol{\beta} + l \cdot \mathbf{e}_j + l \cdot \mathbf{e}_k)}{l^2} \]

The covariance matrix of $\hat{\boldsymbol{\beta}}$ is estimated by inverting the negative of the Hessian matrix.
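The numerical differentiation can be sketched as follows. This is an illustration only: `hessian_fd` and the callable `pl` are hypothetical names, and the sketch assumes the standard forward-difference form in which the doubly perturbed term enters with a plus sign. In practice each call to `pl` would itself maximize the log likelihood over the baseline parameters, which is why the method can be expensive.

```python
def hessian_fd(pl, beta, step=1e-4):
    """Forward-difference Hessian of a scalar function pl at beta.

    pl   : callable taking a list of floats and returning a float.
    beta : point at which to approximate the Hessian.
    step : perturbation size (the quantity called l in the text).
    """
    p = len(beta)

    def shifted(*indices):
        # Copy beta and add `step` to each listed coordinate
        # (a repeated index is shifted twice).
        b = list(beta)
        for j in indices:
            b[j] += step
        return b

    base = pl(list(beta))
    single = [pl(shifted(j)) for j in range(p)]
    H = [[0.0] * p for _ in range(p)]
    for j in range(p):
        for k in range(j, p):
            both = pl(shifted(j, k))
            H[j][k] = H[k][j] = (base - single[j] - single[k] + both) / step ** 2
    return H
```

For a quadratic test function such as `lambda b: -(b[0]**2 + b[0]*b[1] + 2*b[1]**2)`, the approximation recovers the exact Hessian up to rounding error.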

You can use the profile likelihood method for the semiparametric model by specifying the PLVARIANCE option in the MODEL statement. But be aware that this computation is iterative and can consume a relatively large amount of CPU time.

EMICM Algorithm

Pan (1999) proposes using the iterative convex minorant (ICM) algorithm to fit semiparametric proportional hazards models to interval-censored data.

Define . Denote and . The full likelihood function can be rewritten in terms of and the regression coefficients .

Maximizing the likelihood with respect to $\boldsymbol{\gamma}$ is equivalent to maximizing it with respect to $\boldsymbol{\alpha}$. Because the $\alpha_j$ are naturally ordered, the optimization is subject to the following constraint:

\[ C = \left\{ \mathbf{x} = (\alpha_1, \ldots, \alpha_{J-1}) : 0 \le \alpha_1 \le \cdots \le \alpha_{J-1} \le 1 \right\} \]

Denote the log-likelihood function as $l(\boldsymbol{\alpha}, \boldsymbol{\beta})$. Because the regression coefficients $\boldsymbol{\beta}$ are not constrained, you can update them by using the one-step Newton-Raphson method as in the EM algorithm. Pan (1999) suggests using the ICM algorithm to update the baseline parameters $\boldsymbol{\alpha}$; doing so essentially treats $\boldsymbol{\beta}$ as fixed and maximizes the function $l(\boldsymbol{\alpha})$. Suppose that the maximum of $l(\boldsymbol{\alpha})$ occurs at $\hat{\boldsymbol{\alpha}}$. Mathematically, it can be proved that $\hat{\boldsymbol{\alpha}}$ equals the maximizer over $\mathbf{x} \in C$ of the following quadratic function,

\[ g^{*}(\mathbf{x} \mid \mathbf{y}, \mathbf{W}) = -\frac{1}{2} (\mathbf{x} - \mathbf{y})' \mathbf{W} (\mathbf{x} - \mathbf{y}) \]

where $\mathbf{y} = \hat{\boldsymbol{\alpha}} + \mathbf{W}^{-1} \nabla l(\hat{\boldsymbol{\alpha}})$, $\nabla l(\cdot)$ denotes the derivatives of $l(\cdot)$ with respect to $\boldsymbol{\alpha}$, and $\mathbf{W}$ is a positive definite matrix of size $(J-1) \times (J-1)$ (Groeneboom and Wellner 1992).

The ICM algorithm updates $\boldsymbol{\alpha}$ as follows. For the $d$th iteration, the algorithm updates the quantity

\[ \mathbf{y}^{(d)} = \hat{\boldsymbol{\alpha}}^{(d-1)} + \mathbf{W}^{-1}(\hat{\boldsymbol{\alpha}}^{(d-1)}) \, \nabla l(\hat{\boldsymbol{\alpha}}^{(d-1)}) \]

where $\hat{\boldsymbol{\alpha}}^{(d-1)}$ is the parameter estimate from the previous iteration and $\mathbf{W}(\hat{\boldsymbol{\alpha}}^{(d-1)})$ is a positive definite diagonal matrix that depends on $\hat{\boldsymbol{\alpha}}^{(d-1)}$. The step is added because $\mathbf{W}$ is positive definite and $l$ is being maximized. A convenient choice for the diagonal elements of $\mathbf{W}(\boldsymbol{\alpha})$ is the negative of the second-order derivative of the log-likelihood function $l(\boldsymbol{\alpha})$:

\[ w_j = w_j(\boldsymbol{\alpha}) = -\frac{\partial^2}{\partial \alpha_j^2} \, l(\boldsymbol{\alpha}) \]

Given $\mathbf{y} = \mathbf{y}^{(d)}$ and $\mathbf{W} = \mathbf{W}(\hat{\boldsymbol{\alpha}}^{(d-1)})$, the parameter estimate $\hat{\boldsymbol{\alpha}}^{(d)}$ maximizes the quadratic function $g^{*}(\mathbf{x} \mid \mathbf{y}, \mathbf{W})$.

Define the cumulative sum diagram $\{P_k\}$ as a set of m points in the plane, where $P_0 = (0, 0)$ and

\[ P_k = \left( \sum_{i=1}^{k} w_i, \; \sum_{i=1}^{k} w_i y_i^{(d)} \right) \]

Technically, $\hat{\boldsymbol{\alpha}}^{(d)}$ equals the left derivative of the convex minorant of the diagram, that is, of the largest convex function that lies below $\{P_k\}$. You can solve this optimization problem by using the pool-adjacent-violators algorithm (Groeneboom and Wellner 1992).
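The following sketch shows a generic pool-adjacent-violators routine and how one ICM update could use it. It is a minimal illustration, not the procedure's code. The function names `pava` and `icm_update` are hypothetical. The sketch takes the Newton-type target as the current estimate plus the gradient scaled by $1/w_j$, which is the ascent direction when $w_j$ is the negative second derivative, and it assumes that the bounds 0 and 1 can be enforced by clipping the isotonic fit.

```python
def pava(y, w):
    """Weighted nondecreasing (isotonic) least-squares fit to y with weights w."""
    blocks = []  # each block holds [weighted mean, total weight, number of points]
    for y_i, w_i in zip(y, w):
        blocks.append([y_i, w_i, 1])
        # Pool adjacent blocks while their means violate the ordering.
        while len(blocks) > 1 and blocks[-2][0] > blocks[-1][0]:
            m2, w2, n2 = blocks.pop()
            m1, w1, n1 = blocks.pop()
            total = w1 + w2
            blocks.append([(w1 * m1 + w2 * m2) / total, total, n1 + n2])
    fit = []
    for mean, _, count in blocks:
        fit.extend([mean] * count)
    return fit

def icm_update(alpha, grad, w):
    """One ICM update: form the target y, fit it isotonically, clip to [0, 1].

    alpha : current estimates; grad : gradient of the log likelihood at alpha;
    w     : positive diagonal weights (negative second derivatives).
    """
    y = [a + g / w_j for a, g, w_j in zip(alpha, grad, w)]
    return [min(1.0, max(0.0, v)) for v in pava(y, w)]
```

The pooled block means are the slopes of the greatest convex minorant of the cumulative sum diagram, so `pava([1, 3, 2, 4], [1, 1, 1, 1])` pools the two middle values into their average.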

The EMICM algorithm combines the EM algorithm and the ICM algorithm by alternating the two different steps in its iterations. Whereas the EM step updates both the baseline parameters and the regression coefficients, the ICM step updates only the baseline parameters. If the ICM step does not increase the likelihood value, the parameter changes are halved and the step is tried again. This halving is repeated at most five times or until an increase in the likelihood value is found.
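The step-halving safeguard can be sketched as follows. The function name `accept_with_halving` is hypothetical, and two details are assumptions: the full step is tried first and then up to five halved steps, and the current estimate is kept if no increase is found.

```python
def accept_with_halving(loglik, current, proposal, max_halvings=5):
    """Return an update that increases loglik, halving the step when needed.

    loglik   : callable taking a list of floats and returning the log likelihood.
    current  : current parameter values.
    proposal : values proposed by the ICM step.
    """
    base = loglik(current)
    step = [p - c for p, c in zip(proposal, current)]
    for _ in range(max_halvings + 1):   # full step, then up to five halvings
        candidate = [c + s for c, s in zip(current, step)]
        if loglik(candidate) > base:
            return candidate
        step = [0.5 * s for s in step]
    return list(current)                # assumed fallback: keep the current values
```

With a concave objective, halving shrinks an overshooting proposal back toward the current estimate until the objective improves.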

The EMICM algorithm is the default method of fitting the semiparametric model. You can use it to fit the piecewise constant hazard model by specifying the NLOPTIONS(TECH=EMICM) option in the PROC ICPHREG statement.

Last updated: March 08, 2022