The expectation-maximization (EM) algorithm, as described in Wang et al. (2016) and Zeng, Mao, and Lin (2016), can be used to fit certain types of proportional hazards models to interval-censored data.
Suppose that the observations to be analyzed consist of interval-censored outcomes ,
, where n is the number of subjects.
denotes a p-dimensional vector of covariates for the ith subject.
Assuming that there is no exact observation (), the full likelihood function is
where indicates whether the ith subject is left-censored (
),
indicates whether the ith subject is interval-censored (
), and
indicates whether the ith subject is right-censored (
).
Assume that the baseline hazard function is of the following form,
where are known functions that are nondecreasing and nonnegative, and
are nonnegative baseline parameters.
Let be a set of latent variables that follow Poisson distributions with means
. Let
be a set of latent variables that follow Poisson distributions with means
. Define
and
.
The full likelihood can be rewritten as
The complete-data likelihood is
where denotes the Poisson probability mass function for the variable V. It is straightforward to verify that the integration of
with respect to latent variables leads to the full likelihood
.
The EM algorithm proceeds as follows. Let the current parameter estimates be . Define
The expected complete-data log likelihood is computed as
The quantities and
are computed as follows,
where
Solve for
. It follows that the
can be updated as
The partial derivative of with respect to
is
After plugging in , you can update the parameters
by using the one-step Newton-Raphson method (Zeng, Mao, and Lin 2016).
The EM algorithm alternates between updating and updating
until convergence.
You can use the EM algorithm to fit the semiparametric model and the piecewise constant hazard model in PROC ICPHREG. The option is NLOPTIONS(TECH=EM) in the PROC ICPHREG statement.
A typical way that interval-censored data are generated is through a process of repeated assessments. Suppose that are a random sequence of assessment times. Denote
and
, where
.
For the ith subject, , let
,
,
,
, and
be the number of assessments, event time, assessment time vector, the indicator vector, and time-dependent covariate process, respectively. Suppose that
is interval-censored between two assessment times,
and
, where
and
. Let
be the sorted right boundaries of the Turnbull intervals for
.
Suppose that the time-dependent covariates process change value only at assessment times. Let
be the observed covariate vectors at times
.
Under the semiparametric model, the baseline cumulative hazard function is
For the ith subject, the cumulative hazard function is computed as
The full likelihood function is
where indicates whether the ith subject is left-censored (
),
indicates whether the ith subject is interval-censored (
), and
indicates whether the ith subject is right-censored (
).
As the following derivation shows, the EM algorithm can be adapted straightforwardly to fit the semiparametric model that contains time-dependent covariates.
Let , and redefine the latent Poisson variables as
The expected complete-data log likelihood becomes
where is a constant and
and
are computed as follows:
You use the ID statement to fit the semiparametric model that contains time-dependent covariates. The levels of the ID variable identify the subjects to be analyzed.
Let be the maximum likelihood estimates as found by the EM algorithm. Under suitable conditions, you can apply Louis’s method (Louis 1982) to obtain the covariance matrix of
.
The observed information matrix is computed as
and its inverse is the estimated covariance of
.
Louis’s method is the default method of calculating standard errors for the semiparametric model.
You can use the profile likelihood method of Murphy and Van Der Vaart (2000) to estimate the covariance matrix of . The profile log-likelihood function is defined as
where is the parameter space of
.
The Hessian matrix of can be computed using numerical differentiation. Let
be the identity vector for
, and let l be a small perturbation. The
th element of the Hessian matrix can be approximated by
The covariance matrix of is estimated by inverting the negative of the Hessian matrix.
You can use the profile likelihood method for the semiparametric model by specifying the PLVARIANCE option in the MODEL statement. But be aware that this computation is iterative and can consume a relatively large amount of CPU time.
Pan (1999) proposes using the iterative convex minorant (ICM) algorithm to fit semiparametric proportional hazards models to interval-censored data.
Define . Denote
and
. The full likelihood function can be rewritten in terms of
and the regression coefficients
.
Maximizing the likelihood with respect to is equivalent to maximizing it with respect to
. Because the
are naturally ordered, the optimization is subject to the following constraint:
Denote the log-likelihood function as . Because the regression coefficients
are not constrained, you can update them by using the one-step Newton-Raphson method as in the EM algorithm. Pan (1999) suggests using the ICM algorithm to update the baseline parameters
; doing so essentially treats
as fixed and maximizes the function
. Suppose that the maximum of
occurs at
. Mathematically, it can be proved that
equals the maximizer of the following quadratic function,
where ,
denotes the derivatives of
with respect to
, and
is a positive definite matrix of size
(Groeneboom and Wellner 1992).
The ICM algorithm updates as follows. For the dth iteration, the algorithm updates the quantity
where is the parameter estimate from the previous iteration and
is a positive definite diagonal matrix that depends on
. A convenient choice for
is the negative of the second-order derivative of the log-likelihood function
:
Given and
, the parameter estimate
maximizes the quadratic function
.
Define the cumulative sum diagram as a set of m points in the plane, where
and
Technically, equals the left derivative of the convex minorant, or in other words, the largest convex function below the diagram
. You can solve this optimization problem by using the pool-adjacent-violators algorithm (Groeneboom and Wellner 1992).
The EMICM algorithm combines the EM algorithm and the ICM algorithm by alternating the two different steps in its iterations. Whereas the EM step updates both the baseline parameters and the regression coefficients, the ICM step updates only the baseline parameters. If the ICM step does not increases the likelihood value, the parameter changes are halved for the next iteration. The process repeats a maximum of five times, until an increase in the likelihood value is found.
The EMICM algorithm is the default method of fitting the semiparametric model. You can use it to fit the piecewise constant hazard model by specifying the NLOPTIONS(TECH=EMICM) option in the PROC ICPHREG statement.