This vignette provides an overview of how to use the functions in the catalytic package that focuses on COX Regression. The other catalytic vignettes go into other model-estimating functions.
The goal of the catalytic package is to build framework for catalytic prior distributions. Stabilizing high-dimensional working models by shrinking them towards simplified models. This is achieved by supplementing observed data with weighted synthetic data generated from a predictive distribution under the simpler model. For more information, see (Li and Huang 2023).
The two steps of using catalytic package for COX Regression are
Initialization: The cat_cox_initialization
function
constructs a cat_init
object based on the formula provided
by the user to generate synthetic data. The resulting
cat_init
object is tailored to facilitate further analysis,
and is integral for subsequent modeling steps in the
catalytic
package.
Choose Method(s): Users have the flexibility to choose from four
main functions within the catalytic package: cat_cox
,
cat_cox_tune
, cat_cox_bayes
, and
cat_cox_bayes_joint
. Each function serves a specific
purpose in modeling with catalytic priors and offers distinct
capabilities tailored to different modeling scenarios for COX
Regression. This approach enables users to seamlessly incorporate
synthetic data with varying weights from different method into COX
Regression analyses, providing flexibility and control over the modeling
process.
Creating a high-dimensional dataset with a low data size. This step
involves increasing the number of features (dimensions) while keeping
the number of observations (data size) relatively small. This is useful
for testing the performance of catalytic
models in
high-dimensional settings.
The pbc
dataset is loaded and split into training
(train_data
) and test (cox_cox_cat_test
)
datasets.
library(survival)
library(catalytic)
set.seed(1)
# Compute the Partial Likelihood for the Cox Proportional Hazards Model
get_cox_partial_likelihood <- function(X,
time,
status,
coefs,
entry_points) {
# Identify the risk set indices for Cox proportional hazards model
get_cox_risk_set_idx <- function(time_of_interest,
entry_vector,
time_vector,
status_vector) {
time_vector <- as.vector(time_vector)
entry_vector <- as.vector(entry_vector)
status_vector <- as.vector(status_vector)
# Find indices where subjects are at risk at the given time of interest
return(which((time_of_interest >= entry_vector) & (
(time_vector == time_of_interest & status_vector == 1) | (
time_vector + 1e-08 > time_of_interest))))
}
X <- as.matrix(X)
time <- as.vector(time)
status <- as.vector(status)
pl <- 0
# Calculate partial likelihood for each censored observation
for (i in which(status == 1)) {
risk_set_idx <- get_cox_risk_set_idx(
time_of_interest = time[i],
entry_vector = entry_points,
time_vector = time,
status_vector = status
)
# Compute the linear predictors for the risk set
exp_risk_lp <- exp(X[risk_set_idx, , drop = FALSE] %*% coefs)
# Update partial likelihood
pl <- pl + sum(X[i, , drop = FALSE] * coefs) - log(sum(exp_risk_lp))
}
return(pl)
}
# Load pbc dataset for Cox proportional hazards model
data("pbc")
pbc <- pbc[stats::complete.cases(pbc), 2:20] # Remove NA values and `id` column
pbc$trt <- ifelse(pbc$trt == 1, 1, 0) # Convert trt to binary
pbc$sex <- ifelse(pbc$sex == "f", 1, 0) # Convert sex to binary
pbc$edema2 <- ifelse(pbc$edema == 0.5, 1, 0) # Convert edema2 to binary
pbc$edema <- ifelse(pbc$edema == 1, 1, 0) # Convert edema to binary
pbc$time <- pbc$time / 365 # Convert time to years
pbc$status <- (pbc$status == 2) * 1 # Convert status to binary
# Identify columns to be standardized
not_standarlized_index <- c(1, 2, 3, 5, 6, 7, 8, 9, 20)
# Standardize the columns
pbc[, -not_standarlized_index] <- scale(pbc[, -not_standarlized_index])
# Seperate observation data into train and test data
train_n <- (ncol(pbc) - 2) * 5
train_idx <- sample(1:nrow(pbc), train_n)
train_data <- pbc[train_idx, ]
test_data <- pbc[-train_idx, ]
test_x <- test_data[, -which(names(test_data) %in% c("time", "status"))]
test_time <- test_data$time
test_status <- test_data$status
test_entry_points <- rep(0, nrow(test_x))
dim(train_data)
In this section, we explore the foundation steps of fitting a COX
regression model (GLM) using the survival::coxph
function
with the gaussian family.
We then proceed to evaluate model performance using the
catalytic::get_discrepancy
function, specifically
calculating the logarithmic error discrepancy. This involves comparing
the actual response (Y
) against the estimated response
(est_Y
) derived from the COX regression model fitted on the
train_data
.
Finally, we display the computed logarithmic error to assess the
model’s performance in predicting the response variable
mpg
.
# Calculate MPLE (Marginal Partial Likelihood Estimate) prediction score with 0 coefficients
MPLE_0 <- get_cox_partial_likelihood(
X = test_x,
time = test_time,
status = test_status,
coefs = rep(0, (ncol(test_x))),
entry_points = test_entry_points
)
# Fit a COX regression model
cox_model <- survival::coxph(
formula = survival::Surv(time, status) ~ .,
data = train_data
)
# Calculate MPLE (Marginal Partial Likelihood Estimate) prediction score of estimated coefficients minus 0 coefficients
cat(
"MLE COX Model - Predition Score:",
get_cox_partial_likelihood(
X = test_x,
time = test_time,
status = test_status,
coefs = coef(cox_model),
entry_points = test_entry_points
) - MPLE_0
)
catalytic
To initialize data for a Cox Proportional-Hazards Model using the
catalytic
package, the cat_cox_initialization
function is employed. This function facilitates the setup by preparing
synthetic data tailored for modeling purposes.
Here’s a breakdown of the parameters used:
formula
: Specifies the model formula, indicating the
time and status variables used for the Cox model. See
?survival::coxph
for details on formula syntax.
data
: Represents the original dataset, structured as
a data.frame
.
syn_size
: Determines the size of the synthetic data
generated, defaulted to four times the number of columns in the original
data.
hazard_constant
: A constant used in the hazard rate
calculation. Default is the sum of observed statuses divided by the mean
observed time divided by the number of observations.
entry_points
: An optional vector specifying the
entry points for the data. Default is a vector of zeros.
x_degree
: Determines the degree of the polynomial
terms for the predictors in the model, which affects the diversity or
complexity of the predictor terms included in the model. Defaulted to
NULL
, which means the degree for all predictors is
1.
resample_only
: Determines whether synthetic data is
exclusively generated by resampling from the original dataset. By
default, it is set to FALSE
, allowing the function to apply
various resampling strategies based on the characteristics of
dataset.
na_replace
: Specifies the method for handling
missing values in the dataset, defaulted to
stats::na.omit
.
...
(ellipsis): Allows users to specify additional
arguments that are passed directly to the simple model. This model is
used within the function to generate synthetic responses by fitting
observation data using survival::coxph
. By leveraging
...
, users can customize the behavior of
survival::coxph
within cat_cox_initialization
with additional options.
cat_init <- cat_cox_initialization(
formula = survival::Surv(time, status) ~ 1,
data = train_data,
syn_size = 100, # Default: 4 times the number of predictors
hazard_constant = 1, # Default: calculated from observed data
entry_points = rep(0, nrow(train_data)), # Default: Null, which is vector of zeros
x_degree = NULL, # Default: NULL, which means degrees of all predictors are 1
resample_only = FALSE, # Default: FALSE
na_replace = mean # Default: stats::na.omit
)
cat_init
Here shows how users can simplify the input for
cat_cox_initialization
. User do not have to specify
syn_size
and other parameters, as they have default values,
which mentioned above.
cat_init
object contains a list of attributes, which is
typically generated from the above function cat_cox_initialization
. These
attributes provide comprehensive information for subsequent modeling
tasks or for user checks.
Here’s a breakdown of all attributes except the input parameters:
function_name
: The name of this function, which is
cat_cox_initialization.
obs_size
: The number of observations (rows) in the
original dataset (obs_data
).
obs_data
: The original dataset
(data
).
obs_x
: The covariates from the original dataset
(obs_data
).
obs_time
: The time variable from the original
dataset (obs_data
).
obs_status
: The status variable from the original
dataset (obs_data
).
syn_size
: The size of synthetic data
generated.
syn_data
: The synthetic data created for modeling
purposes, based on the original dataset (obs_data
)
characteristics.
syn_x
: The covariates from the synthetic dataset
(syn_data
).
syn_time
: The time variable from the synthetic
dataset (syn_data
).
syn_status
: The status variable from the synthetic
dataset (syn_data
).
syn_x_resample_inform
: The information detailing the
process of resampling synthetic data.
size
: The total size of the combined dataset
(obs_size
and syn_size
).
data
: The combined dataset (obs_data
and syn_data
).
x
: The combined covariates (obs_x
and
syn_x
).
time
: The combined time variable
(obs_time
and syn_time
).
status
: The combined status variable
(obs_status
and syn_status
).
For more details, please check
?cat_cox_initialization
.
And of course, user can extract items mentioned above from
cat_cox_initialization
object.
The cat_cox
function fits a Generalized COX Model (GLM)
with a catalytic prior on the regression coefficients. It utilizes
information from the cat_init
object generated during the
initialization step, which includes both observed and synthetic data,
plus other relevant information.
The model is then fitted using the specified formula, family, and a
single tau
(synthetic data down-weight factor). The
resulting cat_cox
object encapsulates the fitted model,
including estimated coefficients and family information, facilitating
further analysis.
Here’s a breakdown of the parameters used:
To fit a Cox Proportional-Hazards Model using the
catalytic
package, the cat_cox
function is
employed. This function facilitates the setup by preparing synthetic
data tailored for modeling purposes.
Here’s a breakdown of the parameters used:
formula
: This parameter specifies the model formula
used in the Cox model. It defines the relationship between the response
variable and the predictors. Alternatively, besides using in format
survival:Surv(TIME, STATUS) ~ COVARIATES
, user can also use
~ COVARIATES
without specifying the response name, since
the response name is defined in the initialization step.
cat_init
: This parameter is essential and represents
the initialization object (cat_init
) created by using cat_cox_initialization
. It
contains both observed and synthetic data, plus other relevant
information, necessary for model fitting.
tau
: This parameter determines the down-weight
assigned to synthetic data relative to observed data. It influences the
influence of synthetic data in the model fitting process. If not
specified (NULL
), it defaults to a one forth of the number
of predictors.
method
: The method for fitting the Cox model, either
“CRE” (Catalytic-regularized Estimator) or “WME” (weighted Mixture
Estimator). Default is “CRE”.
init_coefficients
: Initial coefficient values before
iteration. Defaults to zero if not provided (if using
CRE
).
tol
: Convergence tolerance for iterative methods.
Default is 1e-5
(if using CRE
).
max_iter
: Maximum number of iterations allowed for
convergence. Default is 25
(if using
CRE
).
cat_cox_model <- cat_cox(
formula = survival::Surv(time, status) ~ ., # Same as `~ .`
cat_init = cat_init,
tau = 10, # Default: number of predictors
method = "WME", # Default: CRE
init_coefficients = NULL, # Default: all zeros
tol = 0.01, # Default: 1e-5
max_iter = 5 # Default: 25
)
cat_cox_model
Here shows how users can simplify the input for cat_cox
.
User do not have to specify tau
and so on, as
tau
and other parameters has their own default value ,
which mentioned above.
Let’s check the prediction score.
cat(
"Catalytic `cat_cox` - Predition Score:",
get_cox_partial_likelihood(
X = test_x,
time = test_time,
status = test_status,
coefs = coef(cat_cox_model),
entry_points = test_entry_points
) - MPLE_0
)
Both cat_cox_model
and cat_cox_model
objects are outputs from the cat_cox
function, providing a
list of attributes for further analysis or user inspection.
Here’s a breakdown of all attributes except the input parameters:
function_name
: The name of this function, which is
cat_cox.
coefficients
: The coefficients of the fitted Cox
model.
model
: The fitted survival model (only for “WME”
method).
iter
: The number of iterations performed (only for
“CRE” method).
iteration_log
: The collected coefficients for each
iteration (only for “CRE” method).
For more details, please check ?cat_cox
.
User can extract items mentioned above from cat_cox
object.
The cat_cox_tune
function fits a with a catalytic prior
on the regression coefficients and provides options for optimizing model
performance over a range of tau values(tau_seq
).
These methods empower users to fit and optimize models with catalytic priors, leveraging both observed and synthetic data to enhance model performance and robustness in various statistical analyses.
This method computes the partial likelihood across a specified range
of tau values (tau_seq
). It iterates through each tau
value, evaluating its performance based on cross-validation folds
(cross_validation_fold_num
) to select the optimal tau that
minimizes the discrepancy error.
Here’s a breakdown of the parameters used:
formula
, cat_init
and
method
are same as above.
tau_seq
: Vector of positive numeric values for
down-weighting synthetic data. Defaults to a sequence around the number
of predictors.
cross_validation_fold_num
: Number of folds for
cross-validation. Defaults to 5.
...
(ellipsis): Additional arguments passed to the
cat_cox
function for model fitting.
cat_cox_tune_model <- cat_cox_tune(
formula = survival::Surv(time, status) ~ ., # Same as `~ .`
cat_init = cat_init,
method = "CRE", # Default: "CRE"
tau_seq = seq(1, 5, 0.5), # Default is a numeric sequence around the number of predictors
cross_validation_fold_num = 3, # Default: 5
tol = 1 # Additional arguments passed to the `cat_cox` function
)
cat_cox_tune_model
User can plot the tau_seq (x) against discrepancy error (y) using the
plot()
function. This plot will show the lowest discrepancy
error at the optimal tau value.
Here shows how users can simplify the input for
cat_cox_tune
. User do not have to specify
tau_seq
or some other augments, as most of the augments has
default values, which mentioned above.
Let’s check the prediction score.
cat(
"Catalytic `cat_cox_tune` - Predition Score:",
get_cox_partial_likelihood(
X = test_x,
time = test_time,
status = test_status,
coefs = coef(cat_cox_tune_model),
entry_points = test_entry_points
) - MPLE_0
)
cat_cox_tune_model
and cat_cox_tune_model
objects are outputs from the cat_cox_tune
function,
providing a list of attributes for further analysis or user
inspection.
Here’s a breakdown of all attributes except the input parameters:
function_name
: The name of the function
(cat_cox_tune
).
likelihood_list
: The collected likelihood values for
each tau.
tau
: Selected optimal tau value from
tau_seq
that maximizes likelihood score.
model
: The fitted COX model object obtained with the
selected optimal tau (tau
).
coefficients
: The estimated coefficients from the
fitted COX model (model
).
For more details, please check ?cat_cox_tune
.
User can extract items mentioned above from cat_cox_tune
object.
Now, we will explore advanced Bayesian modeling techniques tailored
for COX using the catalytic
package. Bayesian inference
offers a powerful framework to estimate model parameters and quantify
uncertainty by integrating prior knowledge with observed data.
Below functions enable Bayesian inference for COX Regression Model
with catalytic priors. This function utilizes Markov chain Monte Carlo
(MCMC) methods, implemented using the rstan
package, to
sample from the posterior distribution of model parameters. Users can
specify various options such as the number of MCMC chains
(chains
), iterations (iter
) and warmup steps
(warmup
). User could also apply other attributes to
rstan::sampling
, like refresh
and
control
.
In this section, we explore Bayesian approaches using the
cat_cox_bayes
function from the catalytic
package. This function can fit a Bayesian Generalized COX Model (GLM)
using a fixed tau value. The MCMC sampling process will generate
posterior distributions for the coefficients based on the specified
tau.
Here’s a breakdown of the parameters used:
formula
, cat_init
and tau
are same as above.
chains
: Number of Markov chains to run during MCMC
sampling in rstan::sampling
. Defaults to 4.
iter
: Total number of iterations per chain in the
MCMC sampling process in rstan::sampling
. Defaults to
2000.
warmup
: Number of warmup iterations in the MCMC
sampling process in rstan::sampling
, discarded as burn-in.
Defaults to 1000.
hazard_beta
: Shape parameter for the Gamma
distribution in the hazard model. Defaults to 2.
...
(ellipsis): Denotes additional arguments that
can be passed directly to the underlying rstan::sampling
function used within cat_cox_bayes
to fit the Bayesian
model. These arguments allow for customization of the Bayesian fitting
process, such as control
, refresh
, or other
model-specific settings.
For more details, please refer to ?cat_cox_bayes
.
cat_cox_bayes_model <- cat_cox_bayes(
formula = survival::Surv(time, status) ~ ., # Same as `~ .
cat_init = cat_init,
tau = 1, # Default: NULL
chains = 1, # Default: 4
iter = 100, # Default: 2000
warmup = 50, # Default: 1000
hazard_beta = 2 # Default: 2
)
cat_cox_bayes_model
Here shows how users can simplify the input for
cat_cox_bayes
. User do not have to specify tau
and other attributes, as tau
and other attributes have
default value, which mentioned above. Here we assign lower value to
chains
, iter
and warmup
for
quicker processing time.
User can also get the traceplot of the rstan
model by
using traceplot()
directly into the output from
cat_cox_bayes
.
Plus, user can use this catlaytic::traceplot
just like
the rstan::traceplot
, user can add parameters used in
rstan::traceplot
, like include
and
inc_warmup
.
Let’s check the prediction score.
cat(
"Catalytic `cat_cox_bayes` - Predition Score:",
get_cox_partial_likelihood(
X = test_x,
time = test_time,
status = test_status,
coefs = coef(cat_cox_bayes_model),
entry_points = test_entry_points
) - MPLE_0
)
Both cat_cox_bayes_model
and
cat_cox_bayes_model
objects are outputs from the
cat_cox_bayes
function, providing a list of attributes for
further analysis or user inspection.
Here’s a breakdown of all attributes except the input parameters:
function_name
: The name of the function
(cat_cox_bayes
).
stan_data
: The data list used for
rstan::sampling
.
stan_model
: The rstan::stan_model
object used for Bayesian modeling.
stan_sample_model
: The result object obtained from
rstan::sampling
, encapsulating the MCMC sampling
results.
coefficients
: The mean estimated coefficients from
the Bayesian model, extracted from
rstan::summary(stan_sample_model)$summary
.
increment_cumulative_baseline_hazard
: The mean of
the increment in cumulative baseline hazard extracted from
rstan::summary(stan_sample_model)$summary
.
For more details, please refer to ?cat_cox_bayes
.
User can extract items mentioned above from
cat_cox_bayes
object.
In this section, we delve into Bayesian methodologies employing the
cat_cox_bayes_joint
function within the
catalytic
package. Unlike its non-adaptive counterpart
(cat_cox_bayes
), this method employs a joint tau prior
approach where tau is treated as a parameter within the MCMC sampling
process, improving the robustness and accuracy of parameter estimation
in Bayesian gaussian modeling.
In this section, we explore Bayesian approaches using the
cat_cox_bayes_joint
function from the
catalytic
package. These functions are similar to their
non-adaptive (non-joint) version (cat_cox_bayes
), but
corporate tau
into the MCMC sampling process.
Here’s a breakdown of the parameters used:
formula
and cat_init
are same as above.
chains
, iter
, warmup
and
hazard_beta
are same in section Bayesian Posterior Sampling with Fixed
Tau.
tau_alpha
: Alpha parameter controlling degrees of
freedom for distribution in the joint tau approach. Default is
2.
tau_gamma
: Gamma parameter in the joint tau
approach. Default is 1.
...
(ellipsis): Denotes additional arguments that can
be passed directly to the underlying rstan::sampling
function used within cat_cox_bayes_joint
to fit the
Bayesian model. These arguments allow for customization of the Bayesian
fitting process, such as control
, refresh
, or
other model-specific settings.
For more details, please refer to
?cat_cox_bayes_joint
.
cat_cox_bayes_joint_model <- cat_cox_bayes_joint(
formula = survival::Surv(time, status) ~ ., # Same as `~ .
cat_init = cat_init,
chains = 1, # Default: 4
iter = 100, # Default: 2000
warmup = 50, # Default: 1000
tau_alpha = 2, # Default: 2
tau_gamma = 1, # Default: 1
hazard_beta = 2 # Default: 2
)
cat_cox_bayes_joint_model
Here shows how users can simplify the input for
cat_cox_bayes_joint
. User do not have to specify
chains
and other attributes, as they have default values,
which mentioned above. Here we assign lower value to
chains
, iter
and warmup
for
quicker processing time.
User can also get the traceplot of the rstan
model by
using traceplot()
directly into the output from
cat_cox_bayes_joint
.
Like the traceplot
shown in the
cat_cox_bayes
function , user can add parameters used
in rstan::traceplot
, like include
and
inc_warmup
.
Let’s check the prediction score.
cat(
"Catalytic `cat_cox_bayes_joint` - Predition Score:",
get_cox_partial_likelihood(
X = test_x,
time = test_time,
status = test_status,
coefs = coef(cat_cox_bayes_joint_model),
entry_point = test_entry_points
) - MPLE_0
)
Both cat_cox_bayes_joint_model
and
cat_cox_bayes_joint_model
objects are outputs from the
cat_cox_bayes_joint
function, providing a list of
attributes for further analysis or user inspection.
Here’s a breakdown of all attributes except the input parameters:
function_name
: The name of the function
(cat_cox_bayes_joint
).
tau
: The estimated tau parameter from the MCMC
sampling rstan::sampling
, depending on the model
configuration.
stan_data
, stan_model
,
stan_sample_model
, coefficients
and
increment_cumulative_baseline_hazard
are same in section Bayesian Posterior Sampling with Fixed
Tau.
For more details, please refer to
?cat_cox_bayes_joint
.
User can extract items mentioned above from
cat_cox_bayes_joint
object.