catalytic_cox

Yitong Wu ywu039@e.ntu.edu.sg

Introduction

This vignette provides an overview of how to use the functions in the catalytic package that focuses on COX Regression. The other catalytic vignettes go into other model-estimating functions.

The goal of the catalytic package is to build framework for catalytic prior distributions. Stabilizing high-dimensional working models by shrinking them towards simplified models. This is achieved by supplementing observed data with weighted synthetic data generated from a predictive distribution under the simpler model. For more information, see (Li and Huang 2023).

The two steps of using catalytic package for COX Regression are

  1. Initialization: The cat_cox_initialization function constructs a cat_init object based on the formula provided by the user to generate synthetic data. The resulting cat_init object is tailored to facilitate further analysis, and is integral for subsequent modeling steps in the catalytic package.

  2. Choose Method(s): Users have the flexibility to choose from four main functions within the catalytic package: cat_cox, cat_cox_tune, cat_cox_bayes, and cat_cox_bayes_joint. Each function serves a specific purpose in modeling with catalytic priors and offers distinct capabilities tailored to different modeling scenarios for COX Regression. This approach enables users to seamlessly incorporate synthetic data with varying weights from different method into COX Regression analyses, providing flexibility and control over the modeling process.

Data Preparation

Creating a high-dimensional dataset with a low data size. This step involves increasing the number of features (dimensions) while keeping the number of observations (data size) relatively small. This is useful for testing the performance of catalytic models in high-dimensional settings.

The pbc dataset is loaded and split into training (train_data) and test (cox_cox_cat_test) datasets.

library(survival)
library(catalytic)

set.seed(1)

# Compute the Partial Likelihood for the Cox Proportional Hazards Model
get_cox_partial_likelihood <- function(X,
                                       time,
                                       status,
                                       coefs,
                                       entry_points) {
  # Identify the risk set indices for Cox proportional hazards model
  get_cox_risk_set_idx <- function(time_of_interest,
                                   entry_vector,
                                   time_vector,
                                   status_vector) {
    time_vector <- as.vector(time_vector)
    entry_vector <- as.vector(entry_vector)
    status_vector <- as.vector(status_vector)

    # Find indices where subjects are at risk at the given time of interest
    return(which((time_of_interest >= entry_vector) & (
      (time_vector == time_of_interest & status_vector == 1) | (
        time_vector + 1e-08 > time_of_interest))))
  }

  X <- as.matrix(X)
  time <- as.vector(time)
  status <- as.vector(status)

  pl <- 0

  # Calculate partial likelihood for each censored observation
  for (i in which(status == 1)) {
    risk_set_idx <- get_cox_risk_set_idx(
      time_of_interest = time[i],
      entry_vector = entry_points,
      time_vector = time,
      status_vector = status
    )

    # Compute the linear predictors for the risk set
    exp_risk_lp <- exp(X[risk_set_idx, , drop = FALSE] %*% coefs)

    # Update partial likelihood
    pl <- pl + sum(X[i, , drop = FALSE] * coefs) - log(sum(exp_risk_lp))
  }

  return(pl)
}

# Load pbc dataset for Cox proportional hazards model
data("pbc")

pbc <- pbc[stats::complete.cases(pbc), 2:20] # Remove NA values and `id` column

pbc$trt <- ifelse(pbc$trt == 1, 1, 0) # Convert trt to binary
pbc$sex <- ifelse(pbc$sex == "f", 1, 0) # Convert sex to binary
pbc$edema2 <- ifelse(pbc$edema == 0.5, 1, 0) # Convert edema2 to binary
pbc$edema <- ifelse(pbc$edema == 1, 1, 0) # Convert edema to binary
pbc$time <- pbc$time / 365 # Convert time to years
pbc$status <- (pbc$status == 2) * 1 # Convert status to binary

# Identify columns to be standardized
not_standarlized_index <- c(1, 2, 3, 5, 6, 7, 8, 9, 20)

# Standardize the columns
pbc[, -not_standarlized_index] <- scale(pbc[, -not_standarlized_index])

# Seperate observation data into train and test data
train_n <- (ncol(pbc) - 2) * 5
train_idx <- sample(1:nrow(pbc), train_n)
train_data <- pbc[train_idx, ]
test_data <- pbc[-train_idx, ]

test_x <- test_data[, -which(names(test_data) %in% c("time", "status"))]
test_time <- test_data$time
test_status <- test_data$status
test_entry_points <- rep(0, nrow(test_x))

dim(train_data)

In this section, we explore the foundation steps of fitting a COX regression model (GLM) using the survival::coxph function with the gaussian family.

We then proceed to evaluate model performance using the catalytic::get_discrepancy function, specifically calculating the logarithmic error discrepancy. This involves comparing the actual response (Y) against the estimated response (est_Y) derived from the COX regression model fitted on the train_data.

Finally, we display the computed logarithmic error to assess the model’s performance in predicting the response variable mpg.

# Calculate MPLE (Marginal Partial Likelihood Estimate) prediction score with 0 coefficients
MPLE_0 <- get_cox_partial_likelihood(
  X = test_x,
  time = test_time,
  status = test_status,
  coefs = rep(0, (ncol(test_x))),
  entry_points = test_entry_points
)

# Fit a COX regression model
cox_model <- survival::coxph(
  formula = survival::Surv(time, status) ~ .,
  data = train_data
)

# Calculate MPLE (Marginal Partial Likelihood Estimate) prediction score of estimated coefficients minus 0 coefficients
cat(
  "MLE COX Model - Predition Score:",
  get_cox_partial_likelihood(
    X = test_x,
    time = test_time,
    status = test_status,
    coefs = coef(cox_model),
    entry_points = test_entry_points
  ) - MPLE_0
)

Usage of catalytic

Step 1: Initialization

To initialize data for a Cox Proportional-Hazards Model using the catalytic package, the cat_cox_initialization function is employed. This function facilitates the setup by preparing synthetic data tailored for modeling purposes.

Here’s a breakdown of the parameters used:

cat_init <- cat_cox_initialization(
  formula = survival::Surv(time, status) ~ 1,
  data = train_data,
  syn_size = 100, # Default: 4 times the number of predictors
  hazard_constant = 1, # Default: calculated from observed data
  entry_points = rep(0, nrow(train_data)), # Default: Null, which is vector of zeros
  x_degree = NULL, # Default: NULL, which means degrees of all predictors are 1
  resample_only = FALSE, # Default: FALSE
  na_replace = mean # Default: stats::na.omit
)

cat_init

Here shows how users can simplify the input for cat_cox_initialization. User do not have to specify syn_size and other parameters, as they have default values, which mentioned above.

cat_init object contains a list of attributes, which is typically generated from the above function cat_cox_initialization. These attributes provide comprehensive information for subsequent modeling tasks or for user checks.

Here’s a breakdown of all attributes except the input parameters:

For more details, please check ?cat_cox_initialization.

names(cat_init)

And of course, user can extract items mentioned above from cat_cox_initialization object.

# The number of observations (rows) in the original dataset (`obs_data`)
cat_init$obs_size

# The information detailing the process of resampling synthetic data
cat_init$syn_x_resample_inform

Step 2.1: Choose Method(s) - Estimation with Fixed tau

The cat_cox function fits a Generalized COX Model (GLM) with a catalytic prior on the regression coefficients. It utilizes information from the cat_init object generated during the initialization step, which includes both observed and synthetic data, plus other relevant information.

The model is then fitted using the specified formula, family, and a single tau(synthetic data down-weight factor). The resulting cat_cox object encapsulates the fitted model, including estimated coefficients and family information, facilitating further analysis.

Here’s a breakdown of the parameters used:

To fit a Cox Proportional-Hazards Model using the catalytic package, the cat_cox function is employed. This function facilitates the setup by preparing synthetic data tailored for modeling purposes.

Here’s a breakdown of the parameters used:

cat_cox_model <- cat_cox(
  formula = survival::Surv(time, status) ~ ., # Same as `~ .`
  cat_init = cat_init,
  tau = 10, # Default: number of predictors
  method = "WME", # Default: CRE
  init_coefficients = NULL, # Default: all zeros
  tol = 0.01, # Default: 1e-5
  max_iter = 5 # Default: 25
)

cat_cox_model

Here shows how users can simplify the input for cat_cox. User do not have to specify tau and so on, as tau and other parameters has their own default value , which mentioned above.

Let’s check the prediction score.

cat(
  "Catalytic `cat_cox` - Predition Score:",
  get_cox_partial_likelihood(
    X = test_x,
    time = test_time,
    status = test_status,
    coefs = coef(cat_cox_model),
    entry_points = test_entry_points
  ) - MPLE_0
)

Both cat_cox_model and cat_cox_model objects are outputs from the cat_cox function, providing a list of attributes for further analysis or user inspection.

Here’s a breakdown of all attributes except the input parameters:

For more details, please check ?cat_cox.

names(cat_cox_model)

User can extract items mentioned above from cat_cox object.

# The formula used for modeling
cat_cox_model$formula

# The fitted model object obtained from `survival::coxph` with `tau`
cat_cox_model$coefficients

Step 2.2: Choose Method(s) - Estimation with Selective tau

The cat_cox_tune function fits a with a catalytic prior on the regression coefficients and provides options for optimizing model performance over a range of tau values(tau_seq).

These methods empower users to fit and optimize models with catalytic priors, leveraging both observed and synthetic data to enhance model performance and robustness in various statistical analyses.

Cross-validation (risk_estimate_method = “cross_validation”)

This method computes the partial likelihood across a specified range of tau values (tau_seq). It iterates through each tau value, evaluating its performance based on cross-validation folds (cross_validation_fold_num) to select the optimal tau that minimizes the discrepancy error.

Here’s a breakdown of the parameters used:

  • formula, cat_init and method are same as above.

  • tau_seq: Vector of positive numeric values for down-weighting synthetic data. Defaults to a sequence around the number of predictors.

  • cross_validation_fold_num: Number of folds for cross-validation. Defaults to 5.

  • ... (ellipsis): Additional arguments passed to the cat_cox function for model fitting.

cat_cox_tune_model <- cat_cox_tune(
  formula = survival::Surv(time, status) ~ ., # Same as `~ .`
  cat_init = cat_init,
  method = "CRE", # Default: "CRE"
  tau_seq = seq(1, 5, 0.5), # Default is a numeric sequence around the number of predictors
  cross_validation_fold_num = 3, # Default: 5
  tol = 1 # Additional arguments passed to the `cat_cox` function
)

cat_cox_tune_model

User can plot the tau_seq (x) against discrepancy error (y) using the plot() function. This plot will show the lowest discrepancy error at the optimal tau value.

plot(cat_cox_tune_model, text_pos = 2, legend_pos = "bottomright")

Here shows how users can simplify the input for cat_cox_tune. User do not have to specify tau_seq or some other augments, as most of the augments has default values, which mentioned above.

Let’s check the prediction score.

cat(
  "Catalytic `cat_cox_tune` - Predition Score:",
  get_cox_partial_likelihood(
    X = test_x,
    time = test_time,
    status = test_status,
    coefs = coef(cat_cox_tune_model),
    entry_points = test_entry_points
  ) - MPLE_0
)

cat_cox_tune_model and cat_cox_tune_model objects are outputs from the cat_cox_tune function, providing a list of attributes for further analysis or user inspection.

Here’s a breakdown of all attributes except the input parameters:

  • function_name: The name of the function (cat_cox_tune).

  • likelihood_list: The collected likelihood values for each tau.

  • tau: Selected optimal tau value from tau_seq that maximizes likelihood score.

  • model: The fitted COX model object obtained with the selected optimal tau (tau).

  • coefficients: The estimated coefficients from the fitted COX model (model).

For more details, please check ?cat_cox_tune.

names(cat_cox_tune_model)

User can extract items mentioned above from cat_cox_tune object.

# The estimated coefficients from the fitted model
cat_cox_tune_model$coefficients

# Selected optimal tau value from `tau_seq` that maximize the likelihood
cat_cox_tune_model$tau

Step 2.3: Bayesian Posterior Sampling with Fixed tau

Now, we will explore advanced Bayesian modeling techniques tailored for COX using the catalytic package. Bayesian inference offers a powerful framework to estimate model parameters and quantify uncertainty by integrating prior knowledge with observed data.

Below functions enable Bayesian inference for COX Regression Model with catalytic priors. This function utilizes Markov chain Monte Carlo (MCMC) methods, implemented using the rstan package, to sample from the posterior distribution of model parameters. Users can specify various options such as the number of MCMC chains (chains), iterations (iter) and warmup steps (warmup). User could also apply other attributes to rstan::sampling, like refresh and control.

In this section, we explore Bayesian approaches using the cat_cox_bayes function from the catalytic package. This function can fit a Bayesian Generalized COX Model (GLM) using a fixed tau value. The MCMC sampling process will generate posterior distributions for the coefficients based on the specified tau.

Here’s a breakdown of the parameters used:

For more details, please refer to ?cat_cox_bayes.

cat_cox_bayes_model <- cat_cox_bayes(
  formula = survival::Surv(time, status) ~ ., # Same as `~ .
  cat_init = cat_init,
  tau = 1, # Default: NULL
  chains = 1, # Default: 4
  iter = 100, # Default: 2000
  warmup = 50, # Default: 1000
  hazard_beta = 2 # Default: 2
)

cat_cox_bayes_model

Here shows how users can simplify the input for cat_cox_bayes. User do not have to specify tau and other attributes, as tau and other attributes have default value, which mentioned above. Here we assign lower value to chains, iter and warmup for quicker processing time.

User can also get the traceplot of the rstan model by using traceplot() directly into the output from cat_cox_bayes.

traceplot(cat_cox_bayes_model)

Plus, user can use this catlaytic::traceplot just like the rstan::traceplot, user can add parameters used in rstan::traceplot, like include and inc_warmup.

traceplot(cat_cox_bayes_model, inc_warmup = TRUE)

Let’s check the prediction score.

cat(
  "Catalytic `cat_cox_bayes` - Predition Score:",
  get_cox_partial_likelihood(
    X = test_x,
    time = test_time,
    status = test_status,
    coefs = coef(cat_cox_bayes_model),
    entry_points = test_entry_points
  ) - MPLE_0
)

Both cat_cox_bayes_model and cat_cox_bayes_model objects are outputs from the cat_cox_bayes function, providing a list of attributes for further analysis or user inspection.

Here’s a breakdown of all attributes except the input parameters:

For more details, please refer to ?cat_cox_bayes.

names(cat_cox_bayes_model)

User can extract items mentioned above from cat_cox_bayes object.

# The number of Markov chains used during MCMC sampling in `rstan`.
cat_cox_bayes_model$chain

# The mean estimated coefficients from the Bayesian model
cat_cox_bayes_model$coefficients

2.4 Bayesian Posterior Sampling with Adaptive Tau

In this section, we delve into Bayesian methodologies employing the cat_cox_bayes_joint function within the catalytic package. Unlike its non-adaptive counterpart (cat_cox_bayes), this method employs a joint tau prior approach where tau is treated as a parameter within the MCMC sampling process, improving the robustness and accuracy of parameter estimation in Bayesian gaussian modeling.

In this section, we explore Bayesian approaches using the cat_cox_bayes_joint function from the catalytic package. These functions are similar to their non-adaptive (non-joint) version (cat_cox_bayes), but corporate tau into the MCMC sampling process.

Here’s a breakdown of the parameters used:

For more details, please refer to ?cat_cox_bayes_joint.

cat_cox_bayes_joint_model <- cat_cox_bayes_joint(
  formula = survival::Surv(time, status) ~ ., # Same as `~ .
  cat_init = cat_init,
  chains = 1, # Default: 4
  iter = 100, # Default: 2000
  warmup = 50, # Default: 1000
  tau_alpha = 2, # Default: 2
  tau_gamma = 1, # Default: 1
  hazard_beta = 2 # Default: 2
)

cat_cox_bayes_joint_model

Here shows how users can simplify the input for cat_cox_bayes_joint. User do not have to specify chains and other attributes, as they have default values, which mentioned above. Here we assign lower value to chains, iter and warmup for quicker processing time.

User can also get the traceplot of the rstan model by using traceplot() directly into the output from cat_cox_bayes_joint.

traceplot(cat_cox_bayes_joint_model)

Like the traceplot shown in the cat_cox_bayes function , user can add parameters used in rstan::traceplot, like include and inc_warmup.

traceplot(cat_cox_bayes_joint_model, inc_warmup = TRUE)

Let’s check the prediction score.

cat(
  "Catalytic `cat_cox_bayes_joint` - Predition Score:",
  get_cox_partial_likelihood(
    X = test_x,
    time = test_time,
    status = test_status,
    coefs = coef(cat_cox_bayes_joint_model),
    entry_point = test_entry_points
  ) - MPLE_0
)

Both cat_cox_bayes_joint_model and cat_cox_bayes_joint_model objects are outputs from the cat_cox_bayes_joint function, providing a list of attributes for further analysis or user inspection.

Here’s a breakdown of all attributes except the input parameters:

For more details, please refer to ?cat_cox_bayes_joint.

names(cat_cox_bayes_joint_model)

User can extract items mentioned above from cat_cox_bayes_joint object.

# The estimated tau parameter from the MCMC sampling `rstan::sampling`,
cat_cox_bayes_joint_model$tau

# The mean estimated coefficients from the Bayesian model
cat_cox_bayes_joint_model$coefficients

References

Li, Weihao, and Dongming Huang. 2023. “Yesian Inference on Cox Regression Models Using Catalytic Prior Distributions.” ArXiv. https://arxiv.org/abs/2312.01411.