Linear Biomarker Combination: Empirical Performance Optimization

Yijian Huang (yhuang5@emory.edu)

lincom implements linear combination methods for biomarkers via empirical performance optimization with respect to two performance metrics: (1) specificity at controlled sensitivity (or sensitivity at controlled specificity) (Huang and Sanda, 2022), and (2) weighted average of false positive rate and false negative rate. The second method is a variant of the maximum score estimator (Manski, 1975, 1985). In both cases, the algorithm of Huang and Sanda (2022) is used to provide a solution that balances between computational efficiency and quality.

Installation

lincom is available on CRAN:

install.packages("lincom")

‘MOSEK’ solver is used and needs to be installed; an academic license for ‘MOSEK’ is free.

Simulated dataset for illustration

library(lincom)
## simulate 3 biomarkers for 100 cases and 100 controls
n1 <- 100
n0 <- 100
mk <- rbind(matrix(rnorm(3*n1),ncol=3),matrix(rnorm(3*n0),ncol=3))
mk[1:n1,1] <- mk[1:n1,1]/sqrt(2)+1
mk[1:n1,2] <- mk[1:n1,2]*sqrt(2)+1

mk[1:n1,] and mk[(n1+1):(n1+n0),] contain the case and control data, respectively.

Empirical maximization of specificity at controlled sensitivity (or sensitivity at controlled specificity)

The following code performs empirical maximization of specificity at 95% sensitivity.

## The following two lines are commented out - require installation of 'MOSEK' to run
#lcom1 <- eum(mk, n1=n1, s0=0.95, grdpt=0)
#lcom1

Above, n1 is the case size, s0 is the control level, and grdpt specifies how initial value of the optimization is obtained (logistic regression if grdpt=0, and coarse grid search with grdpt grid points otherwise). Additional arguments include fixsens (fixing sensitivity if TRUE and specificity otherwise), and lbmdis (larger biomarker values is more associated with cases if TRUE and controls otherwise).

The outputs include the resulting combination coefficient (coef), maximum empirical value of the performance metric (hs), and the resulting threshold (threshold), along with their initial value counterparts (from logistic regression or coarse grid search).

Empirical minimization of weighted average of false positive rate and false negative rate

## default relative weight r=1.
## Require installation of 'MOSEK' to run
## The following two lines are commented out - require installation of 'MOSEK' to run
#lcom2 <- wmse(mk, n1=n1)
#lcom2

The inputs and outputs are similar to those of eum. However, the initial value here is obtained through logistic regression only.

With cohort design, setting r=n0/n1 leads to Manski’s original estimator.

References

Huang, Y. and Sanda, M. G. (2022). Linear biomarker combination for constrained classification. The Annals of Statistics 50, 2793–2815.

Manski, C. F. (1975). Maximum score estimation of the stochastic utility model of choice. Journal of Econometrics 3, 205–228.

Manski, C. F. (1985). Semiparametric analysis of discrete response. Asymptotic properties of the maximum score estimator. Journal of Econometrics 27, 313–333.